Project1 : Anantha Srinivas #14

Open
wants to merge 7 commits into
base: master
53 changes: 45 additions & 8 deletions README.md
@@ -1,11 +1,48 @@
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 1 - Flocking**
Project 1 Flocking
====================

**University of Pennsylvania, CIS 565: GPU Programming and Architecture**

**Anantha Srinivas**
[LinkedIn](https://www.linkedin.com/in/anantha-srinivas-00198958/), [Twitter](https://twitter.com/an2tha)

**Tested on:**
* Windows 10, i7-8700 @ 3.20GHz 16GB, GTX 1080 8097MB (Personal)
* Built for Visual Studio 2017 using the v140 toolkit

Implementation
---

This is a simulation of 100K boids running at 350 frames per second, using uniform grid flocking with semi-coherent memory access.

![](images/boid9.gif)

---

__Performance Graph__
![](images/performance_graph.PNG)

Experimental setup
---
* Tested on a GTX 1080, with compute capability of 6.1
* NVIDIA vertical sync was disabled (which actually increased the frame rate).
* Visualization was turned off and the program was run in Release mode.

Interesting Insights
---
* Even though semi-coherent memory access involves an extra step to make sure that the position and velocity data are contiguous, it runs faster than the regular uniform grid search method. This can probably be attributed to caching and faster memory access on the GPU.

* I noticed that uniform grid search is not the best choice across the whole range. Although it performs far better than brute-force searching, the frame rate does not scale linearly with the number of boids. The frame rate actually peaks somewhere near 10k boids (which is surprisingly faster than 5k boids), suggesting that this search method needs some level of saturation in the data.
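
The extra "make the data contiguous" step mentioned above can be sketched on the CPU as follows. This is a minimal illustration only, not the project's CUDA code: `Vec3`, `sortBoidsByCell`, and `gather` are hypothetical names, and a real GPU version would do the sort and the gather in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical 3-component vector used for positions and velocities.
struct Vec3 { float x, y, z; };

// Returns boid indices ordered by the grid cell each boid falls in.
// cellOfBoid[i] is the (assumed precomputed) cell index of boid i.
std::vector<int> sortBoidsByCell(const std::vector<int>& cellOfBoid) {
    std::vector<int> order(cellOfBoid.size());
    std::iota(order.begin(), order.end(), 0);  // 0, 1, 2, ...
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return cellOfBoid[a] < cellOfBoid[b]; });
    return order;
}

// The extra step: copy the data into cell order, so boids that share
// a cell end up next to each other in memory.
std::vector<Vec3> gather(const std::vector<Vec3>& data,
                         const std::vector<int>& order) {
    assert(data.size() == order.size());
    std::vector<Vec3> out(data.size());
    for (std::size_t i = 0; i < order.size(); ++i) {
        out[i] = data[order[i]];
    }
    return out;
}
```

After the gather, a neighbor search over one cell reads a contiguous range of the position and velocity arrays instead of following one index per boid, which is the likely source of the speedup described above.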

Further thoughts
---
* **For each implementation, how does changing the number of boids affect performance? Why do you think this is?** As a general rule, increasing the number of boids decreases the frame rate across all methods. However, there were some anomalies, as noted above: for the uniform grid and coherent memory access methods, the frame rate actually peaked at 10k boids. The frame rate generally decreases because the number of threads to be launched increases, which in turn increases the GPU load. As a rule of thumb, only around 60-80% of the GPU should be loaded to get the best performance.

* **For each implementation, how does changing the block count and block size affect performance? Why do you think this is?** This did not affect performance much. This may be because the default block size of 128 is much smaller than the number of boids. Also, a block is a logical construct, so its size should technically not matter much when all the threads are running the same instructions.

* **For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?** Yes, the coherent grid resulted in slightly better performance. This could be due to easier and faster memory access on the GPU, which can also cache the data more effectively because it is contiguous.

* **Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!** To a certain extent, the cell width does not matter. All the boids have to be covered in some iteration regardless.
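
For the 27 vs 8 cell question, a rough way to reason about the cost is to compare the volume of space each search covers. The sketch below assumes a setup this README does not state: 8 cells are checked when the cell width is twice the neighborhood radius, and 27 cells when the cell width equals the radius. The function names are hypothetical, and boids are assumed to be uniformly distributed.

```cpp
#include <cassert>

// Volume of the cube of cells searched per boid:
// (cells per axis * cell width)^3.
double searchVolume(int cellsPerAxis, double cellWidth) {
    assert(cellsPerAxis > 0 && cellWidth > 0.0);
    const double side = cellsPerAxis * cellWidth;
    return side * side * side;
}

// Expected number of candidate boids examined per boid,
// assuming a uniform density of boids per unit volume.
double expectedCandidates(double boidsPerUnitVolume,
                          int cellsPerAxis, double cellWidth) {
    return boidsPerUnitVolume * searchVolume(cellsPerAxis, cellWidth);
}
```

With a neighborhood radius of 1, `searchVolume(2, 2.0)` is 64 and `searchVolume(3, 1.0)` is 27. Under these assumptions the 27-cell search examines fewer candidate boids despite visiting more cells, so the cell count alone does not determine which variant is slower.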

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)

### (TODO: Your README)

Include screenshots, analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
Binary file added images/boid9.gif
Binary file added images/performance_graph.PNG
2 changes: 1 addition & 1 deletion src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES

cuda_add_library(src
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_61
)