Project 1: Zach Corse #11

Open · wants to merge 11 commits into master
64 changes: 58 additions & 6 deletions README.md
@@ -1,11 +1,63 @@
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 1 - Flocking**

* Zach Corse
* LinkedIn: https://www.linkedin.com/in/wzcorse/
* Personal Website: https://wzcorse.com
* Twitter: @ZachCorse
* Tested on: Windows 10, i7-6700HQ @ 2.60GHz 32GB, NVIDIA GeForce GTX 970M (personal computer)

## README

![gif](images/flocking.gif)

Introduction
------------

In this project I implement a standard flocking algorithm, but do so on the GPU. Flocking parallelizes well because each boid's trajectory change can be computed independently, given read access to shared position and velocity buffers.
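This per-boid independence maps naturally onto one CUDA thread per boid. Below is a minimal sketch of what such an update kernel might look like; names such as `kernUpdateVelocityNaive` are illustrative, not the project's exact code, and the rule math is elided.

```cuda
// Sketch only: one thread per boid; flocking-rule accumulation elided.
__global__ void kernUpdateVelocityNaive(int N, const glm::vec3 *pos,
                                        const glm::vec3 *vel,
                                        glm::vec3 *velNext) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) return;
  glm::vec3 dv(0.0f);
  for (int j = 0; j < N; ++j) {   // naive: scan all other boids
    if (j == i) continue;
    // accumulate cohesion, separation, and alignment terms from pos/vel here
  }
  velNext[i] = vel[i] + dv;       // write to a second buffer (ping-pong)
}
```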

A naive approach to flocking compares each fish or bird (a.k.a. boid) to the other N-1 boids in the simulation. However, only a boid's nearest neighbors influence its trajectory, so this simulation includes a uniform grid optimization in which boids are binned into cells, and each boid is compared only to boids in its neighboring cells. I implement two nearest-neighbor searches. The first sets the cell width to twice the boid's largest flocking radius, so that, in general, eight cells must be scanned for neighbors in each kernel. The second sets the cell width equal to that radius, so that, in general, 27 smaller cells are scanned.
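The binning step might look like the following sketch (again with illustrative names): each boid hashes its position to a flat 1D cell index, and a companion array records which boid produced each index.

```cuda
// Sketch: hash each boid's position to a flat grid-cell index.
__device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
  return x + y * gridResolution + z * gridResolution * gridResolution;
}

__global__ void kernComputeIndices(int N, float cellWidth, glm::vec3 gridMin,
                                   int gridResolution, const glm::vec3 *pos,
                                   int *gridIndices, int *boidIndices) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) return;
  glm::vec3 c = (pos[i] - gridMin) / cellWidth;  // cell coords; pos >= gridMin
  gridIndices[i] = gridIndex3Dto1D((int)c.x, (int)c.y, (int)c.z, gridResolution);
  boidIndices[i] = i;   // boid label, carried through the later sort
}
```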

I add an additional optimization to the uniform grid described above. After boids are binned into grid cells, the simulation sorts the grid-cell indices and boid labels together, so that a kernel indexing the sorted array can read a grid cell and, at the same position, the label of a boid in that cell. From the boid's label, the kernel then reads that boid's position and velocity from separate arrays. The additional optimization instead sorts the position and velocity arrays directly alongside the grid-cell indices, eliminating the intermediate boid-label lookup. Consider this a coherent search on top of a uniform grid.
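One way to realize both variants is with Thrust's key-value sort; this is a sketch under the naming above, not the project's exact code.

```cuda
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

// Uniform grid: group boid labels by grid-cell index.
void sortByCell(int N, int *dev_gridIndices, int *dev_boidIndices) {
  thrust::device_ptr<int> keys(dev_gridIndices);
  thrust::device_ptr<int> vals(dev_boidIndices);
  thrust::sort_by_key(keys, keys + N, vals);
}
// Coherent grid: afterwards, also gather pos/vel into cell order, e.g. with
// a copy kernel that writes posSorted[i] = pos[dev_boidIndices[i]].
```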


Performance as a Function of Boid Count
------------

The most basic question one can ask about this simulation's performance is how frames per second (FPS) scales with boid count (N). Presumably, the more boids in the simulation, the lower the FPS. This is true of the naive (No Grid) approach, but the coherent and uniform grid cases behave slightly differently than expected.

![graph1](images/PerformanceFPSasaFunctionofBoidCountVIS128TBB.png)

As the above graph indicates, there is a dip in performance at N = 5000, meaning that performance actually increases with boid count immediately afterwards. This may be a consequence of how memory is allocated on the GPU, but the dip does not appear to vary with GPU block size. Additionally, we see a marginal performance boost from the coherent grid over the uniform grid.

We might expect this performance boost based on the following argument: the uniform grid approach requires two reads from global memory, one to read the boid index array and another to access each boid's position and velocity. Moreover, the position and velocity reads are not contiguous in this case, since any boid's data might be accessed (boids are not grouped by index in memory). The coherent approach, by contrast, requires one read from global memory (position and velocity) plus an additional copy kernel, and that read covers contiguous memory. One fewer global-memory read, over contiguous memory, should account for the observed performance boost.
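Schematically, the two inner-loop access patterns differ as below (a sketch using the illustrative names from the earlier snippets):

```cuda
// Uniform grid: two dependent global reads; the second is scattered.
__device__ glm::vec3 readNeighborUniform(int j, const int *boidIndices,
                                         const glm::vec3 *pos) {
  int b = boidIndices[j];   // read 1: indirection through the label array
  return pos[b];            // read 2: scattered, since b can be any boid
}

// Coherent grid: pos was pre-sorted into cell order, so one contiguous read.
__device__ glm::vec3 readNeighborCoherent(int j, const glm::vec3 *posSorted) {
  return posSorted[j];
}
```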

Grid Type Comparison with and without Visualization
------------

As expected, turning off the simulation's visualization dramatically increases FPS. The coherent and uniform grid algorithms see nearly a 2X boost in performance, while the naive approach improves only marginally.

![graph2](images/GridFPSComparison128TBB.png)

Comparing Neighborhood Search Algorithms
------------

As noted above, I implemented two different neighbor-search algorithms. The first samples the nearest 8 cells (where the cell width is twice the radius of boid influence) and the second samples the nearest 27 cells (where the cell width equals that radius). My sampling technique for the former uses a fast approach in which the grid is over-indexed by a factor of two. These doubled indices, taken modulo 2 in each dimension, indicate the octant of its cell in which a boid resides, permitting easy computation of its nearest 8 neighboring cells.
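A sketch of that trick, as I understand it (illustrative names): index the grid at twice its resolution, and the parity of each coordinate then says which half of the true cell the boid occupies, and therefore in which direction its 8 neighbor cells lie.

```cuda
// Sketch: parity at doubled resolution picks the search direction per axis.
__device__ void neighborDirections(glm::vec3 p, glm::vec3 gridMin,
                                   float cellWidth, int *dx, int *dy, int *dz) {
  glm::vec3 h = (p - gridMin) / (cellWidth * 0.5f);  // 2x-resolution coords
  *dx = ((int)h.x % 2 == 0) ? -1 : 1;  // lower half of cell -> look left
  *dy = ((int)h.y % 2 == 0) ? -1 : 1;
  *dz = ((int)h.z % 2 == 0) ? -1 : 1;
  // The 8 cells to scan are the boid's own cell offset by every combination
  // of {0, *dx} x {0, *dy} x {0, *dz}.
}
```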

It would be difficult to guess which of these two neighbor searches yields a higher FPS. While 8 < 27, the larger cells may each contain more boids. The graph below indicates a slight performance boost from using larger cells but fewer of them.

![graph3](images/PerformanceFPSasaFunctionofCellWidthVIS5KBoids128TBB.png)

Evaluating the Effects of Blocksize (TBB)
------------

Varying block size (threads per block) does not appear to significantly alter performance (for N = 5000). Any variation seen below is within the normal variation observed between runs.

![graph4](images/FPSvsTBBVIS5000Boids.png)
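This flat curve is plausible because block size enters only through the launch configuration: as long as there are enough blocks to cover N boids and keep the SMs occupied, the tested sizes behave alike. A sketch, with names as in the earlier snippets:

```cuda
void stepNaive(int N, int blockSize, const glm::vec3 *dev_pos,
               const glm::vec3 *dev_vel, glm::vec3 *dev_velNext) {
  dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);  // ceil(N / blockSize)
  kernUpdateVelocityNaive<<<fullBlocksPerGrid, blockSize>>>(N, dev_pos,
                                                            dev_vel, dev_velNext);
}
```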

Results & Flocking Behavior
------------

As shown below, boids appear to flock like real fish, birds, and bats!

![screenshot](images/screenshot.PNG)
Binary file added images/FPSvsTBBVIS5000Boids.png
Binary file added images/Grid FPSComparison128TBB.png
Binary file added images/GridFPSComparison128TBB.png
Binary file added images/flocking.gif
Binary file added images/screenshot.PNG
2 changes: 1 addition & 1 deletion src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES

cuda_add_library(src
${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_50
)