diff --git a/README.md b/README.md
index d63a6a1..af2ad12 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,63 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Zach Corse
+  * LinkedIn: https://www.linkedin.com/in/wzcorse/
+  * Personal Website: https://wzcorse.com
+  * Twitter: @ZachCorse
+* Tested on: Windows 10, i7-6700HQ @ 2.60GHz 32GB, NVIDIA GeForce GTX 970M (personal computer)
 
-### (TODO: Your README)
+## README
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+![gif](images/flocking.gif)
+
+Introduction
+------------
+
+In this project I implement a standard flocking algorithm, but on the GPU. Flocking parallelizes well because each boid's trajectory change can be computed independently from shared position and velocity buffers.
+
+A naive approach to flocking compares each fish or bird (a.k.a. boid) against the other N-1 boids in the simulation. However, only a boid's nearest neighbors influence its trajectory, so this simulation includes a uniform grid optimization in which boids are binned into cells. Each boid is then compared only to the boids inside its nearest neighboring cells. I implement two nearest-neighbor searches. The first assumes that each cell's width is twice the boid's largest flocking (rule) distance, so in general eight cells are scanned for neighbors in each kernel. The second assumes that each cell's width equals the largest flocking distance, so in general 27 smaller cells are scanned for neighbors in each kernel.
+
+I add a further optimization on top of the uniform grid. After boids are binned into grid cells, the simulation sorts the grid-cell indices and boid labels together, so that querying an index of the sorted array in a kernel yields both a grid cell and the boid that occupies it; the boid's label then points into separate position and velocity arrays. The additional optimization instead sorts the boid positions and velocities directly as it sorts the grid-cell indices, eliminating the intermediary boid-label array. Consider this a coherent search on top of a uniform grid (a minimal sketch of this reshuffle appears below, after the first performance graph).
+
+
+Performance as a Function of Boid Count
+------------
+
+The most basic question to ask about this simulation's performance is how frames per second (FPS) scales with boid count (N). Presumably, the more boids one adds to the simulation, the lower the FPS one would expect. This is true of the naive (No Grid) approach, but the coherent and uniform grid cases behave slightly differently than expected.
+
+![graph1](images/PerformanceFPSasaFunctionofBoidCountVIS128TBB.png)
+
+As the graph above indicates, there is a dip in performance at N=5000, meaning that performance actually increases with boid count immediately afterwards. This may be a consequence of how memory is allocated on the GPU, but the pattern doesn't appear to vary with block size. Additionally, we see a marginal boost in performance using a coherent grid over a uniform grid.
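+
+To make the memory-access argument that follows concrete, here is a minimal sketch of that coherent reshuffle. The kernel name `kernGatherCoherent` is illustrative only; the version actually used in this project is `kernShuffleParticleData` in `src/kernel.cu`. After the grid-cell sort, each thread simply copies its boid's position and velocity into new, cell-ordered buffers:
+
+```cuda
+__global__ void kernGatherCoherent(int N, const int *particleArrayIndices,
+                                   const glm::vec3 *pos, const glm::vec3 *vel,
+                                   glm::vec3 *posCoherent, glm::vec3 *velCoherent) {
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
+  // particleArrayIndices has already been sorted by grid-cell index, so writing
+  // to slot `index` lays the boid data out contiguously, cell by cell.
+  int boid = particleArrayIndices[index];
+  posCoherent[index] = pos[boid];
+  velCoherent[index] = vel[boid];
+}
+```
+
+The neighbor-search kernel can then walk each cell's start/end range and read positions and velocities in order, rather than chasing the index array first.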
+
+We might expect this performance boost based on the following argument: the uniform grid approach requires two reads from global memory, one to read a boid's label from the boid-index array, and another to access that boid's position and velocity. Additionally, the position and velocity reads will not be contiguous in this case, since we might be accessing any boid's data (i.e., boids won't necessarily be grouped by index). The coherent approach, on the other hand, requires one read from global memory (position and velocity) plus an additional copy kernel, and that read is over contiguous memory. One fewer global-memory read, performed over contiguous memory, should account for the observed performance boost.
+
+Grid Type Comparison with and without Visualization
+------------
+
+As expected, turning off the visualization dramatically increases simulation FPS. The coherent and uniform grid algorithms see a nearly 2X boost in performance, while the naive approach improves only marginally.
+
+![graph2](images/GridFPSComparison128TBB.png)
+
+Comparing Neighborhood Search Algorithms
+------------
+
+As noted above, I implemented two different neighbor-search algorithms. The first samples the nearest 8 cells (where cell width is twice the boid's radius of influence) and the second samples the nearest 27 cells (where cell width equals the radius of influence). My sampling technique for the former uses a fast approach in which cell coordinates are over-indexed by a factor of two. Each over-indexed coordinate, taken modulo 2, indicates the octant of the cell in which the boid resides, which makes computing the 8 nearest cells straightforward (a short sketch of this indexing appears at the end of this README).
+
+It would be difficult to guess which of these two neighbor searches yields a higher FPS: while 8 < 27, the larger cells may contain more boids. The graph below indicates a slight performance boost from using fewer, larger cells.
+
+![graph3](images/PerformanceFPSasaFunctionofCellWidthVIS5KBoids128TBB.png)
+
+Evaluating the Effects of Blocksize (TBB)
+------------
+
+Varying block size (threads per block) does not appear to significantly alter performance (for N=5000). Any variation seen below is within the normal run-to-run variation.
+
+![graph4](images/FPSvsTBBVIS5000Boids.png)
+
+Results & Flocking Behavior
+------------
+
+As the screenshot at the end of this README shows, boids appear to flock like real fish, birds, and bats!
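+
+For reference, here is a minimal sketch of the octant trick described in the neighborhood-search comparison above. The helper name `pickEightNeighborCells` is purely illustrative; the full logic lives inline in `kernUpdateVelNeighborSearchScattered` and `kernUpdateVelNeighborSearchCoherent` in `src/kernel.cu`, where `gridIndex3Dto1D` returns -1 for out-of-range cells so the caller can skip them.
+
+```cuda
+// cellSpacePos = (worldPos - gridMin) * inverseCellWidth, so the integer part
+// is the cell coordinate and the fractional part tells us where in the cell
+// the boid sits.
+__device__ void pickEightNeighborCells(glm::vec3 cellSpacePos, int gridResolution,
+                                       int neighbors[8]) {
+  int ci = (int)cellSpacePos.x;
+  int cj = (int)cellSpacePos.y;
+  int ck = (int)cellSpacePos.z;
+  // Over-index by 2: the parity of each doubled coordinate gives the octant of
+  // the cell the boid occupies, i.e. which side to extend the search toward.
+  int di = ((int)(cellSpacePos.x * 2.0f) % 2 == 0) ? -1 : 1;
+  int dj = ((int)(cellSpacePos.y * 2.0f) % 2 == 0) ? -1 : 1;
+  int dk = ((int)(cellSpacePos.z * 2.0f) % 2 == 0) ? -1 : 1;
+  neighbors[0] = gridIndex3Dto1D(ci,      cj,      ck,      gridResolution);
+  neighbors[1] = gridIndex3Dto1D(ci + di, cj,      ck,      gridResolution);
+  neighbors[2] = gridIndex3Dto1D(ci,      cj + dj, ck,      gridResolution);
+  neighbors[3] = gridIndex3Dto1D(ci + di, cj + dj, ck,      gridResolution);
+  neighbors[4] = gridIndex3Dto1D(ci,      cj,      ck + dk, gridResolution);
+  neighbors[5] = gridIndex3Dto1D(ci + di, cj,      ck + dk, gridResolution);
+  neighbors[6] = gridIndex3Dto1D(ci,      cj + dj, ck + dk, gridResolution);
+  neighbors[7] = gridIndex3Dto1D(ci + di, cj + dj, ck + dk, gridResolution);
+}
+```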
+
+![screenshot](images/screenshot.PNG)
diff --git a/images/FPSvsTBBVIS5000Boids.png b/images/FPSvsTBBVIS5000Boids.png
new file mode 100644
index 0000000..7108489
Binary files /dev/null and b/images/FPSvsTBBVIS5000Boids.png differ
diff --git a/images/Grid FPSComparison128TBB.png b/images/Grid FPSComparison128TBB.png
new file mode 100644
index 0000000..f31c9dd
Binary files /dev/null and b/images/Grid FPSComparison128TBB.png differ
diff --git a/images/GridFPSComparison128TBB.png b/images/GridFPSComparison128TBB.png
new file mode 100644
index 0000000..f31c9dd
Binary files /dev/null and b/images/GridFPSComparison128TBB.png differ
diff --git a/images/PerformanceFPSasaFunctionofBoidCountVIS128TBB.png b/images/PerformanceFPSasaFunctionofBoidCountVIS128TBB.png
new file mode 100644
index 0000000..d3f6507
Binary files /dev/null and b/images/PerformanceFPSasaFunctionofBoidCountVIS128TBB.png differ
diff --git a/images/PerformanceFPSasaFunctionofCellWidthVIS5KBoids128TBB.png b/images/PerformanceFPSasaFunctionofCellWidthVIS5KBoids128TBB.png
new file mode 100644
index 0000000..0ff5db8
Binary files /dev/null and b/images/PerformanceFPSasaFunctionofCellWidthVIS5KBoids128TBB.png differ
diff --git a/images/flocking.gif b/images/flocking.gif
new file mode 100644
index 0000000..d1a40b4
Binary files /dev/null and b/images/flocking.gif differ
diff --git a/images/screenshot.PNG b/images/screenshot.PNG
new file mode 100644
index 0000000..5adc65e
Binary files /dev/null and b/images/screenshot.PNG differ
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index fdd636d..dff0113 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES
 
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_50
     )
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..9a3a109 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -6,6 +6,11 @@
 #include "utilityCore.hpp"
 #include "kernel.h"
 
+#define cellWidthTwoX
+#ifndef cellWidthTwoX
+#define cellWidthOneX
+#endif
+
 // LOOK-2.1 potentially useful for doing grid-based neighbor search
 #ifndef imax
 #define imax( a, b ) ( ((a) > (b)) ? (a) : (b) )
@@ -76,6 +81,7 @@ glm::vec3 *dev_vel2;
 // For efficient sorting and the uniform grid. These should always be parallel.
 int *dev_particleArrayIndices; // What index in dev_pos and dev_velX represents this particle?
 int *dev_particleGridIndices; // What grid cell is this particle in?
+// needed for use with thrust
 thrust::device_ptr<int> dev_thrust_particleArrayIndices;
 thrust::device_ptr<int> dev_thrust_particleGridIndices;
 
@@ -83,6 +89,9 @@ thrust::device_ptr<int> dev_thrust_particleGridIndices;
 int *dev_gridCellStartIndices; // What part of dev_particleArrayIndices belongs
 int *dev_gridCellEndIndices;   // to this cell?
 
+glm::vec3 *dev_shufflePos; // shuffled position values according to array index
+glm::vec3 *dev_shuffleVel; // shuffled velocity values according to array index
+
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
 
@@ -138,6 +147,7 @@ __global__ void kernGenerateRandomPosArray(int time, int N, glm::vec3 * arr, flo
 */
 void Boids::initSimulation(int N) {
   numObjects = N;
+  dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
 
   // LOOK-1.2 - This is basic CUDA memory management and error checking.
@@ -158,6 +168,9 @@ void Boids::initSimulation(int N) {
 
   // LOOK-2.1 computing grid params
   gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+#ifdef cellWidthOneX
+  gridCellWidth /= 2.0;
+#endif
   int halfSideCount = (int)(scene_scale / gridCellWidth) + 1;
   gridSideCount = 2 * halfSideCount;
 
@@ -169,10 +182,32 @@ void Boids::initSimulation(int N) {
   gridMinimum.z -= halfGridWidth;
 
   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  //cudaDeviceSynchronize();
+
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  cudaMalloc((void**)&dev_shufflePos, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_shufflePos failed!");
+
+  cudaMalloc((void**)&dev_shuffleVel, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_shuffleVel failed!");
+
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+
+  cudaDeviceSynchronize();
 }
-
 /******************
 * copyBoidsToVBO *
 ******************/
@@ -229,11 +264,49 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
 * Compute the new velocity on the body with index `iSelf` due to the `N` boids
 * in the `pos` and `vel` arrays.
*/ + __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) { - // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves - // Rule 2: boids try to stay a distance d away from each other - // Rule 3: boids try to match the speed of surrounding boids - return glm::vec3(0.0f, 0.0f, 0.0f); + glm::vec3 retVel = glm::vec3(0, 0, 0); // returned velocity change + glm::vec3 com = glm::vec3(0, 0, 0); // center of mass + glm::vec3 c = glm::vec3(0, 0, 0); // speed change to avoid collision + glm::vec3 perceivedVel = glm::vec3(0, 0, 0); // perceived local velocity + int neighbors_1 = 0; // number of neighbors used in rule 1 test + int neighbors_3 = 0; // number of neighbors used in rule 3 test + + for (int i = 0; i < N; ++i) + { + if (i != iSelf) { + float relativeDist = glm::distance(pos[i], pos[iSelf]); + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + if (relativeDist < rule1Distance) { + neighbors_1++; + com += pos[i]; + } + // Rule 2: boids try to stay a distance d away from each other + if (relativeDist < rule2Distance) { + c -= (pos[i] - pos[iSelf]); + } + // Rule 3: boids try to match the speed of surrounding boids + if (relativeDist < rule3Distance) { + neighbors_3++; + perceivedVel += vel[i]; + } + } + } + // Rule 1 + if (neighbors_1 > 0) { + com /= neighbors_1; + retVel += (com - pos[iSelf]) * rule1Scale; + } + // Rule 2 + retVel += c * rule2Scale; + // Rule 3 + if (neighbors_3 > 0) { + perceivedVel /= neighbors_3; + retVel += perceivedVel * rule3Scale; + } + + return retVel; } /** @@ -242,9 +315,18 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po */ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { - // Compute a new velocity based on pos and vel1 - // Clamp the speed // Record the new velocity into vel2. Question: why NOT vel1? + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + // Compute a new velocity based on pos and vel1 + glm::vec3 newVel = computeVelocityChange(N, index, pos, vel1); + vel2[index] = vel1[index] + newVel; + // Clamp the speed + if (glm::length(vel2[index]) > maxSpeed) { + vel2[index] = maxSpeed * glm::normalize(vel2[index]); + } } /** @@ -252,11 +334,11 @@ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, * For each of the `N` bodies, update its position based on its current velocity. */ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) { - // Update position by velocity int index = threadIdx.x + (blockIdx.x * blockDim.x); if (index >= N) { return; } + // Update position by velocity glm::vec3 thisPos = pos[index]; thisPos += vel[index] * dt; @@ -279,16 +361,30 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) { // for(y) // for(z)? Or some other order? __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) { - return x + y * gridResolution + z * gridResolution * gridResolution; + if (x < 0 || x >= gridResolution || y < 0 || y >= gridResolution || z < 0 || z >= gridResolution) { + return -1; + } + return x + y * gridResolution + z * gridResolution * gridResolution; } __global__ void kernComputeIndices(int N, int gridResolution, glm::vec3 gridMin, float inverseCellWidth, glm::vec3 *pos, int *indices, int *gridIndices) { - // TODO-2.1 - // - Label each boid with the index of its grid cell. 
- // - Set up a parallel array of integer indices as pointers to the actual - // boid data in pos and vel1/vel2 + // TODO-2.1 + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + // Label each boid with the index of its grid cell. + glm::vec3 boidPos = (pos[index] - gridMin) * inverseCellWidth; + int cell_i = (int)(boidPos[0]); + int cell_j = (int)(boidPos[1]); + int cell_k = (int)(boidPos[2]); + int gridIndex1D = gridIndex3Dto1D(cell_i, cell_j, cell_k, gridResolution); + gridIndices[index] = gridIndex1D; + + // Set up a parallel array of integer indices as pointers to the actual + indices[index] = index; } // LOOK-2.1 Consider how this could be useful for indicating that a cell @@ -306,6 +402,21 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, // Identify the start point of each cell in the gridIndices array. // This is basically a parallel unrolling of a loop that goes // "this index doesn't match the one before it, must be a new cell!" + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + if (index == 0) { + gridCellStartIndices[particleGridIndices[index]] = index; + return; + } + if (index == N - 1) { + gridCellEndIndices[particleGridIndices[index]] = index; + } + if (particleGridIndices[index] != particleGridIndices[index - 1]) { + gridCellStartIndices[particleGridIndices[index]] = index; + gridCellEndIndices[particleGridIndices[index - 1]] = index - 1; + } } __global__ void kernUpdateVelNeighborSearchScattered( @@ -317,11 +428,122 @@ __global__ void kernUpdateVelNeighborSearchScattered( // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce // the number of boids that need to be checked. // - Identify the grid cell that this particle is in - // - Identify which cells may contain neighbors. This isn't always 8. - // - For each cell, read the start/end indices in the boid pointer array. - // - Access each boid in the cell and compute velocity change from - // the boids rules, if this boid is within the neighborhood distance. - // - Clamp the speed change before putting the new speed in vel2 + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + // Label each boid with the index of its grid cell. + glm::vec3 boidCell = (pos[particleArrayIndices[index]] - gridMin) * inverseCellWidth; + int cell_i = (int)(boidCell[0]); + int cell_j = (int)(boidCell[1]); + int cell_k = (int)(boidCell[2]); + int boidCellIndex = gridIndex3Dto1D(cell_i, cell_j, cell_k, gridResolution); + + // Identify which cells may contain neighbors. This isn't always 8. +#ifdef cellWidthTwoX + int cell_2i = (int)(boidCell[0] * 2.0); + int cell_2j = (int)(boidCell[1] * 2.0); + int cell_2k = (int)(boidCell[2] * 2.0); + + cell_2i = cell_2i % 2; + cell_2j = cell_2j % 2; + cell_2k = cell_2k % 2; + + int left_right = cell_2i == 0 ? -1 : 1; + left_right += cell_i; + int up_down = cell_2j == 0 ? -1 : 1; + up_down += cell_j; + int in_out = cell_2k == 0 ? 
-1 : 1; + in_out += cell_k; + + const int M = 8; + + int neighbors[M]; + // arranged such that access is contiguous + neighbors[0] = gridIndex3Dto1D(cell_i, cell_j, cell_k, gridResolution); + neighbors[1] = gridIndex3Dto1D(left_right, cell_j, cell_k, gridResolution); + neighbors[2] = gridIndex3Dto1D(cell_i, up_down, cell_k, gridResolution); + neighbors[3] = gridIndex3Dto1D(left_right, up_down, cell_k, gridResolution); + neighbors[4] = gridIndex3Dto1D(cell_i, cell_j, in_out, gridResolution); + neighbors[5] = gridIndex3Dto1D(left_right, cell_j, in_out, gridResolution); + neighbors[6] = gridIndex3Dto1D(cell_i, up_down, in_out, gridResolution); + neighbors[7] = gridIndex3Dto1D(left_right, up_down, in_out, gridResolution); +#endif + +#ifdef cellWidthOneX + const int M = 27; + + int neighbors[M]; + int count = 0; + for (int k = -1; k <= 1; ++k) { + for (int j = -1; j <= 1; ++j) { + for (int i = -1; i <= 1; ++i) { + int cell_x = cell_i + i; + int cell_y = cell_j + j; + int cell_z = cell_k + k; + neighbors[count] = gridIndex3Dto1D(cell_x, cell_y, cell_z, gridResolution); + count++; + } + } + } +#endif + + glm::vec3 deltaV = glm::vec3(0, 0, 0); // cumulative velocity change + glm::vec3 com = glm::vec3(0, 0, 0); // center of mass + glm::vec3 c = glm::vec3(0, 0, 0); // speed change to avoid collision + glm::vec3 perceivedVel = glm::vec3(0, 0, 0); // perceived local velocity + int neighbors_1 = 0; // number of neighbors used in rule 1 test + int neighbors_3 = 0; // number of neighbors used in rule 3 test + + for (int i = 0; i < M; ++i) { + // For each cell, read the start/end indices in the boid pointer array. + if (neighbors[i] == -1) { + continue; + } + int startIdx = gridCellStartIndices[neighbors[i]]; + int endIdx = gridCellEndIndices[neighbors[i]]; + if (startIdx == -1) { + continue; + } + for (int j = startIdx; j <= endIdx; ++j) { + // Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance. + int boidIdx = particleArrayIndices[j]; + if (boidIdx != particleArrayIndices[index]) { + float relativeDist = glm::distance(pos[boidIdx], pos[particleArrayIndices[index]]); + if (relativeDist < rule1Distance) { + neighbors_1++; + com += pos[boidIdx]; + } + // Rule 2: boids try to stay a distance d away from each other + if (relativeDist < rule2Distance) { + c -= (pos[boidIdx] - pos[particleArrayIndices[index]]); + } + // Rule 3: boids try to match the speed of surrounding boids + if (relativeDist < rule3Distance) { + neighbors_3++; + perceivedVel += vel1[boidIdx]; + } + } + } + } + // Rule 1 + if (neighbors_1 > 0) { + com /= neighbors_1; + deltaV += (com - pos[particleArrayIndices[index]]) * rule1Scale; + } + // Rule 2 + deltaV += c * rule2Scale; + // Rule 3 + if (neighbors_3 > 0) { + perceivedVel /= neighbors_3; + deltaV += perceivedVel * rule3Scale; + } + // Clamp the speed change before putting the new speed in vel2 ?? CHECK + vel2[particleArrayIndices[index]] = vel1[particleArrayIndices[index]] + deltaV; + if (glm::length(vel2[particleArrayIndices[index]]) > maxSpeed) { + vel2[particleArrayIndices[index]] = maxSpeed * glm::normalize(vel2[particleArrayIndices[index]]); + } } __global__ void kernUpdateVelNeighborSearchCoherent( @@ -334,54 +556,226 @@ __global__ void kernUpdateVelNeighborSearchCoherent( // This should expect gridCellStartIndices and gridCellEndIndices to refer // directly to pos and vel1. // - Identify the grid cell that this particle is in - // - Identify which cells may contain neighbors. This isn't always 8. 
- // - For each cell, read the start/end indices in the boid pointer array. - // DIFFERENCE: For best results, consider what order the cells should be - // checked in to maximize the memory benefits of reordering the boids data. - // - Access each boid in the cell and compute velocity change from - // the boids rules, if this boid is within the neighborhood distance. - // - Clamp the speed change before putting the new speed in vel2 + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + // Label each boid with the index of its grid cell. + glm::vec3 boidCell = (pos[index] - gridMin) * inverseCellWidth; + int cell_i = (int)(boidCell[0]); + int cell_j = (int)(boidCell[1]); + int cell_k = (int)(boidCell[2]); + int boidCellIndex = gridIndex3Dto1D(cell_i, cell_j, cell_k, gridResolution); + + // Identify which cells may contain neighbors. This isn't always 8. +#ifdef cellWidthTwoX + int cell_2i = (int)(boidCell[0] * 2.0); + int cell_2j = (int)(boidCell[1] * 2.0); + int cell_2k = (int)(boidCell[2] * 2.0); + + cell_2i = cell_2i % 2; + cell_2j = cell_2j % 2; + cell_2k = cell_2k % 2; + + int left_right = cell_2i == 0 ? -1 : 1; + left_right += cell_i; + int up_down = cell_2j == 0 ? -1 : 1; + up_down += cell_j; + int in_out = cell_2k == 0 ? -1 : 1; + in_out += cell_k; + + const int M = 8; + + int neighbors[M]; + // arranged such that access is contiguous + neighbors[0] = gridIndex3Dto1D(cell_i, cell_j, cell_k, gridResolution); + neighbors[1] = gridIndex3Dto1D(left_right, cell_j, cell_k, gridResolution); + neighbors[2] = gridIndex3Dto1D(cell_i, up_down, cell_k, gridResolution); + neighbors[3] = gridIndex3Dto1D(left_right, up_down, cell_k, gridResolution); + neighbors[4] = gridIndex3Dto1D(cell_i, cell_j, in_out, gridResolution); + neighbors[5] = gridIndex3Dto1D(left_right, cell_j, in_out, gridResolution); + neighbors[6] = gridIndex3Dto1D(cell_i, up_down, in_out, gridResolution); + neighbors[7] = gridIndex3Dto1D(left_right, up_down, in_out, gridResolution); +#endif + +#ifdef cellWidthOneX + const int M = 27; + + int neighbors[M]; + int count = 0; + for (int k = -1; k <= 1; ++k) { + for (int j = -1; j <= 1; ++j) { + for (int i = -1; i <= 1; ++i) { + int cell_x = cell_i + i; + int cell_y = cell_j + j; + int cell_z = cell_k + k; + neighbors[count] = gridIndex3Dto1D(cell_x, cell_y, cell_z, gridResolution); + count++; + } + } + } +#endif + + glm::vec3 deltaV = glm::vec3(0, 0, 0); // cumulative velocity change + glm::vec3 com = glm::vec3(0, 0, 0); // center of mass + glm::vec3 c = glm::vec3(0, 0, 0); // speed change to avoid collision + glm::vec3 perceivedVel = glm::vec3(0, 0, 0); // perceived local velocity + int neighbors_1 = 0; // number of neighbors used in rule 1 test + int neighbors_3 = 0; // number of neighbors used in rule 3 test + + for (int i = 0; i < M; ++i) { + // For each cell, read the start/end indices in the boid pointer array. + if (neighbors[i] == -1) { + continue; + } + int startIdx = gridCellStartIndices[neighbors[i]]; + int endIdx = gridCellEndIndices[neighbors[i]]; + if (startIdx == -1) { + continue; + } + for (int j = startIdx; j <= endIdx; ++j) { + // Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance. 
+      if (j != index) {
+        float relativeDist = glm::distance(pos[j], pos[index]);
+        if (relativeDist < rule1Distance) {
+          neighbors_1++;
+          com += pos[j];
+        }
+        // Rule 2: boids try to stay a distance d away from each other
+        if (relativeDist < rule2Distance) {
+          c -= (pos[j] - pos[index]);
+        }
+        // Rule 3: boids try to match the speed of surrounding boids
+        if (relativeDist < rule3Distance) {
+          neighbors_3++;
+          perceivedVel += vel1[j];
+        }
+      }
+    }
+  }
+  // Rule 1
+  if (neighbors_1 > 0) {
+    com /= neighbors_1;
+    deltaV += (com - pos[index]) * rule1Scale;
+  }
+  // Rule 2
+  deltaV += c * rule2Scale;
+  // Rule 3
+  if (neighbors_3 > 0) {
+    perceivedVel /= neighbors_3;
+    deltaV += perceivedVel * rule3Scale;
+  }
+  // Clamp the speed before writing the new velocity into vel2
+  vel2[index] = vel1[index] + deltaV;
+  if (glm::length(vel2[index]) > maxSpeed) {
+    vel2[index] = maxSpeed * glm::normalize(vel2[index]);
+  }
+}
+
+__global__ void kernShuffleParticleData(int N, int *particleArrayIndices,
+  glm::vec3 *shuffledPos, glm::vec3 *shuffledVel,
+  glm::vec3 *pos, glm::vec3 *vel) {
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
+  shuffledPos[index] = pos[particleArrayIndices[index]];
+  shuffledVel[index] = vel[particleArrayIndices[index]];
 }
 
 /**
 * Step the entire N-body simulation by `dt` seconds.
 */
 void Boids::stepSimulationNaive(float dt) {
+
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
   // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
+  kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+
   // TODO-1.2 ping-pong the velocity buffers
+  glm::vec3 *temp = dev_vel2;
+  dev_vel2 = dev_vel1;
+  dev_vel1 = temp;
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  // start/end buffers hold one entry per grid cell, so size their reset by gridCellCount
+  dim3 cellBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize);
   // TODO-2.1
   // Uniform Grid Neighbor search using Thrust sort.
   // In Parallel:
+  // reset start and end buffers
+  kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
   // - label each particle with its array index as well as its grid index.
   //   Use 2x width grids.
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount,
+    gridMinimum, gridInverseCellWidth,
+    dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
   // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
   //   are welcome to do a performance comparison.
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
   // - Naively unroll the loop for finding the start and end indices of each
   //   cell's data pointers in the array of boid indices
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices,
+    dev_gridCellStartIndices, dev_gridCellEndIndices);
   // - Perform velocity updates using neighbor search
+  kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>(
+    numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, gridCellWidth,
+    dev_gridCellStartIndices, dev_gridCellEndIndices,
+    dev_particleArrayIndices,
+    dev_pos, dev_vel1, dev_vel2);
   // - Update positions
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
   // - Ping-pong buffers as needed
+  glm::vec3 *temp = dev_vel2;
+  dev_vel2 = dev_vel1;
+  dev_vel1 = temp;
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  // start/end buffers hold one entry per grid cell, so size their reset by gridCellCount
+  dim3 cellBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize);
   // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
   // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
   // In Parallel:
+  // reset start and end buffers
+  kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
   // - Label each particle with its array index as well as its grid index.
   //   Use 2x width grids
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount,
+    gridMinimum, gridInverseCellWidth,
+    dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
   // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
   //   are welcome to do a performance comparison.
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
   // - Naively unroll the loop for finding the start and end indices of each
   //   cell's data pointers in the array of boid indices
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices,
+    dev_gridCellStartIndices, dev_gridCellEndIndices);
   // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
   //   the particle data in the simulation array.
   //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
+  kernShuffleParticleData<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleArrayIndices, dev_shufflePos, dev_shuffleVel, dev_pos, dev_vel1);
   // - Perform velocity updates using neighbor search
+  kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(
+    numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, gridCellWidth,
+    dev_gridCellStartIndices, dev_gridCellEndIndices,
+    dev_shufflePos, dev_shuffleVel, dev_vel2);
   // - Update positions
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_shufflePos, dev_vel2);
   // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+  glm::vec3 *temp = dev_vel2;
+  dev_vel2 = dev_vel1;
+  dev_vel1 = temp;
+
+  glm::vec3* temp2 = dev_shufflePos;
+  dev_shufflePos = dev_pos;
+  dev_pos = temp2;
 }
 
 void Boids::endSimulation() {
@@ -390,6 +784,12 @@ void Boids::endSimulation() {
   cudaFree(dev_pos);
 
   // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+ cudaFree(dev_particleArrayIndices); + cudaFree(dev_particleGridIndices); + cudaFree(dev_gridCellStartIndices); + cudaFree(dev_gridCellEndIndices); + cudaFree(dev_shufflePos); + cudaFree(dev_shuffleVel); } void Boids::unitTest() { diff --git a/src/main.cpp b/src/main.cpp index b82c8c6..6ad8383 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -14,11 +14,11 @@ // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID #define VISUALIZE 1 -#define UNIFORM_GRID 0 -#define COHERENT_GRID 0 +#define UNIFORM_GRID 1 +#define COHERENT_GRID 1 // LOOK-1.2 - change this to adjust particle count in the simulation -const int N_FOR_VIS = 5000; +const int N_FOR_VIS = 10000; const float DT = 0.2f; /** @@ -28,8 +28,30 @@ int main(int argc, char* argv[]) { projectName = "565 CUDA Intro: Boids"; if (init(argc, argv)) { + count = 0; + FPStotal = 0; + mainLoop(); Boids::endSimulation(); + + outputFile.open("FPS_Data.txt", std::ios::out | std::ios::app); + + std::string test = ""; + if (VISUALIZE) + test += "VIS+"; + else + test += "NO_VIS+"; + if (UNIFORM_GRID && COHERENT_GRID) + test += "COHERENT_GRID"; + else if (UNIFORM_GRID) + test += "UNIFORM_GRID"; + else + test += "NO_GRID"; + + outputFile << test << "+" << N_FOR_VIS << "+" << 128 << ": "; + outputFile << FPStotal / (count-1) << std::endl; + + outputFile.close(); return 0; } else { return 1; @@ -230,6 +252,12 @@ void initShaders(GLuint * program) { fps = frame / (time - timebase); timebase = time; frame = 0; + + if (count == 0) {} + else { + FPStotal += fps; + } + count++; } runCUDA(); diff --git a/src/main.hpp b/src/main.hpp index 88e9df7..16776d6 100644 --- a/src/main.hpp +++ b/src/main.hpp @@ -78,3 +78,10 @@ void runCUDA(); bool init(int argc, char **argv); void initVAO(); void initShaders(GLuint *program); + +//==================================== +// output +//==================================== +std::ofstream outputFile; +int count; +float FPStotal; \ No newline at end of file