-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nearest neighbor scaling #41
Comments
Thank you for the question. ** [WARNING] increase the middle gap to 10 percent!
** FLOPS and MOPS ** Number tasks ** Slow down in performance |
Thanks for the response! I didn't mention earlier, but the problem still exists on larger data and with weak scaling. I got to run some scaling experiments today and here is what I'm seeing. As a note all experiments were run on the rebel nodes (options requested -N 2) at ICES and for 10 iterations. Strong scaling
Weak scaling
I don't think this should be expected behavior, but I could be wrong. I ran these tests the same way (using the second half of distributed_nearest_neighbor.cpp). Re the gap warnings, I thought that perhaps they should increase linearly with the number of processes. Is it expected that the increase exceeds that? |
Interesting. I will take a look at this case over the weekend. I am not able to use ICES's machine, but I will try to do this on Stampede2. Regarding the warning, I will double check if the behavior is correct. |
Thanks for the response! I was able to take at the timings on Stampede2 -- I don't see a scaling issue there at all. The code runs really well. So there must be something else holding back our mpi performance on ICES machines. I recreated the same runs as above and found the following: Strong scaling
Weak scaling
So it seems no need to worry about the scaling, but thanks again for taking the time to respond and check it out. The only remaining confusing thing I see is the warnings -- they only appear when I use multiple mpi tasks. In case that helps track down the behavior. |
That's great. There is indeed a bug in distributed tree partition. I have fixed it and it will not emit "WARNING" unless there are quite some data points with the same coordinates (or the same distances). The fix will later be propagated to the master branch. |
We have some questions about running nearest neighbors — specifically the distributed_nearest_neighbors.cpp example. We are focusing on the DistKernelMatrix run (second half of the script), with the following parameters
n = 5000
d = 3
k = 64
As we increase the number of mpi tasks, we notice strange behavior. Most importantly, the code slows down significantly when more tasks are introduced. By the eye test, this is clear, but the reported flops and mops after each neighbor iteration tell the same story
nprocs = 1
[ RT] 31 [normal] 0 [listen] 0 [nested] 3.000E+04 flops 3.000E+04 mops
[ RT] 16 [normal] 0 [listen] 0 [nested] 6.250E+06 flops 1.250E+07 mops
nprocs = 2
[ RT] 32 [normal] 0 [listen] 0 [nested] 9.000E+04 flops 9.000E+04 mops
[ RT] 16 [normal] 0 [listen] 0 [nested] 6.250E+06 flops 1.250E+07 mops
nprocs = 4
[ RT] 36 [normal] 0 [listen] 0 [nested] 2.100E+05 flops 2.100E+05 mops
[ RT] 16 [normal] 0 [listen] 0 [nested] 6.250E+06 flops 1.250E+07 mops
nprocs = 8
[ RT] 48 [normal] 0 [listen] 0 [nested] 2.400E+05 flops 2.400E+05 mops
[ RT] 16 [normal] 0 [listen] 0 [nested] 6.250E+06 flops 1.250E+07 mops
This leads to question 1: Should the number of flops grow for the same problem size when increasing tasks, or is there a problem here?
Our hypothesis is that it has something to do with the splitter warning
[WARNING] increase the middle gap to 10 percent!
which are displayed more and more frequently as the number of processors increases. We outline that below, showing gap warnings/nn iteration.
nprocs : 1 warnings: 0
nprocs : 2 warnings: 2
nprocs : 4 warnings: 8
nprocs : 8 warnings: 24
Unless we are misunderstanding the algorithm, the warning is displayed when the there are multiple points that project to the median under that split. Our question is twofold: Why would this increase with the number of tasks? And is it fixable?
For the record, this happens even if care is taken so each processes’ data is randomized with a unique seed so there are no duplicate points. Thanks for the help!
The text was updated successfully, but these errors were encountered: