Decrease HashMap Load Factor #38003
Comments
Concrete performance numbers for various load factors would probably be helpful.
Honestly, why isn't the load factor configurable to begin with? That would make it quite a bit easier to tune for specific use cases.
I was going to propose 0.833 (5/6) based on some tests I ran earlier, but we need more numbers to make an informed decision.
Looks like 0.8 gives a 20% improvement on reads (OK, the number of buckets read does not translate directly into performance, but still).
So 0.8, compared to 0.833, consumes 5% more memory and gives 15-20% shorter collision chains.
@funny-falcon how did you come up with these numbers?
@abonander I was also surprised not to be able to choose the load factor; making it configurable would be a good idea. Anyway, the question of what a sensible default should be remains. @pczarn I find such long tails in the lookup distribution strange ... I had the impression the variance would be much smaller. The shape does not look right either: for a high load factor it should start with a gentle slope and become steep near the average. Anyway, lookups are not the problem, as we are talking about 1 cache line most of the time, sometimes 2, rarely more than 2. I would also point out that for inserts the cost is "amplified" by the payload size. We are not only traversing hashes: an average of 10 means, in addition to the hashes, displacing 10 times the size of a key and value. This makes such large numbers even more scary.
@arthurprs only with mathematics and the numbers from the table above. (Note: I've edited my comment to say 15-20% shorter collision chains, not performance. Because of the great cache locality of Robin Hood hashing, the performance difference will usually be smaller.) For used space:
@pssalmeida are those back-of-the-envelope calculations? They feel a bit odd to me, because they say that the average probe count for reading at 0.9 load factor should be 2.56. I've recently measured plain linear probing and Robin Hood hashing and I get 5.48 as the average probe length (for the elements which are in the table). Perhaps my measurements are wrong, but it still might be a good idea to instrument and measure the current implementation, as suggested by @sfackler.
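A measurement of this kind could be sketched as follows. This is not @matklad's code: it is a minimal simulation of plain linear probing, assuming uniformly random hashes (generated with a SplitMix64 step, chosen here only for convenience) and no deletions. The names `splitmix64` and `mean_probe_length` are made up for illustration.

```rust
/// SplitMix64 step: a small PRNG standing in for real key hashes (assumption).
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Fills a linear-probing table of `capacity` buckets up to `load` (< 1.0)
/// and returns the mean number of buckets examined by a successful lookup.
/// Without deletions, finding an element later costs exactly as many probes
/// as inserting it did, so the insertion probes are summed directly.
fn mean_probe_length(capacity: usize, load: f64, seed: u64) -> f64 {
    assert!(load > 0.0 && load < 1.0, "load must be in (0, 1)");
    let mut occupied = vec![false; capacity];
    let mut state = seed;
    let n = (capacity as f64 * load) as usize;
    let mut total_probes: u64 = 0;
    for _ in 0..n {
        let mut idx = (splitmix64(&mut state) % capacity as u64) as usize;
        let mut probes = 1u64;
        while occupied[idx] {
            idx = (idx + 1) % capacity; // step to the next bucket, wrapping
            probes += 1;
        }
        occupied[idx] = true;
        total_probes += probes;
    }
    total_probes as f64 / n as f64
}
```

Calling `mean_probe_length(1 << 20, 0.9, 42)` gives one sample to compare against the 5.48 quoted above. Robin Hood reordering changes how the probe lengths are distributed among elements, not their total, so the mean should be the same for both schemes.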
@matklad I was checking your test-repo, I can't wrap my head around the timing/cache results. It should take less time and put less pressure on the cache, it makes no sense to me whatsoever.
@arthurprs this puzzles me as well! I want to plot the actual distribution of probe lengths and to run some specific tests about cache access (basically execute queries like "sum of the elements from i to k + i" with different distributions of
A possible solution to the puzzle:
And this is a clear result of the algorithm: it prefers to make short chains longer in order to cut the probability of a "too long chain". I think that if you compare the "distribution of number of buckets" between linear probing and RH hashing (as @pczarn did above for RH), it will explain your timing and cache misses.
That's it @funny-falcon. It's probably worth pointing out that the hash algorithm is "~perfect" and there are no deletions involved, so that's the best case for LP. (Results for RH LP and plain LP followed.)
@matklad I made a mistake. I was using the results from https://arxiv.org/abs/1605.04031, which I had just skimmed, but now I noticed that they discuss random probing. Meanwhile I found this talk https://www.ime.usp.br/~yoshi/FoCM2014/talks/viola.pdf, which states that the mean is (1 + 1/(1-a))/2, which gives exactly your experimental results! So, the results are worse, even for lookups, which is another reason to decrease the load factor.
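To make the correction concrete, here is a small sketch of both means. The linear-probing formula is the one quoted from the talk; the random-probing formula, (1/a) ln(1/(1-a)), is the standard textbook result and is assumed here to be what the earlier figures were based on, since it reproduces the 2.56 mentioned for a = 0.9. Function names are illustrative only.

```rust
/// Mean probes per successful lookup under linear probing: (1 + 1/(1-a)) / 2.
fn linear_probing_mean(a: f64) -> f64 {
    (1.0 + 1.0 / (1.0 - a)) / 2.0
}

/// Mean probes per successful lookup under random probing:
/// (1/a) * ln(1/(1-a)). Assumed to be the source of the earlier figures.
fn random_probing_mean(a: f64) -> f64 {
    (1.0 / (1.0 - a)).ln() / a
}
```

At a = 0.9 these evaluate to roughly 5.5 and 2.56 respectively, which matches the measured 5.48 on the one hand and the earlier estimate on the other.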
Updated table with the new formula for the average number of buckets in a lookup, for linear probing.
Again, I care far more about actual timings of real operations than about probe counts.
@pssalmeida Let me know if you'd like me to run numbers. I've been running HashMap benchmarks extensively over the last couple of months.
@sfackler sure. It would be nice if someone who has already been doing such measurements on HashMap could volunteer. The experiments by @matklad already show that timing and cache misses can increase over simple linear probing. Intuitively, for simple LP, most hashes will have smaller distances to origin than with RH LP, and will incur a single cache line read. So, even if some are farther away, as long as they remain in a second cache line it won't matter much, and only rarely will a third cache line be visited. For RH LP, most hashes will have a distance to origin closer to the mean. If that is not very small, a non-negligible number of lookups may incur a second cache line read, giving worse results.
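The cache line intuition can be put into numbers with a rough sketch. It assumes the hashes sit in a contiguous array that starts on a cache line boundary, and it ignores wrap-around at the end of the table. The 8-byte slot and 64-byte line used in the usage note are assumptions as well, not values taken from the implementation.

```rust
/// Number of distinct cache lines touched when scanning `probes` consecutive
/// hash slots starting at slot `start`. Assumes a line-aligned, contiguous
/// hash array and no wrap-around (both simplifications).
fn cache_lines_touched(start: usize, probes: usize, slot_bytes: usize, line_bytes: usize) -> usize {
    if probes == 0 {
        return 0;
    }
    let first_byte = start * slot_bytes;
    let last_byte = (start + probes) * slot_bytes - 1;
    last_byte / line_bytes - first_byte / line_bytes + 1
}
```

With 8-byte hashes and 64-byte lines, a lookup that starts at the last slot of a line already needs a second line after 2 probes, while one that starts at the first slot can probe 8 slots within a single line. That is why a distribution concentrated around a mean of 5 or so can cost more line reads than one concentrated near 1 with a longer tail.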
@arthurprs If you could run some tests it would be great. An interesting case would be doing lots of insert-delete cycles in a table loaded to full capacity.
EDIT: updated numbers below
Numbers look good to me.
@abonander One potential concern there is that the load factor you'd select is somewhat tied to the implementation. If we significantly changed HashMap's internals (as we're probably going to do to resolve the n^2 thing), the load factor someone has selected may all of a sudden be unacceptably slow.
@sfackler Then it may be better to abstract the load factor behind an enum that just lets them select a load factor that produces the desired effect from the implementation, maybe like the following:

    /// An abstraction over a hashtable's load factor which allows the user to choose which aspect they
    /// would prefer to optimize, possibly at a detriment to other metrics.
    pub enum LoadFactor {
        /// Use a load factor that produces the fastest lookup times without seriously degrading other aspects.
        OptimizeLookup,
        /// Use a load factor that produces the fastest insertion times without seriously degrading other aspects.
        OptimizeInsert,
        /// Use a load factor that produces the lowest memory usage without seriously degrading other aspects.
        SaveMemory,
        /// Use a load factor with good all-round performance.
        Balanced,
        /// A custom load factor whose performance may be affected by changes to the implementation.
        Custom(f64),
    }

Then the default would be
The fill factor isn't guaranteed or exposed directly by the API, so it shouldn't be a problem. An API for exposing/changing the load factor will likely need an RFC.
Complete results now; the code is here: https://gist.github.com/arthurprs/97dabb2c28f8d778dc377d69d0b95758 It's really important to emphasize that only the grow benchmark is realistic; the others simulate the worst-case performance of each load factor.
@arthurprs Thanks for the benchmarks. The lru_sim, even if not realistic, shows an impressive difference, but even the others show that there is considerable impact going from 0.91 to lower load factors.
Going over some issues I noticed that people subscribed to this might also be interested in #38368
Now that #40561 is in, this is the only remaining place I can think of where we can extract nice gains (at least without some kind of major rewrite with its own trade-offs). With the updated code above I conducted several benchmarks with various load factors: 0.909 (today), 0.875, 0.857, 0.833 and 0.8. The current implementation is reasonably compact, so it wouldn't be unreasonable to set it down to 0.80, as that's still high (the highest?) compared to most stdlibs. Something in between is probably the way to go though.
I found out that the lookup benchmarks are broken. I will do another round of tests soon.
Ok, I found some time today and here are the final results. It turns out that benchmarking across load factors is really hard. Tested load factors: 90.9% (current), 87.5%, 85.7%, 83.3% and 80%.
lru_sim (delete + insert), insert and lookup are expected to perform the same because the load factor is effectively fixed in those benchmarks.
And some extra worst-case tests, where constants are set to 95% of .capacity() after creation with with_capacity(X): https://gist.github.com/arthurprs/ae084a0c5a119ddd732d93eb6fd31d05
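For readers who do not want to open the gist, the setup described might look roughly like this. It is a guess at the shape of the benchmark, not the gist's code: the helper name `fill_to_95_percent` and the `u64` key and value types are assumptions.

```rust
use std::collections::HashMap;

/// Builds a map with `with_capacity(requested)` and fills it to 95% of the
/// capacity the map actually reports, so that no resize is triggered and the
/// table stays close to its maximum load.
fn fill_to_95_percent(requested: usize) -> HashMap<u64, u64> {
    let mut map: HashMap<u64, u64> = HashMap::with_capacity(requested);
    // `capacity()` may report more than was requested, so it is read back
    // from the map rather than derived from `requested`.
    let target = map.capacity() * 95 / 100;
    for k in 0..target as u64 {
        map.insert(k, k);
    }
    map
}
```

Sizing from `.capacity()` rather than from X matters because the reported capacity is only guaranteed to be at least X, so 95% of X could leave the table noticeably below the load the test is meant to exercise.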
This may not be as significant with the new hashmap (hashbrown) implementation. @Amanieu might want to add his thoughts here.
The main concern here was that the high load factor would cause many entries to be moved during an insertion. Since the new hash table implementation does not move existing entries on insert, I think this can be closed.
Currently HashMap uses a load factor of approximately 0.9. While this is appropriate for reads (get), as Robin Hood Hashing gives a low distance from the starting point with very low, effectively fixed variance, such is not the case for writes (insert).
When writing over an occupied bucket, it will incur a displacement involving reads and writes over several buckets, either for the new value or for the values already in the map. Not only is this number larger than for reads, it also has an increasing variance, quickly becoming very large as the load approaches 1.
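The displacement described here can be illustrated with a minimal Robin Hood insertion that counts how many buckets it writes. This is a simplified sketch, not the standard library's code: it uses swap-based displacement, a hypothetical `Entry` type that stores the home slot directly instead of a hash, and it does not handle duplicate keys or resizing.

```rust
/// One occupied bucket: the key's ideal slot (hash modulo capacity) and a payload.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Entry {
    home: usize,
    value: u64,
}

/// Distance of slot `idx` from an entry's home slot, with wrap-around.
fn distance(home: usize, idx: usize, capacity: usize) -> usize {
    (idx + capacity - home) % capacity
}

/// Inserts `entry` using Robin Hood linear probing and returns how many
/// buckets were written. The table must contain at least one empty bucket.
fn robin_hood_insert(table: &mut [Option<Entry>], mut entry: Entry) -> usize {
    let capacity = table.len();
    let mut idx = entry.home;
    let mut written = 0;
    loop {
        match table[idx] {
            None => {
                table[idx] = Some(entry);
                return written + 1;
            }
            Some(resident) => {
                // Swap when the resident is closer to its home than the
                // entry being carried; the resident then continues probing.
                if distance(resident.home, idx, capacity) < distance(entry.home, idx, capacity) {
                    table[idx] = Some(entry);
                    entry = resident;
                    written += 1;
                }
            }
        }
        idx = (idx + 1) % capacity;
    }
}
```

Every swap writes a full key and value, which is why the payload size multiplies the cost of an insert, whereas a lookup only has to read hashes along the way.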
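For a load factor a (UPDATED: previous lookup values were for random probing), the average number of buckets probed in a get is $\frac{1}{2}\left(1 + \frac{1}{1-a}\right)$; the previously stated $\frac{1}{a}\ln\frac{1}{1-a}$ applies to random probing.

In a table, the average buckets written, the 95th percentile buckets written, and the average buckets probed in a get, for different load factors (only the get column is recoverable here; the load factors are those implied by the two formulas):

| Load factor a | Avg. buckets probed in get (updated) | Previous value (random probing) |
|---|---|---|
| 0.5 | 1.5 | 1.39 |
| 0.667 | 2.0 | 1.65 |
| 0.75 | 2.5 | 1.84 |
| 0.8 | 3.0 | 2.01 |
| 0.833 | 3.5 | 2.15 |
| 0.857 | 4.0 | 2.27 |
| 0.875 | 4.5 | 2.38 |
| 0.889 | 5.0 | 2.47 |
| 0.9 | 5.5 | 2.56 |
| 0.95 | 10.5 | 3.15 |
| 0.99 | 50.5 | 4.65 |

It can be seen that for a = 0.9, the average probes when reading is just 5.5 (2.56 before the update), but the average number of buckets written when inserting is 10 and the 95th percentile is 28 buckets. For deletes it will not be much better when doing the backwards shift, as most keys will be displaced from their origin (the very low variance behaviour of RH).

While for many cases the average over the map lifetime may not be much impacted, if a map is being operated near capacity, with elements constantly being inserted and others removed, e.g., when implementing an LRU cache, the number of buckets (re)written per insert (and also per delete) will be manifestly higher than desired.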
While the exact load factor choice can be argued about, it seems that 0.9 is manifestly too high. Going from 0.9 to 0.8 would have a 12.5% memory overhead, while going from 0.8 to 0.9 doubles the average number of buckets written when inserting, and more than doubles the 95th percentile. Therefore, a load factor around 0.8 seems more reasonable.
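The trade-off in the last paragraph can be checked with a few lines of arithmetic. The two write-cost functions below are assumptions: a mean of 1/(1-a) buckets written and a tail of P(at least k buckets written) = a^k. They are used here only because they reproduce the figures quoted for a = 0.9 (an average of 10 and a 95th percentile of about 28). All function names are illustrative.

```rust
/// Extra memory needed when lowering the load factor from `from` to `to`,
/// as a fraction (0.125 means 12.5% more buckets for the same element count).
fn memory_overhead(from: f64, to: f64) -> f64 {
    from / to - 1.0
}

/// Assumed model: mean buckets written per insert is 1 / (1 - a).
fn mean_buckets_written(a: f64) -> f64 {
    1.0 / (1.0 - a)
}

/// Assumed model: P(at least k buckets written) = a^k, so the 95th
/// percentile is the k at which a^k = 0.05.
fn p95_buckets_written(a: f64) -> f64 {
    0.05_f64.ln() / a.ln()
}
```

Under these assumptions, moving from 0.9 to 0.8 costs 12.5% more buckets, halves the mean number of buckets written (10 to 5), and brings the 95th percentile from about 28 down to about 13.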