-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make HashMap use a per-process or per-thread seed instead of a per-instance seed #27243
Comments
Willing to mentor |
@gankro I'd be willing to take a look at this. |
Awesome! Go for it! |
Note that this is done today for ease of implementation, if a global seed
|
Just a quick comment. There are definitely other reasons to value a seeded HashMap other than DoS attacks, if that is interesting. The most common one I deal with is when I partition work among worker threads, possibly using a hash function, and then hope that when they arrive they will not have rotted out some bits of their hashes and make HashMap perf all wonky. The "per-thread" version might not suffer from this much, but you could easily imagine other ways to shoot yourself in the foot if you re-use seeded hashes too often. I need cluster-wide consistency, so I actually seed all my own stuff and personally rot out the bits, but ... Just thought I'd mention it! |
@frankmcsherry could you elaborate on this idea of bits rotting out? |
Sure, sorry. It may be tricky to make a version using only HashMap, so perhaps the answer is ... "bah!" :) Let's say you have a bunch of workers thinking they'd like to count the words in all the strings they have. So, they distribute the words among themselves using their favorite hash function. Once the words next arrive at each worker, each puts their words in a HashMap to count them up. If the hash function they used to distribute the strings is related to the one in the HashMap (e.g. the same) the low order bits will be fixed, and bad perf can happen. So, you may say: "hey well, you don't get to use HashMap's hash function to do the distribution", and I totally don't plan on trying that, but if anything bad happens because of how the keys are laid out (say I pre-aggregate all my word counts, and ship them, using HashMap.drain()) you can get accidentally crummy performance. Random stuff like this happens, and you beat your head against the wall trying to understand what went wrong, and then you learn about number theory and that you and the HashMap implementor chose the same prime number somewhere, and ... So, probably don't sweat it. :) |
Ah, neat! Thanks! |
This reduces how much rand is required in std, and makes creating hash-maps noticably faster. The first invocation of HashMap::new() in a thread goes from ~130_000 ns to ~1500 ns (i.e. 86x faster) and later invocations go from 10-20 ns to a more consistent 1.8-2.0ns. The mean for many invocations in a loop, on a single thread, goes from ~77ns to ~1.9ns, and the standard deviation drops from 2800ns to 33ns (the *massive* variance of the old scheme comes from the occasional reseeding of the thread rng). These new numbers are only slightly higher than creating the `RandomState` outside the loop and calling `with_state` in it (i.e. only measuring the non-random parts of hashmap initialisation): 1.3 and 18 ns respectively. This new scheme has the slight downside of being consistent on a single thread, so users may unintentionally rely on the order being fixed (e.g. two hashmaps with the same contents). Closes rust-lang#27243.
This is a rebase and extension of rust-lang#31356 where we cache the keys in thread local storage. This should give us a nice speed bost in creating hash maps along with mostly retaining the property that all maps have a nondeterministic iteration order. Closes rust-lang#27243
std: Cache HashMap keys in TLS This is a rebase and extension of #31356 where we not only cache the keys in thread local storage but we also bump each key every time a new `HashMap` is created. This should give us a nice speed bost in creating hash maps along with retaining the property that all maps have a nondeterministic iteration order. Closes #27243
Currently every call to HashMap::new makes a new RandomState that makes two new calls to the thread_rng to create its seed. This is, to my knowledge cryptographic overkill, as there is no clear distinction between a single HashMap with a seed and many HashMaps with the same seed, as long as that seed is not hardcoded (and therefore able to be attacked offline).
There is no argument to be had on minimizing the risk of a single compromise, as the SipHash security model is designed to make it impossible to compromise a single map. If our users disagree with this, we may consider exposing a constructor for RandomState that explicitly reseeds it.
This can be implemented via Once or through a Thread Local Static. Whichever is more efficient.
CC @alexcrichton
The text was updated successfully, but these errors were encountered: