RFC: Simplify std::hash
#823
Two alternatives are proposed for paring back the API of `std::hash` to be more ergonomic for both consumers and implementors.
cc @erickt, @pczarn, @sfackler, @gankro, @aturon, @tarcieri cc rust-lang/rust#20654 (lots of discussion there as well)
I've been talking with @alexcrichton as he prepared this RFC; thanks for the great writeup! FWIW, I prefer the Alternative approach. In particular, I think it resolves almost all of the ergonomic and complexity worries, leaving you with a programming model not unlike working with I particularly like the fact that the user of a data structure can easily choose the hashing algorithm to apply, since the best policy depends on context. We can use default type parameters, as we do with The downside of
for the alternative API also applies to the main proposal; it's just a general downside relative to what we have today. As @alexcrichton says, this is a somewhat orthogonal choice to the rest of the design, but does have a pretty drastic effect on simplifying code that bounds by OTOH, I would love to hear from some experts in this area whether a global salting is enough to mitigate DoS attacks. |
These two measures ensure that each `HashMap` is randomly ordered, even if the
same keys are inserted in the same order. As a result, it is quite difficult to
mount a DoS attack against a `HashMap` as it is difficult to predict what
collisions will happen.
Note that it is not strictly necessary to randomly seed each instance in order to avoid hashDoS. All of the instances of hashDoS can share the same random seed.
Use of a single global seed is both covered and recommended by the SipHash paper: https://131002.net/siphash/siphash.pdf
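As a rough illustration of the point about sharing one random seed, the sketch below creates a single `RandomState` and clones it into two maps, so both use the same SipHash key. The helper names `hash_with` and `maps_with_shared_key` are made up for this example, and it uses today's `std` API rather than anything proposed in the RFC.

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::hash::{BuildHasher, Hash, Hasher};

/// Hash `value` with a hasher built from `state` (hypothetical helper).
fn hash_with<T: Hash>(state: &RandomState, value: &T) -> u64 {
    let mut hasher = state.build_hasher();
    value.hash(&mut hasher);
    hasher.finish()
}

type Map = HashMap<&'static str, u32, RandomState>;

/// Build two maps that share one randomly generated key (hypothetical helper).
fn maps_with_shared_key() -> (Map, Map) {
    // One random key for the whole "process"; cloning copies the key.
    let global = RandomState::new();
    let a = HashMap::with_hasher(global.clone());
    let b = HashMap::with_hasher(global);
    (a, b)
}

fn main() {
    let (mut a, mut b) = maps_with_shared_key();
    a.insert("k", 1);
    b.insert("k", 2);
    // Both maps hash the same input identically because the key is shared.
    assert_eq!(hash_with(a.hasher(), &"k"), hash_with(b.hasher(), &"k"));
}
```

The key is still unpredictable to an attacker; it is simply generated once instead of once per map.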
Awesome, thanks for the reference! As I understand it, though, we would need to go with the Alternative, rather than Detailed Design, to continue to use SipHash for |
Thanks for weighing in @tarcieri!
It depends a little I think. The alternative solution would allow us to use SipHash for all keys by default. We could certainly move the key generation from per- |
I would really like to see some perf comparisons of these designs.
I've had some experience using the "Alternative" approach to create hashes in C++ and I've found it to work very well. There is even a proposal with a sample implementation to add similar functionality to the C++ standard. |
I tried adding a few benchmarks to a gist with their numbers as well: https://gist.github.com/alexcrichton/c4ea4935c77c348e1465 Algorithmically there's not a huge amount of difference already, but did you have some specific benchmarks in mind you'd like to see compared?
I was unaware of this, thanks for the link! |
A possible implementation of combine can be found [in the boost source
code][boost-combine].
That function appears to be part of boost functional. I couldn't find anything on that page that indicated that that function was secure against DOS attacks. It doesn't look like it would be particularly resistant to any attacks. Did I miss something where the boost developers discussed the security implications of that function?
As far as I know that assessment is correct. I did not find much information in Boost about DoS attacks and hashing with hashmaps.
FYI - It looks like Java worked around the DOS issue by turning any HashMap bucket with too many items into a balanced tree: http://openjdk.java.net/jeps/180 |
The implementation of the hash() function in Java tends to be exceptionally weak, although quite speedy. If the goal is ergonomics and speed, it seems like the main proposal is solid. However, if security by default is a goal, it seems to me that that proposal makes things difficult. The key to DOSing a HashMap is to find different values that all hash to the same bucket. The key to defending it is to make sure that an adversary can't do that. In order to make sure that an adversary can't do that, you need to introduce something into the calculation of the hash that they can't guess - for example, a large random seed affecting all input values and a secure hash function. The key to use cryptography properly is to use well designed, peer reviewed algorithms and then use them as intended. The primary design described here involves calculating the hash of each field and then combining them with the |
@alexcrichton thanks for the clear writeup. I too tend to prefer the alternate definition at this time, but I could be swayed that it is overly generic. One question I had right off the bat -- you stated explicitly that the "byte-oriented" API in use today was known to be a performance hazard for many hashes, which might prefer to consume a region of memory. As such, would it make sense to have methods for writing slices of the primitive types as well? |
@alexcrichton I am confused by the numbers in your gist. In particular, this section of the table:
what is The revision I was looking at is https://gist.github.com/alexcrichton/c4ea4935c77c348e1465/104f1907537dca4d8b958caa35eddf3cc6d06011 |
To be clear, do you think the mitigation strategies are not sufficient? I was under the impression, for example, that at least using a randomly keyed SipHash for strings would buy us a good bit, but your point about SipHash collisions may also nullify that.
I agree that this would be nice (@aturon mentioned this to me as well in our discussions), but the problem is that today if you take
Oh I'm sorry about that, I forgot to update the code in the gist. I've been adding some more things to benchmark over time. The gist should now be updated: https://gist.github.com/alexcrichton/c4ea4935c77c348e1465. Also the benchmarks' absolute numbers may want to be taken with a grain of salt, @gankro and @seanmonstar had the idea of benchmarking against C implementations and the numbers show Rust as up to 2x slower than the C implementations. I believe @huonw, however, was looking into this and found the performance pitfall fairly surmountable. |
@alexcrichton honestly, just one additional method for |
@nikomatsakis oh in that case the alternative API is indeed based on a |
@alexcrichton what I take away from those benchmarks is:
Do you agree with those conclusions?
It's also unclear to me how valuable the associated output type is in the alternative protocol. That seems like it would make working with generic code just a bit more tedious than the simple protocol, but without necessarily much gain or generality? |
We would like to eventually have HashMap parameterized over hash size so that you can store only 32-bit or 16-bit hashes for "known small (read: not catastrophically huge for 32-bit)" hashmaps, yielding better cache/space usage. This of course doesn't necessitate an associated type on the hasher; one can just truncate an unconditionally 64-bit hash. For a secure hash function this should be as good as directly producing 32 bits. For an insecure hash function this could be catastrophic (e.g. if your hash function for a u64 is the identity function, all numbers < 2^32 will hash to 0). |
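To make the truncation concern concrete, here is a small sketch. It assumes an identity "hash" for `u64` and shows two ways of cutting a 64-bit hash down to 32 bits; the function names are invented for the example. Keeping the high half maps every key below 2^32 to 0, while keeping the low half makes keys that differ only in their upper bits collide.

```rust
/// Identity "hash" for a u64: a deliberately weak hash function.
fn identity_hash(x: u64) -> u64 {
    x
}

/// Truncate by keeping only the high 32 bits.
fn truncate_high(h: u64) -> u32 {
    (h >> 32) as u32
}

/// Truncate by keeping only the low 32 bits.
fn truncate_low(h: u64) -> u32 {
    h as u32
}

fn main() {
    // Every key below 2^32 collapses to 0 when only the high bits are kept.
    for k in [1u64, 2, 3, 0xFFFF_FFFF].iter() {
        assert_eq!(truncate_high(identity_hash(*k)), 0);
    }
    // Keeping the low bits instead makes keys differing only above bit 32 collide.
    let a = identity_hash(1);
    let b = identity_hash(1 + (1u64 << 32));
    assert_eq!(truncate_low(a), truncate_low(b));
}
```

A hash function that mixes all input bits into all output bits does not have this problem with either truncation.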
In fact I'd go so far as to argue for 32-bit by default, since you need gigabyte-sized hashmaps to really start running into trouble (with a good hash fn, and even with us currently "wasting" a bit), and DOS attacks notably are based on small inputs. |
I think it depends on the algorithm you implement, but I'm not 100% sure that this is always the case. For example the alternative protocol requires that each implementation of For example our current implementation of SipHash works in this incremental fashion, but the C implementations benchmarked are not incremental (operate on an entire block at once). The C versions are much faster, although this is not 100% because we operate incrementally, but I would suspect that there is at least some overhead for handling incremental state even if we optimized our implementation.
In general though, yes, I believe your conclusions are accurate. The alternate protocol does vary quite a bit in speed, however, depending on what you're hashing. For example the "fast algorithm" for
I do think it depends on what you're doing when working with a hasher. If, for example, you want to operate generically over all hashers that produce a type Overall I've found
Interesting! I think I need to research some of the reasons why C++ chose |
Java, by contrast, uses i32. That said, hashCode is a pretty ancient API from before 64-bit was even a thing. I don't think 32 bits should always be used; I just think it should be the default. Once you're at gigabyte-tier hashmaps, you need to start thinking hard about your data structure tuning, and opting into 64-bit mode should be a trivial tweak. This is especially important because basically all of the hashmap's search time is spent doing a linear search on an array of hashes. Halving the element size could do wonders for cache efficiency, although it may be the case that average search lengths are too short to see real benefits.
@gankro I think Java made a number of mistakes where they chose 32-bit prematurely (cough Array.length cough)
@alexcrichton I'm not qualified to evaluate a non-standard hashing scheme. All I can do is look at the scheme and either say "this looks identical to something peer reviewed and designed for that purpose" or "this looks like something new to me". In this case, this scheme looks to me like it falls into the 2nd category. Currently, if you have a struct with two fields, A and B, the hash of the struct would be defined as The definition of the Of course, this proposal doesn't suggest either of the two definitions for
template <typename SizeT>
inline void hash_combine_impl(SizeT& seed, SizeT value)
{
    seed ^= value + 0x9e3779b9 + (seed<<6) + (seed>>2);
}
Is this secure? I certainly don't know how to break it. Let's assume it is secure. What we have here is a compression function. Compression functions are the basis of many cryptographic hashes such as MD5, SHA1, and the SHA2 family using the Merkle–Damgård construction. The potential definition of the Another thing to consider: every cryptographic hash function that I know of takes the length of the input into account. The proposal wouldn't, however. So, as I said, I'm not qualified to say if this is secure or not. However, for all the reasons above, I do think that this is doing something new and different and that even if SipHash is used as the hash for strings, I don't think that the security proofs for SipHash apply to this new construction. I don't really know where that leaves us for claiming protection against DOS attacks. If we were using SipHash (as is currently the case), due to all the work around SipHash, I think it's pretty reasonable to say that Rust is also resistant to those attacks. If we're doing something different, though, I don't know how we say that we are without having someone qualified to evaluate the scheme do so.
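For readers more at home in Rust, the Boost snippet quoted in this comment can be transcribed roughly as follows. This is only a sketch: it fixes the width at 64 bits, uses wrapping addition to mirror C++ unsigned overflow, and adds a hypothetical `hash_fields` helper that folds per-field hashes together starting from a seed of 0. It makes no claim about security, which is exactly the open question raised here.

```rust
/// Rust transcription of the Boost-style combine step (64-bit, wrapping arithmetic).
fn hash_combine(seed: u64, value: u64) -> u64 {
    seed ^ value
        .wrapping_add(0x9e37_79b9)
        .wrapping_add(seed << 6)
        .wrapping_add(seed >> 2)
}

/// Hypothetical helper: fold already-computed field hashes into one value.
fn hash_fields(field_hashes: &[u64]) -> u64 {
    field_hashes
        .iter()
        .fold(0, |seed, &h| hash_combine(seed, h))
}

fn main() {
    // With a zero seed the shifts contribute nothing: 0 ^ (1 + 0x9e3779b9).
    assert_eq!(hash_combine(0, 1), 0x9e37_79ba);
    // The fold is order-sensitive for these inputs.
    assert_ne!(hash_fields(&[1, 2]), hash_fields(&[2, 1]));
}
```

Note that the fold never mixes in the number of fields, which matches the concern above that the construction does not take the input length into account.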
I think periodically rotating the global key in a running program has no value. If the key has been disclosed, all HashMaps are poisoned at that point. We have to make a new key and start over. If you start lazily rekeying them, you still have poisoned ones sitting around that are exploitable. The simplest solution to the problem is rebooting the app and having it grab a brand new key from the OS. Also, in general, when it comes to security, less complexity is better. I think these sort of hardening attempts should be introduced in response to a real world problem, as needlessly adding complexity often causes more problems than it solves. tl;dr: use one global SipHash key obtained from OS's secure RNG for hashDoS resistance. Don't use HashMap-specific keys. Don't use a userspace CSPRNG. One global key should be fine. |
@gankro I was wondering about what you said regarding the downside of hardcoding to |
I was also debating the question of whether we should make the hasher fn parametric or the trait. If you did want to have an object type that was hashable, making the hasher fn parametric certainly makes that harder -- at least if you want that object to respect the hasher the hashtable is using. You could do it by passing in a OTOH, if we made the hasher trait itself parametric, then you can have an object of type Overall, I suspect the RFC is making the right call here, just wanted to have this tradeoff fully described (it is alluded to in the RFC). |
@nikomatsakis So an important thing to keep in mind is soft and hard hash collisions (terms I am making up for this discussion). A soft collision occurs when two hashes happen to be equal when truncated to Our current design looks like:
Ideally we don't even get soft collisions. Everything goes into a bucket and everyone's happy. However, if a soft collision occurs, we fast-path search by comparing on hard collisions. We only need to compare keys on a hard collision, which for 64-bit (really 63-bit) good hash functions, is basically guaranteed to only occur on the element we want. So our access pattern looks like If we have a bad hash function that provides only hard collision guarantees (like e.g. a u64 hashes to itself), and then truncate in our storage, we've converted all of the soft collisions at that truncation into hard collisions. This isn't catastrophic, of course. Just slow. Also it means that growing will never relieve the collision pressure, because that only resolves soft collisions. |
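The distinction can be sketched in code. The definitions in the comment above are partly lost, so this is one reading of them: a soft collision means two different full hashes land in the same bucket after masking to the table capacity, and a hard collision means the full hashes are equal. All names here are made up for the illustration, and the capacity is assumed to be a power of two.

```rust
/// Bucket index for a power-of-two capacity (assumed layout, not the real HashMap code).
fn bucket_index(hash: u64, capacity: usize) -> usize {
    (hash as usize) & (capacity - 1)
}

/// Soft collision: different full hashes, same bucket at this capacity.
fn is_soft_collision(a: u64, b: u64, capacity: usize) -> bool {
    a != b && bucket_index(a, capacity) == bucket_index(b, capacity)
}

/// Hard collision: the full hashes are equal, so growing cannot separate them.
fn is_hard_collision(a: u64, b: u64) -> bool {
    a == b
}

fn main() {
    // 1 and 17 share bucket 1 in a 16-slot table...
    assert!(is_soft_collision(1, 17, 16));
    // ...but growing to 32 slots separates them.
    assert!(!is_soft_collision(1, 17, 32));
    // Equal full hashes collide at every capacity.
    assert!(is_hard_collision(42, 42));
    assert_eq!(bucket_index(42, 16), bucket_index(42, 16));
}
```

Storing truncated hashes shrinks the space in which hard collisions are detected, which is why a weak hash function that is only collision-free over the full 64 bits suffers from truncation.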
@gankro Everything you're saying makes sense, but I'm not sure how much it matters in practice. In particular, we do truncate to fit to the number of buckets, and that already favors hash functions that place their entropy in the low-order bits (I assume), so it doesn't seem awful to me to favor it some more. That said, on purely ergonomic grounds, it seems to me superior to keep the associated type, because it means that So a further 👍 to the design as written. |
* Due to slices being able to be hashed generically, each byte will be written
  individually to the `Sha1` state, which is likely to not be very efficient.
* Due to slices being able to be hashed generically, the length of the slice is
There's an alternate approach to this, similar to how Haskell makes its strings print as strings despite just being normal lists of characters.
trait Hash {
    fn hash<H: Hasher>(&self, h: &mut H);

    // Default: hash each element in turn. `Self: Sized` is needed for `&[Self]`.
    fn hash_slice<H: Hasher>(self_: &[Self], h: &mut H)
    where
        Self: Sized,
    {
        for x in self_ { x.hash(h) }
    }
}

impl Hash for u8 {
    fn hash<H: Hasher>(&self, h: &mut H) { h.write_u8(*self) }

    // Override: feed the whole byte slice to the hasher in one call.
    fn hash_slice<H: Hasher>(self_: &[u8], h: &mut H) { h.write(self_) }
}

impl<'a, T: Hash> Hash for &'a [T] {
    fn hash<H: Hasher>(&self, h: &mut H) {
        Hash::hash_slice(*self, h);
    }
}
This allows any sequence of contiguous things to ensure they're hashed quickly.
Downside: a custom implementation of hash_slice may change behaviour, for some hashing algorithms.
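The effect of the `hash_slice` hook can be observed with a hasher that merely counts calls. The sketch below uses the `Hash::hash_slice` method as it exists in today's standard library, which to my knowledge forwards a `&[u8]` to the hasher in a single `write`; `CountingHasher` and the two helper functions are invented for this example.

```rust
use std::hash::{Hash, Hasher};

/// Toy hasher that records how many `write` calls and bytes it receives.
#[derive(Default)]
struct CountingHasher {
    write_calls: usize,
    bytes: usize,
}

impl Hasher for CountingHasher {
    fn write(&mut self, bytes: &[u8]) {
        self.write_calls += 1;
        self.bytes += bytes.len();
    }

    fn finish(&self) -> u64 {
        self.bytes as u64
    }
}

/// Hash a byte slice through `hash_slice`; returns (write calls, bytes seen).
fn count_slice_writes(data: &[u8]) -> (usize, usize) {
    let mut h = CountingHasher::default();
    Hash::hash_slice(data, &mut h);
    (h.write_calls, h.bytes)
}

/// Hash the same bytes one element at a time; returns (write calls, bytes seen).
fn count_elementwise_writes(data: &[u8]) -> (usize, usize) {
    let mut h = CountingHasher::default();
    for x in data {
        x.hash(&mut h);
    }
    (h.write_calls, h.bytes)
}

fn main() {
    let data = [1u8, 2, 3, 4];
    println!("hash_slice:  {:?}", count_slice_writes(&data));
    println!("elementwise: {:?}", count_elementwise_writes(&data));
}
```

For a block-oriented algorithm such as SHA-1, receiving one contiguous buffer rather than one call per byte is where the speedup would come from.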
This was discussed some on IRC, but just for the record: one problem with what @huonw suggests is that, without extending the |
Hm, I think I may actually lean towards @huonw's suggestion, but there's a clarification point I'd like to make. First, I do think that adding One part that would be nice to do, however, is to translate hashing With respect to a particular hash value, I think that it is OK to say that the hash value (even using the same algorithm) is only guaranteed to be precisely the same within the same process. Once you cross machines I think it's probably fair to say that the hash value of our primitives are subject to change (even using the same algorithm). This would allow for the endianness change above, and it would also segue nicely into modifying So all in all I'm in favor of @huonw's extension, although I'd like to confirm the various tweaks to semantics here and there. |
So, this is the same confusion we were having on IRC last night: the proposal is not to add it to
I'm a bit confused here (perhaps related to the above confusion): what exactly does it call
Given the use cases you've outlined for this trait, that seems quite reasonable.
With @huonw's sketch the It ends up being a kinda weird series of indirections (slice => hash_slice => write) but the idea is to allow the implementation of |
This kind of went past my head last week, but it just occurred to me that some of this discussion has been conflating hash coding and hashing. The Java scheme of having a hashCode() method is different from having an element produce its hash. For a hash, you want:
This gives you all the truncation properties that you want. For hash coding you just need:
Hash codes are basically a weak hash function that the hashmap can further hash to get one with nice truncation properties. This works great for <=32-bit integers, hashcode(x) = x, and then the map can hash it to be useful. You can also theoretically aggregate values in a faster way than running siphash on the whole dataset, because you have weaker theoretical needs than a proper hash function. See http://opendatastructures.org/ods-python/5_3_Hash_Codes.html for details. Unfortunately I'm not well versed in the actual reality of how hash coding works out. Are the theoretical perf gains realizable in practice? If you realize the perf, are you guarding against DOS attacks? If you can cause a collision in the hash codes, you've caused a collision in the hashes -- hence the classic |
I remember some of the talk about faster hashing mentions the fact that we may not be able to treat a |
> choice][cpp-hash] here as well, but it is quite easy to use one instead of
> the other.

## Hashing algorithm
Reading through this again, this is what the lingering hashing-vs-hash-coding feeling was triggered by. The Java API is a hash coding one, not a hashing one. Presumably the hashmap will hash the hash codes yielded. The C++ design, meanwhile, is ambiguous to me. Its description sounds like hashing, but its guarantees look like hash coding. (referring to this API)
`read` using various techniques to prevent memory safety issues. A DoS attack
against a hash map is such a common and well known exploit, however, that this
RFC considers it critical to consider the design of `Hash` and its relationship
with `HashMap`.
In fact I believe it is the opposite: We should guard against DoS attacks by default because it is obscure. Everyone knows about signing and password hashing, but HashMap DoS is obscure. I provide this simple cost-benefit analysis:
- If the user is aware of DoS attacks and their problem domain, they can override whatever default to get the behaviour they need/want.
- If the user is unaware and the default is secure, then at worst their application runs a bit slow.
- If the user is unaware and the default is insecure, then at worst their application gets taken down and costs big $$$.
This may be out of scope for the RFC, but I'd just like to leave this here regardless: I would be interested in the standard library exporting two types: SecureHasher and FastHasher. SecureHasher will protect you from DoS using industry standard techniques (random seeds, strong hash functions), while FastHasher will provide fast, insecure, deterministic hashing. It's a bit awkward that today we provide Sip as a secure randomized default, with nowhere to point for "what if I want to go fast or deterministic" (there's some stuff in crates.io, but I don't know what their maintenance story is). I'm not sure if we should further newtype SecureHasher as DefaultHasher, or if HashMap should strictly default to SecureHasher. Doing the full newtyping theoretically lets us change the default to FastHasher, but I'd argue that this is really a breaking change to do from a security perspective. Explicitly stating the default is "Secure" in the HashMap type declaration also hints to users that maybe they want to consider overriding the default (because they don't need security), while also warning that overriding the default is potentially dangerous. |
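A minimal sketch of what the "secure by default, fast by opt-in" split could look like with today's `std` types follows. `FastHasher`, `SecureMap`, `FastMap`, and `hash_with` are hypothetical names for this example, not anything the standard library exports; the fast hasher here is 64-bit FNV-1a, which is deterministic and offers no DoS resistance.

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::hash::{BuildHasher, BuildHasherDefault, Hash, Hasher};

/// Hypothetical "fast, deterministic" hasher: 64-bit FNV-1a. Not DoS resistant.
struct FastHasher(u64);

impl Default for FastHasher {
    fn default() -> Self {
        FastHasher(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for FastHasher {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= b as u64;
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// Randomly keyed default (SipHash-based `RandomState` in std).
type SecureMap<K, V> = HashMap<K, V, RandomState>;
/// Explicit opt-in to deterministic hashing.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<FastHasher>>;

/// Hash `value` with a hasher built from `state` (hypothetical helper).
fn hash_with<S: BuildHasher, T: Hash>(state: &S, value: &T) -> u64 {
    let mut hasher = state.build_hasher();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let mut secure: SecureMap<&str, u32> = HashMap::new();
    let mut fast: FastMap<&str, u32> = HashMap::default();
    secure.insert("key", 1);
    fast.insert("key", 1);
    assert_eq!(secure.get("key"), fast.get("key"));

    // The fast hasher gives the same result in every map and every run.
    let a = BuildHasherDefault::<FastHasher>::default();
    let b = BuildHasherDefault::<FastHasher>::default();
    assert_eq!(hash_with(&a, &"key"), hash_with(&b, &"key"));
}
```

Naming the hasher in the map's type, as the aliases do, is the "hint to users" mentioned above: choosing `FastMap` is a visible decision at the declaration site.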
Replacing Secure with Fast would be like changing a standard "sign" API that uses SHA2 to use MD5. Technically valid, unlikely to affect most, but horribly catastrophic in some cases if gone unnoticed. (ignoring compatibility issues)
MD5 is completely broken and there is absolutely no reason it should be used. If you don't need security guarantees and only want performance, use something like CRC32/64. If you want a fast cryptographically secure hash function, use Blake2(b).
For primitives like
For now my plan is to expose the One of the benefits of moving the |
Thanks everyone for a fantastic discussion here! There's been a clear consensus for a while now about the basic thrust of this RFC; some details like |
Ack! I'm sad I missed out on this. Overall I agree with this discussion. Just to be clear though, is the accepted design the one mentioned in the detailed design? Many people spoke of liking the alternative design, but I believe those comments were made before the alternatives section was added to the RFC. @alexcrichton: Did you do any benchmarks on treating non-slice POD types as a |
@erickt yes the "Detailed design" section is the current implementation. I'm not quite sure I follow with what you're looking to benchmark though, what do you mean by casting a type like |
@alexcrichton: I should have said "transmute POD types into a fixed sized slice". For example:
use std::mem;

fn into_fixed_sized_slice(value: u64) -> [u8; 8] {
    unsafe {
        mem::transmute(value)
    }
}

fn hash(slice: &[u8]) -> u64 {
    // super secure hash
    let mut hash = 0;
    for value in slice {
        hash += *value as u64;
    }
    hash
}

fn main() {
    let value = 12345;
    let hash = hash(&into_fixed_sized_slice(value));
    println!("{:?}", hash);
}
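For comparison, the same conversion can be written without `unsafe` using `u64::to_ne_bytes`, a standard-library method that was stabilized well after this discussion. The sketch below keeps the toy byte-summing "hash" from the example above; `into_bytes` and `byte_sum` are names chosen for this illustration.

```rust
/// Safe equivalent of transmuting a u64 into its native-endian bytes.
fn into_bytes(value: u64) -> [u8; 8] {
    value.to_ne_bytes()
}

/// Toy "hash" that sums the bytes, as in the example above; not a real hash function.
fn byte_sum(bytes: &[u8]) -> u64 {
    bytes.iter().map(|&b| b as u64).sum()
}

fn main() {
    let value: u64 = 12345;
    // 12345 is 0x3039, so the non-zero bytes are 0x39 (57) and 0x30 (48).
    assert_eq!(byte_sum(&into_bytes(value)), 105);
    println!("{:?}", byte_sum(&into_bytes(value)));
}
```

Because the array has a fixed length known at compile time, the optimizer has the same information as in the `transmute` version; whether a given hashing algorithm benefits still has to be measured.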
Ah, in that case I think that it's tough to benchmark because it basically depends on the hashing algorithm to see whether LLVM can optimize or not. I doubt, however, that just overwriting I suppose it really boils down to optimizing one particular hashing algorithm. If you have one in mind, I'm sure we could poke around at it!
Hello, is there anything that guarantees the behavior of Related: rust hash() and hasher.write() results are not the same