Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Simplify std::hash #823

Merged
merged 5 commits into from
Feb 17, 2015
Merged

RFC: Simplify std::hash #823

merged 5 commits into from
Feb 17, 2015

Conversation

alexcrichton
Copy link
Member

Two alternatives are proposed for paring back the API of std::hash to be more
ergnomic for both consumers and implementors.

Rendered

Two alternatives are proposed for paring back the API of `std::hash` to be more
ergnomic for both consumers and implementors.
@alexcrichton
Copy link
Member Author

cc @erickt, @pczarn, @sfackler, @gankro, @aturon, @tarcieri

cc rust-lang/rust#20654 (lots of discussion there as well)
cc rust-lang/rust#19673

@alexcrichton alexcrichton self-assigned this Feb 11, 2015
@aturon
Copy link
Member

aturon commented Feb 11, 2015

I've been talking with @alexcrichton as he prepared this RFC; thanks for the great writeup!

FWIW, I prefer the Alternative approach.

In particular, I think it resolves almost all of the ergonomic and complexity worries, leaving you with a programming model not unlike working with Writer in IO -- but also offers the greatest potential for DoS mitigation and flexibility for hashing algorithms.

I particularly like the fact that the user of a data structure can easily choose the hashing algorithm to apply, since the best policy depends on context. We can use default type parameters, as we do with HashMap today, to provide what we believe to be the best global defaults. This flexibility is likely to help beyond the question of DoS protection.

The downside of

Implementations of Hash cannot be specialized and are forced to operate generically over the hashing algorithm provided.

for the alternative API also applies to the main proposal; it's just a general downside relative to what we have today. As @alexcrichton says, this is a somewhat orthogonal choice to the rest of the design, but does have a pretty drastic effect on simplifying code that bounds by Hash.

OTOH, I would love to hear from some experts in this area whether a global salting is enough to mitigate DoS attacks.

These two measures ensure that each `HashMap` is randomly ordered, even if the
same keys are inserted in the same order. As a result, it is quite difficult to
mount a DoS attack against a `HashMap` as it is difficult to predict what
collisions will happen.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that it is not strictly necessary to randomly seed each instance in order to avoid hashDoS. All of the instances of hashDoS can share the same random seed.

@tarcieri
Copy link

OTOH, I would love to hear from some experts in this area whether a global salting is enough to mitigate DoS attacks.

Use of a single global seed is both covered and recommended by the SipHash paper:

https://131002.net/siphash/siphash.pdf

7 Application: defense against hash flooding
We propose that hash tables switch to SipHash as a hash function. On startup a
program reads a secret SipHash key from the operating system’s cryptographic
random-number generator; the program then uses SipHash for all of its hash
tables.

@aturon
Copy link
Member

aturon commented Feb 11, 2015

@tarcieri

Use of a single global seed is both covered and recommended by the SipHash paper

Awesome, thanks for the reference!

As I understand it, though, we would need to go with the Alternative, rather than Detailed Design, to continue to use SipHash for HashMap, because SipHash requires some state and prefers to work in a byte-oriented style. @alexcrichton can you confirm?

@alexcrichton
Copy link
Member Author

Thanks for weighing in @tarcieri!


@aturon

As I understand it, though, we would need to go with the Alternative, rather than Detailed Design, to continue to use SipHash for HashMap, because SipHash requires some state and prefers to work in a byte-oriented style. @alexcrichton can you confirm?

It depends a little I think. The alternative solution would allow us to use SipHash for all keys by default. We could certainly move the key generation from per-HashMap::new call to a lazily initialized global variable pretty easily. If we were to go with the detailed design section then we could guarantee that some standard types (like strings) use a randomly-seeded SipHash instance, but we would not have the guarantee that all data is hashed with SipHash.

@Gankra
Copy link
Contributor

Gankra commented Feb 11, 2015

I would really like to see some perf comparisons of these designs.

@ctrychta
Copy link

I've had some experience using the "Alternative" approach to create hashes in C++ and I've found it to work very well. There is even a proposal with a sample implementation to add similar functionality to the C++ standard.

@alexcrichton
Copy link
Member Author

@gankro

I would really like to see some perf comparisons of these designs.

I tried adding a few benchmarks to a gist with their numbers as well:

https://gist.github.com/alexcrichton/c4ea4935c77c348e1465

Algorithmically there's not a huge amount of difference already, but did you have some specific benchmarks in mind you'd like to see compared?


@ctrychta

There is even a proposal with a sample implementation to add similar functionality to the C++ standard.

I was unaware of this, thanks for the link!

```

A possible implementation of combine can be found [in the boost source
code][boost-combine].
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That function appears to be part of boost functional. I couldn't find anything on that page that indicated that that function was secure against DOS attacks. It doesn't look like it would be particularly resistant to any attacks. Did I miss something where the boost developers discussed the security implications of that function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know that assessment is correct. I did not find much information in Boost about DoS attacks and hashing with hashmaps.

@DaGenix
Copy link

DaGenix commented Feb 11, 2015

FYI - It looks like Java worked around the DOS issue by turning any HashMap bucket with too many items into a balanced tree: http://openjdk.java.net/jeps/180

@DaGenix
Copy link

DaGenix commented Feb 11, 2015

The implementation of the hash() function in Java tends to be exceptionally weak, although quite speedy. If the goal is ergonomics and speed, it seems like the main proposal is solid. However, if security by default is a goal, it seems to me that that proposal makes things difficult.

The key to DOSing a HashMap is to find different values that all hash to the same bucket. The key to defending it is to make sure that an adversary can't do that. In order to make sure that an adversary can't do that, you need to introduce something into the calculation of the hash that they can't guess - for example, a large random seed affecting all input values and a secure hash function.

The key to use cryptography properly is to use well designed, peer reviewed algorithms and then use them as intended. The primary design described here involves calculating the hash of each field and then combining them with the combine() function. Each field is calculated independently. The hash() method and the combine() function are more or less stateless - they can reference global variables, but they can't depend on the number of elements that were hashed previously. I don't know of any hash function designed to work like this. This design is fast and ergonomic, but, unless someone can find a secure hash function that was designed to work in this mode, I don't see how it can be said that this provides for DOS protection.

@nikomatsakis
Copy link
Contributor

@alexcrichton thanks for the clear writeup. I too tend to prefer the alternate definition at this time, but I could be swayed that it is overly generic. One question I had right off the bat -- you stated explicitly that the "byte-oriented" API in use today was known to be a performance hazard for many hashes, which might prefer to consume a region of memory. As such, would it make sense to have methods for writing slices of the primitive types as well?

@nikomatsakis
Copy link
Contributor

@alexcrichton I am confused by the numbers in your gist. In particular, this section of the table:

...
test hash3::sip::str_long ... bench: 7113 ns/iter (+/- 95)
test hash3::sip::str_medium ... bench: 370 ns/iter (+/- 6)
test hash3::sip::str_small ... bench: 19 ns/iter (+/- 0)
test hash3::sip::u64 ... bench: 19 ns/iter (+/- 0)
test hash3::u64::str_long ... bench: 13678 ns/iter (+/- 729)
test hash3::u64::str_medium ... bench: 673 ns/iter (+/- 23)
test hash3::u64::str_small ... bench: 3 ns/iter (+/- 0)
test hash3::u64::u64 ... bench: 2 ns/iter (+/- 0)

what is hash3::sip vs hash3::u64? I don't see those modules in the source code above?

The revision I was looking at is https://gist.github.com/alexcrichton/c4ea4935c77c348e1465/104f1907537dca4d8b958caa35eddf3cc6d06011

@alexcrichton
Copy link
Member Author

@DaGenix

This design is fast and ergonomic, but, unless someone can find a secure hash function that was designed to work in this mode, I don't see how it can be said that this provides for DOS protection.

To be clear, do you think the mitigation strategies are not sufficient? I was under the impression, for example, that at least using a randomly keyed SipHash for strings would buy us a good bit, but your point about SipHash collisions may also nullify that.


@nikomatsakis

As such, would it make sense to have methods for writing slices of the primitive types as well?

I agree that this would be nice (@aturon mentioned this to me as well in our discussions), but the problem is that today if you take T: Int then you can't actually get the bytes out, so we'd have to add a whole new array of methods for writing slices of primitives. They could all be default methods in terms of writing a byte slice, but it would be an unfortunate blowup of methods (in my opinion).

what is hash3::sip vs hash3::u64? I don't see those modules in the source code above?

Oh I'm sorry about that, I forgot to update the code in the gist. I've been adding some more things to benchmark over time. The gist should now be updated: https://gist.github.com/alexcrichton/c4ea4935c77c348e1465.

Also the benchmarks' absolute numbers may want to be taken with a grain of salt, @gankro and @seanmonstar had the idea of benchmarking against C implementations and the numbers show Rust as up to 2x slower than the C implementations. I believe @huonw, however, was looking into this and found the performance pitfall fairly surmountable.

@nikomatsakis
Copy link
Contributor

@alexcrichton honestly, just one additional method for [u8] (as seems to be present in your gist?) would probably be enough. Other chunks of data (e.g., [u64]) can always be transmuted if desired.

@alexcrichton
Copy link
Member Author

@nikomatsakis oh in that case the alternative API is indeed based on a write(&mut self, &[u8]) method which is the crux of the other write_foo primitive methods.

@nikomatsakis
Copy link
Contributor

@alexcrichton what I take away from those benchmarks is:

  1. the simple protocol is faster for tiny amounts of data like a u64
  2. siphash indeed performs much better if it has a big block of data to work with (and I imagine the same is true for combine)
  3. the alternative protocol can be as fast as the simple protocol if it does not use sip

Do you agree with those conclusions?

@nikomatsakis
Copy link
Contributor

It's also unclear to me how valuable the associated output type is in the alternative protocol. That seems like it would make working with generic code just a bit more tedious than the simple protocol, but without necessarily much gain or generality?

@Gankra
Copy link
Contributor

Gankra commented Feb 11, 2015

We would like to eventually have HashMap parameterized over hash size so that you can store only 32-bit or 16-bit hashes for "known small (read: not catastrophically huge for 32-bit)" hashmaps, yielding better cache/space usage. This of course doesn't necessitate an associated type on the hasher; one can just truncate an unconditionally 64-bit hash. For a secure hash function this should be as good as directly producing 32 bits. For an insecure hash function this could be catastrophic (e.g. if your hash function for a u64 is the identity function, all numbers < 2^32 will hash to 0).

@Gankra
Copy link
Contributor

Gankra commented Feb 11, 2015

In fact I'd go so far as to argue for 32-bit by default, since you need gigabyte-sized hashmaps to really start running into trouble (with a good hash fn, and even with us currently "wasting" a bit), and DOS attacks notably are based on small inputs.

@alexcrichton
Copy link
Member Author

@nikomatsakis

the alternative protocol can be as fast as the simple protocol if it does not use sip

I think it depends on the algorithm you implement, but I'm not 100% sure that this is always the case. For example the alternative protocol requires that each implementation of Hasher can work incrementally by being able to feed in data at any time. This API contract may cause some slowdown occasionally.

For example our current implementation of SipHash works in this incremental fashion, but the C implementations benchmarked are not incremental (operate on an entire block at once). The C versions are much faster, although this is not 100% because we operate incrementally, but I would suspect that there is at least some overhead for handling incremental state even if we optimized our implementation.

Do you agree with those conclusions?

In general though, yes, I believe your conclusions are accurate. The alternate protocol does vary quite a bit in speed, however, depending on what you're hashing. For example the "fast algorithm" for u64 is quite slow for str whereas for SipHash it is the opposite.

It's also unclear to me how valuable the associated output type is in the alternative protocol. That seems like it would make working with generic code just a bit more tedious than the simple protocol, but without necessarily much gain or generality?

I do think it depends on what you're doing when working with a hasher. If, for example, you want to operate generically over all hashers that produce a type T, then it is somewhat painful (e.g. this is what HashMap does today). If a function does not care about the output type, however, and it just ferries it along then the usage is pretty ergnomic as references to H::Output aren't so bad.

Overall I've found Hash<H> to be far more painful than H::Output so I elected to keep at least that amount of generality. I would not be opposed, however, to making a decision on what the output should be. I, for example, cannot imagine a case currently that is not producing some scalar value that fits in a u64 (or smaller) which is suitable for a HashMap, for example.


@gankro

In fact I'd go so far as to argue for 32-bit by default,

Interesting! I think I need to research some of the reasons why C++ chose size_t as the output for the hash function and see if there's a good rationale against always using 32-bits or not.

@Gankra
Copy link
Contributor

Gankra commented Feb 11, 2015

Java, by contrast, uses i32. That said, hashCode it's a pretty ancient API from before 64-bit was even a thing.

I don't think 32-bits should always be used, I just think it should be the default. Once you're at gigabyte-tier hashmaps, you need to start thinking hard about your data structure tuning, and opting into 64-bit mode should be a trivial tweak.

This is especially important because basically all of the hashmap's search time is spent doing a linear search on an array of hashes. Reducing the element size in half could do wonders for cache efficiency. Although it may be the case that average search lengths are too short to see real benefits.

@nikomatsakis
Copy link
Contributor

@gankro I think Java made a number of mistakes where they chose 32-bit prematurely (cough Array.length cough)

@DaGenix
Copy link

DaGenix commented Feb 12, 2015

@alexcrichton I'm not qualified to evaluate a non-standard hashing scheme. All I can do is look at the scheme and either say "this looks identical to something peer reviewed and designed for that purpose" or "this looks like something new to me". In this case, this scheme looks to me like it falls into the 2nd category.

Currently, if you have a struct with two fields, A and B, the hash of the struct would be defined as siphash(A || B). Lets assume that we keep using SipHash to hash individual fields under this proposal. If we do that, the hash of the struct under this proposal becomes combine(siphash(A), siphash(B)). I believe, what this means is that we're creating a fully new hash function. Its using some standard parts with known security guarantees (SipHash), but we've added in a new component in the combine() function.

The definition of the combine() function isn't spelled out in the proposal, but I think it would need to be in order to evaluate the security of the proposal. As an example, lets look at some potential implementations for combine(). The most trivial case: the combine() function always returns 0. In that case, the method is clearly not resistant to DOS attacks. A slightly more interesting example: combine() just XORs its two arguments together. Again, this isn't secure since in that case: combine(siphash(A), siphash(B)) would always be 0 if A == B. My point is: the combine() function can be specified in ways that are entirely insecure, so, I don't think that even an expert would be able to evaluate this proposal without a concrete proposal for the specification of the combine() function.

Of course, this proposal doesn't suggest either of the two definitions for combine() above. It suggests that we could potentially use the definition from boost::functional:

template <typename SizeT>
inline void hash_combine_impl(SizeT& seed, SizeT value)
{
    seed ^= value + 0x9e3779b9 + (seed<<6) + (seed>>2);
}

Is this secure? I certainly don't know how to break it. Lets assume it is secure. What we have here, is a compression function. Compression functions are the basis of many cryptographic hashes such as MD5, SHA1, and the SHA2 family using the the Merkle–Damgård construction. The potential definition of the combine() function above is way faster than the compression functions used by any of those hashes, however. If this definition of combine() were cryptographically strong, I would expect that something like SHA2 would be using it (or something similar). Instead, SHA2 defines a much more complex, much slower compression function. I would reason that SHA2 does this because the above definition of combine() isn't secure. Not exactly a proof, but, something to think about.

Another thing to consider: every cryptographic hash function that I know of takes the length of the input into account. The proposal wouldn't, however.

So, as I said, I'm not qualified to say if this is secure or not. However, for all the reasons above, I do think that this is doing something new and different and that even if SipHash is used as the hash for strings, I don't think that the security proofs for SipHash apply to this new construction.

I don't really know where that leaves us for claiming protection against DOS attacks. If we were using SipHash (as is currently the case), due to all the work around SipHash, I think its pretty reasonable to say that Rust is also resistant to those attacks. If we're doing something different, though, I don't know how we say that we are without having someone qualified to evaluate the scheme do so.

@tarcieri
Copy link

I think periodically rotating the global key in a running program has no value. If the key has been disclosed, all HashMaps are poisoned at that point. We have to make a new key and start over. If you start lazily rekeying them, you still have poisoned ones sitting around that are exploitable.

The simplest solution to the problem is rebooting the app and having it grab a brand new key from the OS.

Also, in general, when it comes to security, less complexity is better. I think these sort of hardening attempts should be introduced in response to a real world problem, as needlessly adding complexity often causes more problems than it solves.

tl;dr: use one global SipHash key obtained from OS's secure RNG for hashDoS resistance. Don't use HashMap-specific keys. Don't use a userspace CSPRNG. One global key should be fine.

@nikomatsakis
Copy link
Contributor

@gankro I was wondering about what you said regarding the downside of hardcoding to u64. I think you were saying that this would imply that we must select arbitrarily some numbers of bits to use, if we want to have a hashtable that only uses u32 as its hash selector. This seems to be true -- but then, isn't this always true, since we must use the hash to select a bucket from the (much smaller) list of buckets?

@nikomatsakis
Copy link
Contributor

I was also debating the question of whether we should make the hasher fn parametric or the trait. If you did want to have an object type that was hashable, making the hasher fn parametric certainly makes that harder -- at least if you want that object to respect the hasher the hashtable is using. You could do it by passing in a &mut Hasher object, since Hasher (I think) is object safe. But that seems like it'd be awfully slow since you'd be doing virtual calls for each of those words of data. I think what you would really do is to use something like the (now) alternative design for objects, where they just hardcode a hash fn and return a u64.

OTOH, if we made the hasher trait itself parametric, then you can have an object of type Hash<H> for some H, so it can fully participate in the hashtable. However, the ergonomic penalty that @alexcrichton cites seems pretty substantial. It's not so much implementing the hasher trait as it is that one cannot just have a fn parametrized over any "hashable" thing, but you must rather parametrize over a hashable thing and a hash fn.

Overall, I suspect the RFC is making the right call here, just wanted to have this tradeoff fully described (it is alluded to in the RFC).

@Gankra
Copy link
Contributor

Gankra commented Feb 13, 2015

@nikomatsakis So an important thing to keep in mind is soft and hard hash collisions (terms I am making up for this discussion). A soft collision occurs when two hashes happen to be equal when truncated to x bits. A hard collision occurs when two hashes are genuinely equal. Note that a hard collision implies a soft collision for any truncation method.

Our current design looks like:

let bucket = truncate(hash);
loop {
  if hashes[bucket] == hash && keys[bucket] == key {
    return true;
  } else {
    bucket += 1;
    // figure out if we should stop
  }
}

Ideally we don't even get soft collisions. Everything goes into a bucket and everyone's happy. However, if a soft collision occurs, we fast-path search by comparing on hard collisions. We only need to compare keys on a hard collision, which for 64-bit (really 63-bit) good hash functions, is basically guaranteed to only occur on the element we want. So our access pattern looks like [hash, hash, hash, hash, hash, key] if an element is contained, and [hash, hash, hash, hash, hash] if it's not. If, however, there were more hard collisions, our access pattern would look like [hash, key, hash, key, hash, key, hash, key] regardless of containment. This is worse for the cache, and simply doing more work to do Eq calls.

If we have a bad hash function that provides only hard collision guarantees (like e.g. a u64 hashes to itself), and then truncate in our storage, we've converted all of the soft collisions at that truncation into hard collisions.

This isn't catastrophic, of course. Just slow. Also it means that growing will never relieve the collision pressure, because that only resolves soft collisions.

@nikomatsakis
Copy link
Contributor

@gankro Everything you're saying makes sense, but I'm not sure how much it matters in practice. In particular, we do truncate to fit to the number of buckets, and that already favors hash functions that place their entropy in the low-order bits (I assume), so it doesn't seem awful to me to favor it some more.

That said, on purely ergonomic grounds, it seems to me superior to keep the associated type, because it means that HashMap takes only one "weird" type parameter (the Hasher) rather than having to separately take the Hasher and the hash size. And one can trivially wrap a hasher to produce a "truncating" hasher. This also gives the user control over where to "concentrate" the entropy.

So a further 👍 to the design as written.


* Due to slices being able to be hashed generically, each byte will be written
individually to the `Sha1` state, which is likely to not be very efficient.
* Due to slices being able to be hashed generically, the length of the slice is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an alternate approach to this, similar to how Haskell makes its strings print as strings despite just being normal lists of characters.

trait Hash {
     fn hash<H: Hasher>(&self, h: &mut H);
     fn hash_slice<H: Hasher>(self_: &[Self], h: &mut H) {
          for x in self_ { x.hash(h) }
     }
}

impl Hash for u8 {
     fn hash<H: Hasher>(&self, h: &mut H) { h.write_u8(*self) } 

     fn hash_slice<H: Hasher>(self_: &[u8], h: &mut H) { h.write(self_) }
}

impl<'a, T: Hash> Hash for &'a [T] {
     fn hash<H: Hasher>(&self, h: &mut H) {
          Hash::hash_slice(*self, h);
     }
}

This allows any sequence of contiguous things to ensure they're hashed quickly.

Downside: a custom implementation of hash_slice may change behaviour, for some hashing algorithms.

@nikomatsakis
Copy link
Contributor

This was discussed some on IRC, but just for the record: one problem with what @huonw suggests is that, without extending the Hasher trait in some way, adding this method to Hash doesn't actually allow the underlying implementation to be made faster, since the calls must still boil down to invocations of the same set of primitive Hasher methods.

@alexcrichton
Copy link
Member Author

Hm, I think I may actually lean towards @huonw's suggestion, but there's a clarification point I'd like to make.

First, I do think that adding hash_slice to the Hasher trait will improve performance quite a bit. Today generically hashing &[T] is almost always quite slow due to the one-element-at-a-time hashing strategy. By allowing slices of POD-like data to just cast itself to a &[u8] it may be quite beneficial in terms of performance. For example &[u8] will translate to one direct call to write instead of N calls to write_u8. @nikomatsakis is right though in that it doesn't make the underlying implementation faster, it just allows it to operate on larger blocks of data at once (which is indeed often faster than bits and pieces).

One part that would be nice to do, however, is to translate hashing &[u64] to a call to write. This brings up problems of endianness, though. Today we translate all scalars to little endian before calling write to ensure a "constant hash across platforms". I would like to propose that we remove this behavior and instead allow ourselves to cast &[u64] to &[u8] to pass to write (similarly for all other scalar types).

With respect to a particular hash value, I think that it is OK to say that the hash value (even using the same algorithm) is only guaranteed to be precisely the same within the same process. Once you cross machines I think it's probably fair to say that the hash value of our primitives are subject to change (even using the same algorithm). This would allow for the endianness change above, and it would also segue nicely into modifying hash_slice to do something nonstandard. It basically means that the value returned from hashing a slice is not guaranteed across processes (or in this case, types), but it is guaranteed to be the same within one process for one particular type.

So all in all I'm in favor of @huonw's extension, although I'd like to confirm the various tweaks to semantics here and there.

@aturon
Copy link
Member

aturon commented Feb 17, 2015

@alexcrichton

First, I do think that adding hash_slice to the Hasher trait will improve performance quite a bit.

So, this is the same confusion we were having on IRC last night: the proposal is not to add it to Hasher, but to Hash.

For example &[u8] will translate to one direct call to write instead of N calls to write_u8.

I'm a bit confused here (perhaps related to the above confusion): what exactly does it call write with?

With respect to a particular hash value, I think that it is OK to say that the hash value (even using the same algorithm) is only guaranteed to be precisely the same within the same process.

Given the use cases you've outlined for this trait, that seems quite reasonable.

@alexcrichton
Copy link
Member Author

@aturon

I'm a bit confused here (perhaps related to the above confusion): what exactly does it call write with?

With @huonw's sketch the Hash implementation for &[T] would delegate to Hash::hash_slice and the Hash implementation for u8 would call write for the slice case. This is the performance improvement I was alluding to previously as well.

It ends up being a kinda weird series of indirections (slice => hash_slice => write) but the idea is to allow the implementation of Hash for &[T] to be specialized for each T by moving the implementation to each implementation of the Hash trait itself.

@Gankra
Copy link
Contributor

Gankra commented Feb 17, 2015

This kind of went past my head last week, but it just occurred to me that some of this discussion has been conflating hash coding and hashing. The Java scheme of having a hashCode() method is different from having an element produce its hash.

For a hash, you want:

  • x = y -> hash(x) = hash(y)
  • x != y -> bits of hash(x) uncorrelated to bits of hash(y)

This gives you all the truncation properties that you want. For hash coding you just need:

  • x = y -> hashcode(x) = hashcode(y)
  • x != y -> hashcode(x) probably != hashcode(y)

Hash codes are basically a weak hash function that the hashmap can further hash to get one with nice truncation properties. This works great for <=32-bit integers, hashcode(x) = x, and then the map can hash it to be useful. You can also theoretically aggregate values in a faster way than running siphash on the whole dataset, because you have weaker theoretical needs than a proper hash function. See http://opendatastructures.org/ods-python/5_3_Hash_Codes.html for details.

Unfortunately I'm not well versed in the actual reality of how hash coding works out. Are the theoretical perf gains realizable in practice? If you realize the perf, are you guarding against DOS attacks? If you can cause a collision in the hash codes, you've caused a collision in the hashes -- hence the classic hashcode(pair) = pair.0 XOR pair.1 error. If you don't care about DOS, is it worth it?

@shepmaster
Copy link
Member

By allowing slices of POD-like data to just cast itself to a &[u8]

I remember some of the talk about faster hashing mentions the fact that we may not be able to treat a &[T] as one contiguous chunk of u8, as the padding bytes have undefined value. Is that not a problem with this proposal?

> choice][cpp-hash] here as well, but it is quite easy to use one instead of
> the other.

## Hashing algorithm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading through this again, this is what the lingering hashing-vs-hash-coding feeling was triggered by. The Java API is a hash coding one, not a hashing one. Presumably the hashmap will hash the hash codes yielded. The C++ design, meanwhile is ambiguous to me. Its description sounds like hashing, but its guarantees look like hash coding. (refering to this API)

`read` using various techniques to prevent memory safety issues. A DoS attack
against a hash map is such a common and well known exploit, however, that this
RFC considers it critical to consider the design of `Hash` and its relationship
with `HashMap`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact I believe it is the opposite: We should guard against DoS attacks by default because it is obscure. Everyone knows about signing and password hashing, but HashMap DoS is obscure. I provide this simple cost-benefit analysis:

  • If the user is aware of DoS attacks and their problem domain, they can override whatever default to get the behaviour they need/want.
  • If the user is unaware and the default is secure, then at worst their application runs a bit slow.
  • If the user is unaware and the default is insecure, then at worst their application gets taken down and costs big $$$.

@Gankra
Copy link
Contributor

Gankra commented Feb 17, 2015

This may be out of scope for the RFC, but I'd just like to leave this here regardless:

I would be interested in the standard library exporting two types: SecureHasher and FastHasher. SecureHasher will protect you from DoS using industry standard techniques (random seeds, strong hash functions), while FastHasher will provide fast, insecure, deterministic hashing.

It's a bit awkward that today we provide Sip as a secure randomized default, with nowhere to point for "what if I want to go fast or deterministic" (there's some stuff in crates.io, but I don't know what their maintenance story is). I'm not sure if we should further newtype SecureHasher as DefaultHasher, or if HashMap should strictly default to SecureHasher.

Doing the full newtyping theoretically lets us change the default to FastHasher, but I'd argue that this is really a breaking change to do from a security perspective. Explicitly stating the default is "Secure" in the HashMap type declaration also hints to users that maybe they want to consider overriding the default (because they don't need security), while also warning that overriding the default is potentially dangerous.

@Gankra
Copy link
Contributor

Gankra commented Feb 17, 2015

Replacing Secure with Fast would be like changing a standard "sign" API that uses SHA2 to use MD5. Technically valid, unlikely to affect most, but horribly catastrophic in some cases if gone unnoticed. (ignoring compatibility issues)

@tarcieri
Copy link

MD5 is completely broken and there is absolutely no reason it should be used. If you don't need security guarantees and only want performance, use something like CRC32/64. If you want a fast cryptographically secure hash function, use Blake2(b)

@alexcrichton
Copy link
Member Author

@shepmaster

By allowing slices of POD-like data to just cast itself to a &[u8]

I remember some of the talk about faster hashing mentions the fact that we may not be able to treat a &[T] as one contiguous chunk of u8, as the padding bytes have undefined value. Is that not a problem with this proposal?

For primitives like &[u64] we're guaranteed that there is no padding, and it's true that &[T] where T: Copy is not generally convertible to one call to write for this reason. Each implementation of Hash for T though can select whether it's possible (if it knows the padding) and it will always require an unsafe block to do so (even for the primitives).


@gankro

I would be interested in the standard library exporting two types: SecureHasher and FastHasher.

For now my plan is to expose the SipHasher type in std::hash to continue to be available for use as it is today. We could move in FnvHasher if we want and perhaps add some more, but I think that these sorts of additions would be backwards-compatible to add in the future (so long as we don't want to remove the existing types).

One of the benefits of moving the H: Hasher to the method instead of the trait is that the HashMap does not actually declare what hash algorithm it uses. It will be an implementation detail that we use SipHasher.

@aturon
Copy link
Member

aturon commented Feb 17, 2015

Thanks everyone for a fantastic discussion here! There's been a clear consensus for a while now about the basic thrust of this RFC; some details like hash_slice can be added as a performance optimization later, since they involve defaulted methods. We've decided to merge the RFC, and will try to bring std::hash in line with it before alpha2 if possible.

Tracking issue

@erickt
Copy link

erickt commented Feb 24, 2015

Ack! I'm sad I missed out on this. Overall I agree with this discussion. Just to be clear though, is the accepted design the one mentioned in the detailed design? Many people spoke of liking the alternative design, but I believe those comments were made before the alternatives section was added to the RFC.

@alexcrichton: Did you do any benchmarks on treating non-slice POD types as a &[u8]? It's possible llvm could optimize away us casting a type like u64 into a &[u8] and hashing that slice. If it's identical performance, we could just replace Hash::hash with this Hash::hash_slice method.

@alexcrichton
Copy link
Member Author

@erickt yes the "Detailed design" section is the current implementation. I'm not quite sure I follow with what you're looking to benchmark though, what do you mean by casting a type like u64 to &[u8]?

@erickt
Copy link

erickt commented Feb 24, 2015

@alexcrichton: I should have said "transmute POD types into a fixed sized slice". For example:

use std::mem;

fn into_fixed_sized_slice(value: u64) -> [u8; 8] {
    unsafe {
        mem::transmute(value)
    }
}

fn hash(slice: &[u8]) -> u64 {
    // super secure hash
    let mut hash = 0;
    for value in slice {
        hash += *value as u64;
    }
    hash
}

fn main() {
    let value = 12345;
    let hash = hash(&into_fixed_sized_slice(value));
    println!("{:?}", hash);
}

@alexcrichton
Copy link
Member Author

Ah in that case I think that it's trough to benchmark because it basically depends on the hashing algorithm to see whether LLVM can optimize or not. I doubt, however, that just overwriting write will be able to optimize well enough for all cases because you need to handle the incremental state and possibly receiving 7 bytes when you wanted 8. This is why we have exposed write_foo helpers on the Hasher trait though.

I suppose it really boils down to optimizing one particular hashing algorithm. If you have one in mind, I'm sure we could poke around at it!

@Stargateur
Copy link

Stargateur commented Jan 4, 2019

Hello, is there anything that guarantee the behavior of hash_slice() ? Look like there is not clearly state. Implemented behavior or defined behavior are both good to me, just if someone can confirm it, it would be nice.

Related: rust hash() and hasher.write() results are not the same

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.