Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adaptive hashing #1796

Closed
wants to merge 4 commits into from
Closed

Adaptive hashing #1796

wants to merge 4 commits into from

Conversation

pczarn
Copy link

@pczarn pczarn commented Nov 20, 2016

@seanmonstar
Copy link
Contributor

I just wanted to say that this RFC was a delight to read. I'm not familiar with robin hood hashing, and so the part on choosing constants was foreign to me, but regardless, I felt the RFC made it real easy to see the motivation, benefits, and implementation. Thanks!

@burdges
Copy link

burdges commented Nov 22, 2016

About this complexity : One could iamgine a two tier hash map infrastructure, the bottom tier implements the hash map but just "throws complaints" if it "gets upset" about probing length or whatever, while the the upper tier acts as "combinators" that can decide to rebuild with a new hash function and/or hash table. That's not simpler, but it's allow more tweaks, including switching to another hash table even.

## Choosing hash functions

For hashing integers, the best choice is a mixer similar to the one used in SipHash’s finalizer. For
strings and slices of integers, we will use FarmHash. (The Hasher trait must allow one-shot hashing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No design that I know of has resolved the issue that we run into with one shot hashing, even for a few select types: custom types that define Hash exactly like strings/slices do.

creating a HashMap from a given list of keys:

```rust
fn make_map(keys: Vec<usize>) => HashMap<usize> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo on =>

the probability of having a run length of more than 640 buckets may be higher than the probability
we want, it should be low enough.

<img width="600" src="https://cdn.rawgit.com/pczarn/code/d62cd067ca84ff049ef196aa1b7773d67b4189d4/rust/robinhood/lookup_cost.png">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These might want to get encoded as base64-encoded data URIs, so the RFC is self-contained.

## For the HashMap API

One day, we may want to hash HashMaps. The hashing infrastructure can be changed to allow it. The
implementation of Hash for HashMap can hash the hashes stored in every map, rather than However,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than what?

## For the order of iteration

Currently, HashMaps have nondeterministic order of iteration by default. This is seen as a good
thing, because programmers that test programs won’t learn to rely on a fixed iteration order.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: “programmers that test programs” makes little grammatical sense. Maybe:

because programmers then ensure programs do not rely on a fixed iteration order

- We can restrict adaptive hashing to integer keys. With this limitation, we don't need Farmhash in
the standard library.
- We can use some other fast one-shot hasher instead of Farmhash.
- We can add use an additional fast hash function for fast streaming hashing. The improvement would
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: “add use” maybe should be either “add” or “use”.

@sfackler
Copy link
Member

sfackler commented Nov 22, 2016

Could you expand on what the APIs will look like here? I'm guessing we'd add second hasher type parameter like struct HashMap<K, V, H = RandomState, F = FastState>, and deprecate HashMap::with_hasher in favor of a new version that takes both hashers?

I also agree with @bluss that supporting one off hashing is very much not a thing that should be mentioned off hand as a thing that is required before this RFC can be implemented.

@burdges
Copy link

burdges commented Nov 22, 2016

There are a bunch of subtle concerns here like :

Is farmhash problematic for certain embedded target due to it's large code size? It's easy to switch hash functions, but you might not want this two hash function version in that setting either.

Is this extra table table rebuild problematic for some scenarios? At present, you avoid any rebuilds by specifying the capacity correctly, but now you must disable this switching behavior too.

If you're switching hash function, then maybe you should switch from robin hood hashing to cuckoo hashing too? If (K,V) is less than like a sixth of a cache line, then you expect robin hood to be considerably faster, due to cuckoo causing a second cache miss for fetching the distant bucket. Yet, the cuckoo hash's second independent hash function sounds pretty nice if you're already attempting to resist a DoS attack. Also, the cuckoo hash should leak less timing information, due both to multi-element buckets and evicted elements being more independent, so your hash keys seem more secure.

@pczarn
Copy link
Author

pczarn commented Nov 27, 2016

@sfackler The signature of HashMap will stay the same (struct HashMap<K, V, H = RandomState>), with RandomState handling both fast and safe hashing. This is backwards-compatible, because RandomState is defined under std::collections::hash_map, and it does not guarantee the use of specific hash function(s).

@burdges I don't think having two tiers make sense. There is no other, better implementation I can imagine.

Is farmhash problematic for certain embedded target due to it's large code size? It's easy to switch hash functions, but you might not want this two hash function version in that setting either.

As far as I know, people writing for embedded targets already remember to optimize for code size, and changing the default is easy. There should be a guide for embedded programming in Rust, where optimization is mentioned. SipHash alone adds some code size, because its 1+3 rounds are inlined for speed, and there are few 64-bit embedded systems, so it's not good for them. We need measurements if you think code size is a priority.

Is this extra table table rebuild problematic for some scenarios? At present, you avoid any rebuilds by specifying the capacity correctly, but now you must disable this switching behavior too.

Excellent question. Adaptive hashing may be a problem for realtime systems. I suggest changing HashMap's load factor to 85%, and increasing the threshold for run length by a small number, so that having any reallocations is less likely. I hope changing the hasher or getting HashMap from an external crate is easy enough in case you need a 100% guarantee of cheap insertion.

Changing the load factor is proposed in rust-lang/rust#38003

One more thing: in the RFC, the probability distribution of run lengths is wrong. I have an idea how to fix the math. Soon, I'll make other corrections to the RFC too.

@sfackler
Copy link
Member

So is BuildHasher changing then?

@pczarn
Copy link
Author

pczarn commented Nov 27, 2016

@sfackler No, but DefaultHasher is changing.

@sfackler
Copy link
Member

Ok, so could you write down what is actually changing in what way, since it is apparently not clear from the RFC as written?

@burdges
Copy link

burdges commented Nov 27, 2016

I'd suggest adding an alternative like :

Abstract the HashMap and HashSet interface as a trait so that code can more easily be built to be generic over the particular HashMap implementation. Implement both the current single and this two hash function based hash map, but leave the default choice as unstable for now.

I'd say the draw back is it's kinda boost-ish to do this and involves lots more discussion over what should be in this new trait vs what should be specific to implementations.

@sfackler
Copy link
Member

The kinds of traits necessary to cover set and map functionality aren't definable in a reasonable way without HKT.

@arthurprs
Copy link

arthurprs commented Nov 27, 2016

With the early resizing this indirectly helps with #36481, right?

@nrc nrc added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Nov 27, 2016
@funny-falcon
Copy link

There is no need for Adaptive hashing for "simple keys", cause mix function from this rfc could easily be extended to strong enough seeded variant. If we have 128bit seed, then we can simply apply "Even-Mansour scheme":

self.hash = mix(msg_data ^ seed.k0) ^ seed.k1;

But Adaptive hashing still could be useful for complex keys, so this rfc needs fast hash function for complex keys.

contain-rs/hashmap2#5 (comment)
contain-rs/hashmap2#5 (comment)

@ticki
Copy link
Contributor

ticki commented Nov 28, 2016

On the choice of fast hash function: I don't think FarmHash is the right choice. I would recommend something that exploits ILP like xxHash or SeaHash (disclaimer: I'm the author).

@funny-falcon
Copy link

Why fo you think FarmHash doesn't exploit ILP ? Your SeaHash doesn't differs much from xxHash or MetroHash. I'm not saying SeaHash is bad. It is just not unique.

What if I say, almost same performance could be achieved without multiplication?

What if I say, first a^=a>>32 cancels third one from previous block, ie third is excess and makes interblock dependecy worse?

And if you doesn't bother for hash flood attack, then there is no need in second per-block multiplication ( and shift), so your function could be twice faster.
(But then you need finalizer step: just add those removed multiplication and shifts).

@ticki
Copy link
Contributor

ticki commented Nov 28, 2016

@funny-falcon

Why fo you think FarmHash doesn't exploit ILP ?

It does, but only partially. It is not as optimized as it could be.

Your SeaHash doesn't differs much from xxHash or MetroHash.

Right, the differences are minor, but it is still able to beat both in performance.

I'm not saying SeaHash is bad. It is just not unique.

I never claimed it was ;). It is a fairly standard highly-optimized DM construction.

What if I say, almost same performance could be achieved without multiplication?

You can't. You need to look at the avalanche diagram.

What if I say, first a^=a>>32 cancels third one from previous block, ie third is excess and makes interblock dependecy worse?

It doesn't. There is a multiplication between them.

And if you doesn't bother for hash flood attack, then there is no need in second per-block multiplication ( and shift), so your function could be twice faster.

The diffusion function could, yes, but you have destroyed all of its statistical properties. The comments in the code contains some fairly detailed information about why I chose to do it this way. Essentially I need to move entropy up, then down, then up again.

I wrote about the design here: http://ticki.github.io/blog/designing-a-good-non-cryptographic-hash-function/

@ticki
Copy link
Contributor

ticki commented Nov 28, 2016

Here's the construction you're describing:

That's bad. Flipping high bits won't flip lower ones.

So, you could ask, why do quality matter for adaptive hashing. It does because you don't want to accidentally switch to a secure hash function, because there is some unexpected bias in the output, and when you lose entropy every round, it's pretty bad.

@funny-falcon
Copy link

What if I say, almost same performance could be achieved without multiplication?
You can't. You need to look at the avalanche diagram.

Yes, I can. I've already did it. I achieve 3.3 byte per clock without multiplication, while best multiplicative achieves 4.4 byte/clock. Even non-multiplucative SpookyHash achieves 4 byte per cycle, but it uses 12 variables for state, and I use 4 variables (and no hidden temporary variables needed for xor-shift).

I didn't publish it yet, but if you want, I will.

It doesn't. There is a multiplication between them.

There is no multiplication between blocks. And first xor-shift from block N+1 cancels result of third xor-shift from block N.

The diffusion function could, yes, but you have destroyed all of its statistical properties.

There is no need in great statistical properties in interblock diffusion unless you concerned about security. But if you do, then you should add at least proper seeding.

But, yes, single multiplication is no ok. I take this suggestion back. Too easy to construct interblock collision.

@ticki
Copy link
Contributor

ticki commented Nov 28, 2016

Yes, I can. I've already did it. I achieve 3.3 byte per clock without multiplication, while best multiplicative achieves 4.4 byte/clock. Even non-multiplucative SpookyHash achieves 4 byte per cycle, but it uses 12 variables for state, and I use 4 variables (and no hidden temporary variables needed for xor-shift).

Those numbers are wrong. SpookyHash doesn't outperform SeaHash nor xxHash (I've benchmarked it), and it doesn't matter if you can remove a cycle or two. If you want you could replace the whole function by

 return 3;

and you would get excellent benchmarks™, but the matter of the fact is that quality is worse. When you remove the multiplication as you do, you also remove the BIC invariant as well as the avalanche criterion.

There is no multiplication between blocks. And first xor-shift from block N+1 cancels result of third xor-shift from block N.

Now, that's a true observation. It's an implementation mistake, because it shouldn't affect the result of it. It seems though that LLVM got me covered and removes the instruction. Still, it's bad code, so yeah.

@funny-falcon
Copy link

Btw look at non-assembler version of Golang hash function :
https://github.com/golang/go/blob/master/src/runtime/hash64.go

It also uses two multiplication per-block, but it uses single rotation instead of three xor-shift.
I bet: implemented in C or Rust it will be faster than SeaHash, and it will have same "statistical properties".
(It is mentioned "best multiplicative" hash function)

And it is seeded. In fact, I really believe that properly seeded Golang hash function ( and yes, SeaHash also) will be as safe as SipHash for usage in hash-table.

@ticki
Copy link
Contributor

ticki commented Nov 28, 2016

And it is seeded. In fact, I really believe that properly seeded Golang hash function ( and yes, SeaHash also) will be as safe as SipHash for usage in hash-table.

You're wrong here. It is trivial to break. I can generate collisions easily for both golang, xxHash and SeaHash.

@pczarn
Copy link
Author

pczarn commented Dec 17, 2016

@funny-falcon I read about the Even-Mansour scheme. It won't be secure with incremented keys. Currently, the seeds in two maps may differ by the least significant bit. With known iteration order for one map, an attacker could choose first N entries from that map and increment their keys. Then, constructing a map with these keys takes O(n^2) time. This O(n^2) blowup is similar to the one in rust-lang/rust#36481.

So there's a tradeoff: the scheme may work, but we must get the seeds from an RNG.

@arthurprs The early resize is fine by itself. I would rather keep the early resize restricted to maps with the default hasher. This way, the flag can be moved into RandomState. It just needs 2 special values of SipHash state.

@seanmonstar I'm glad you liked it.

@funny-falcon
Copy link

funny-falcon commented Dec 17, 2016

@pczarn instead of incrementing prng could be used.

xorshift generators could be taken to efficiently mix bits of state of any size (1 * 32bit, 2 * 32bit, 1 * 64bit or 2 * 64bit or even greater). http://xoroshiro.di.unimi.it/

Or use same mix function to transform from increment to prng. This scheme even already has a name:
SplitMix . Example: http://xoroshiro.di.unimi.it/splitmix64.c

@sfackler
Copy link
Member

I am still waiting on an update which concretely outlines how the public facing API is going to change:

  1. How do types identify themselves as one-shot hashable?
  2. How does a user configure the fast hasher as well as the slow hasher?
  3. What is the API for one-shot hashers? They are incompatible with the Hasher/Hash traits.

@arthurprs
Copy link

arthurprs commented Dec 22, 2016

I think the proposal work under the current public api. It's only implemented for the default hasher.

IMO, exposing it as a public api would be way too cumbersome.

@burdges
Copy link

burdges commented Dec 22, 2016

Another approach is simply providing an API to make the hashing more efficient by reducing the number of times it needs to be done.

Idea 1. I think the Entry API avoids repeated hashing, but borrows the table. One could imagine an Entry-like API that instead handles an index into the table, but avoids borrowing the table by storing an identifier for the table. Indexes can be invalidated by other operations, but only in one direction normally. An index would all be invalidated if the table was rebuilt due to the identifier changing.

Idea 2. Use a fast hasher in the table itself, but use a wrapper type for keys that adds a more secure hash value.

#[derive(Hash)]
struct PreHashed<T>(pub T,u64);

trait PreHasher : BuildHasher {
    fn prehashed<T: Hash>(v: T) -> WellHashed<T> {
        let mut h = self.build_hasher();
        h.input(v);
        WellHashed(v,h.finish())
    }
}

If you store these prehashed wrappers, then your slow hashing operations correspond 1-1 with user input, while you can speed up all the repeated hashing operations required by your algorithm.

In this vein, has anyone written this adaptive hashing scheme as a crate that provides a wrapper on HashMap?

@pczarn
Copy link
Author

pczarn commented Jan 3, 2017

I am still waiting on an update which concretely outlines how the public facing API is going to change

The public facing API is not going to change. The RFC does not propose any such API. The user can't configure anything. For example, think of specializing to_string for borrowed strings -- implementing a specialized behavior for a set of types does not need any changes to public facing APIs. I should update the RFC to explain this.

@aturon
Copy link
Member

aturon commented Feb 1, 2017

@pczarn Ping on updating the RFC?

@aturon
Copy link
Member

aturon commented Feb 1, 2017

@sfackler, just to check, does the WIP PR help clarify your remaining questions?

I'd love to move forward on this!

@sfackler
Copy link
Member

sfackler commented Feb 1, 2017

AFAIKT the WIP PR doesn't include the hashing changes, just an adjustment to the load factor if a long chain is encountered.

@burdges
Copy link

burdges commented Feb 1, 2017

If I understand, anyone wanting to avoid this complexity, just specifies their own BuildHasher, right?

Would this be implemented by making RandomState return an object similar to the AdaptiveHasher object in the contain-rs implementation linked from the initial comment? If so, that prevents anyone from seeding the initial weak hash function, right?

In that case, an attacker can reliably produce like 8 cache misses per query without setting long_probes = true, not much given network overhead, but still annoying if lots of queries can be packed into one packet.

I think you should consider using something like AdaptiveHasher = Result<WH,SipHash> where WH is FarmHash or whatever, and making BuildHasher provided by RandomState key even the initial WH. You could then build your specialization impls for HashMap on this Result type, so that a user can tune it with stronger initial options.

I suppose Result might not be a good choice here. Another option would be an AdaptiveHasher trait with an upgrade method. In any case, I'm proposing that whatever the method is it should be exposed, even if left unstable and feature gated.

bors added a commit to rust-lang/rust that referenced this pull request Feb 16, 2017
Adaptive hashmap implementation

All credits to @pczarn who wrote rust-lang/rfcs#1796 and contain-rs/hashmap2#5

 **Background**

Rust std lib hashmap puts a strong emphasis on security, we did some improvements in #37470 but in some very specific cases and for non-default hashers it's still vulnerable (see #36481).

This is a simplified version of rust-lang/rfcs#1796 proposal sans switching hashers on the fly and other things that require an RFC process and further decisions. I think this part has great potential by itself.

**Proposal**
This PR adds code checking for extra long probe and shifts lengths (see code comments and rust-lang/rfcs#1796 for details), when those are encountered the hashmap will grow (even if the capacity limit is not reached yet) _greatly_ attenuating the degenerate performance case.

We need a lower bound on the minimum occupancy that may trigger the early resize, otherwise in extreme cases it's possible to turn the CPU attack into a memory attack. The PR code puts that lower bound at half of the max occupancy (defined by ResizePolicy). This reduces the protection (it could potentially be exploited between 0-50% occupancy) but makes it completely safe.

**Drawbacks**

* May interact badly with poor hashers.  Maps using those may not use the desired capacity.
* It adds 2-3 branches to the common insert path, luckily those are highly predictable and there's room to shave some in future patches.
* May complicate exposure of ResizePolicy in the future as the constants are a function of the fill factor.

**Example**

Example code that exploit the exposure of iteration order and weak hasher.

```
const MERGE: usize = 10_000usize;
#[bench]
fn merge_dos(b: &mut Bencher) {
    let first_map: $hashmap<usize, usize, FnvBuilder> = (0..MERGE).map(|i| (i, i)).collect();
    let second_map: $hashmap<usize, usize, FnvBuilder> = (MERGE..MERGE * 2).map(|i| (i, i)).collect();
    b.iter(|| {
        let mut merged = first_map.clone();
        for (&k, &v) in &second_map {
            merged.insert(k, v);
        }
        ::test::black_box(merged);
    });
}
```

_91 is stdlib and _ad is patched (the end capacity in both cases is the same)

```
running 2 tests
test _91::merge_dos              ... bench:  47,311,843 ns/iter (+/- 2,040,302)
test _ad::merge_dos              ... bench:     599,099 ns/iter (+/- 83,270)
```
@pczarn
Copy link
Author

pczarn commented Feb 19, 2017

I couldn't find the right math to describe insertion cost, so I made measurements on a real implementation. The results are a bit disappointing. The smallest reasonable forward shift threshold is 1500. The chance of reaching it at random is almost 10^-13. Having such threshold means we allow hashmap construction to take 1500^2 / 2 operations due to just kilobytes of malicious input -- this could mean copying 32MB of memory for each 12KB of malicious input during hashmap construction. However, copying memory in a loop is cheap.

I suppose a threshold of 1500 at a load factor equal 0.833 is a decent choice. I'm going to change the RFC to propose it.

Also, we can have an additional check for early detection of O(n^2) blowup to improve performance of map merges. Whenever we detect an abnormally long chunk, we need to calculate the proportion of its length to the number of all entries in the map. If we see that the chunk is a major part of the map, we can switch to safer hashing.

extrapolated_insertion_cost_4

The code for drawing the chart is in the IJulia notebook. Measurements are implemented here: pczarn/hashmap2@0127ceb

@arthurprs
Copy link

arthurprs commented Feb 19, 2017

I spend the last hour or so running simulations (thanks for the gists) and I agree 1500 will work fairly well in practice (I got 8e-5 probability at 0.9 load factor). I missed your comment about the forward shifts math being wrong... so now I have to fix rust-lang/rust#38368

Edit: My results for 0.9 load factor can be seen here https://gist.github.com/arthurprs/09d3f39d4c8cbf211919dd40ad317d21

@pczarn
Copy link
Author

pczarn commented Mar 1, 2017

I updated the RFC. Sorry for the delay.

I think only one small change is yet to be done: I need to convert the state graph to plain ASCII.

@sfackler
Copy link
Member

I have a couple of problems with the approach here:

  1. Adaptive hashing is conflated with one-shot hashing. In my experience it is common to use simple composite types as the keys in a map, which would in the proposed setup be forced to use the "safe" hashing unconditionally. These are IMO orthogonal concepts and should be treated as such.
  2. The fact that the public API doesn't change at all both means that this is totally un-extensible, and the implementation is forced to hardcode special cased behavior to this specific hasher. This kind of approach is indicative that there is some API layer that needs to be surfaced to remove the code smell. For example, BuildHasher could have a defaulted method added to return a fast hasher, or a new type parameter could be added to HashMap for the fast hasher.

@aturon
Copy link
Member

aturon commented Apr 25, 2017

This RFC has been stalled for quite a while now. While we'd really like to see improvements along these lines, I'm going to propose to close for now, until the RFC is revised to take into account the various feedback given. Please feel free to reopen with a revision!

@rfcbot fcp close

@rfcbot
Copy link
Collaborator

rfcbot commented Apr 25, 2017

Team member @aturon has proposed to close this. The next step is review by the rest of the tagged teams:

No concerns currently listed.

Once these reviewers reach consensus, this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@sfackler
Copy link
Member

@kennytm what in particular is confusing about closing this?

@kennytm
Copy link
Member

kennytm commented Apr 29, 2017

@sfackler Why close, not postpone?

@sfackler
Copy link
Member

Postpone means "we want to do the thing proposed here, but not right now". We're happy to do something like what's proposed here whenever, but the specifics need to change.

@rfcbot
Copy link
Collaborator

rfcbot commented May 23, 2017

🔔 This is now entering its final comment period, as per the review above. 🔔

@rfcbot rfcbot added the final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. label May 23, 2017
@rfcbot
Copy link
Collaborator

rfcbot commented Jun 2, 2017

The final comment period is now complete.

@alexcrichton
Copy link
Member

Ok looks like no new conversation happened during FCP, so I'm going to close and tag as postponed. Thanks again for the RFC @pczarn!

@alexcrichton alexcrichton added the postponed RFCs that have been postponed and may be revisited at a later time. label Jun 3, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. postponed RFCs that have been postponed and may be revisited at a later time. T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.