-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reimplementation a Hash using open addressing #4557
Comments
Questions to be cleared:
If so, then single valuable design is that is adopted by PHP, Python and Ruby: "open-addressing hash array" consists indexes to entries array.
|
/cc @asterite |
Why is this and are there benchmarks to back it up? |
@akzhan , but Crystal's current hash functions are just awful for hash table with "power of two". Especially awful Float "identity" functions:
Every hash function is 'ok' if hash table is by prime numbers. But for "power of two" more work should be done. And, by the way, looks like Crystal is the only contemporary language of "application level" that doesn't fear of HashDos:
Is it intentionally? Is Crystal recommended for public interface, or is it for private tools only? |
@RX14 , what cool algorithms (like "robin-hood" hashing) for?
With ordered hash table implemented as "indices array + entries array" both first two properties are not preserved:
I agree that clever algorithms are useful. But it will be useful for UnorderedHash, if some will add it to Crystal. Ordered Hash table destroys much of gain from cleverness. |
I was sure we further hashed the @funny-falcon Why do we need an entries array instead of simply using a doubly linked list inside an |
|
OK, @funny-falcon - just to clarify, what do you recommend to implement following #4557 (comment)? Anyway object hash functions will be adjusted. |
Powers of two is not a dogma. It's just faster in some scenarios. Open addressing looks like just fit to modern processors. |
@funny-falcon Crystal is a work in progress, totally not production ready. Not a lot of thought (in fact I would say "none") has been given to hash functions, they "just work". Pull requests and discussions are welcome to discuss and enhance this, though I personally know nothing about the subject. |
@asterite can we assume Working with that assumption, several changes can be recommended for Thank you. |
You are not crazy. Just with "prime numbers" it was not strictly necessary. But both "power of two" and trick with multiplying instead of modulo requires that step, if original hash function is such dumb.
But it introduce new problems: how reliably and efficiently combine double-linked list and moving semantic of Robin-Hood hashing? At least it will demand additional memory for list. Should it be compressed for small hashes? If so, then you will have a lot of branches to deal with. And you will ruin insertion performance. Consider that most hashes in applications are "delete-less" ie they are only filled and (rarer) updated. Especially web applications (that selects a lot of db records, and only shows them). For such hashes, "indices hash + entries array" is most efficient data structure (if order of insertion should be preserved). |
First, I believe it is better to have:
For backward compatibility, it could be reverted:
(If HashState is a struct. I believe, optimizer has more freedom with such design. But it is not dogma.) And then, Hash will initialize Float's hash then could be implemented as:
Second, (a bit of self promoting) I "recommend" to use my hash function: I'd be happy to implementation all above if you like the design.
With sane hash functions, "power of two" becomes sane choice. |
@funny-falcon yeah it didn't hit me that I think we should implement an ordered and unordered hash and see how fast we can make each one. Once we have data on how much of a performance difference it makes for insertion and deletion, we can decide. I'd like to see it kept ordered but if it's going to be 2x slower I don't think it's worth it. |
Also power of 2 means that the hash table can have a load factor as low as 0.5 right? I'd prefer if our hash implementation was memory efficient. |
Back in the time, I thought it was huge mistake to introduce ordered hash for default Hash in Ruby 1.9 , cause double linked list eats a lot of memory. But then PHP adopted this new design, and then Python, and now Ruby. Now I'm not sure. Now I believe, two base classes is "righter" way to go. Question is: should OrderedHash or UnorderedHash be default Hash? |
In fact, prime numbers are less flexible in terms of memory/performance efficiency. Yes, you may have more prime numbers, but then you will pay for rehashing. Power of two allows Linear Hashing. It is straight forward to implement with chaining. I implemented it once with Cuckoo Hashing. And I know implementation with Coalesce Chaining. In any way, when we talk about OrderedHash implemented as "indices array + entries array", hash load factor affects only "indices array", and so doesn't matter that much. And another point: memory efficiency is almost always result of tradeoff - you pay raw performance for memory efficiency. Cuckoo Hashing and Robin Hood hashing trades insertion performance for lookup performance and memory efficiency. I doubt usual application will win from this tradeoff. I believe, there should not be "Ultimate Swiss Knife" Hash for every kind of usage. It should be convenient enough, fast enough and memory efficient enough for regular usage. It should not be "super convenient", nor "super fast", nor "super memory efficient", cause either "super" trades off from other. Those rare men, who need particular "super" should implement it by them-self, because they knows better what they want. |
Ahh, I talk too much. Lets go to work. |
I didn't mean to accept both ordered and unordered implementations into the stdlib, just to benchmark both and use that to decide which to accept later. |
Then bechmark should a real application, not "microbenchmark". Unordered open-addressing will beat ordered hash in almost every micro-benchmark, except, probably, iteration (and maybe insertion). But will real application gain from this "victory"? It is a good question. And OrderedHash-es are prooved to be useful, otherwise there weren't such hash in many standard libraries (Python had inefficient implementation for a while; before Ruby 1.9 became wildly adopted, ActiveSuport implemented ordered hash by itself). So, certainly OrderedHash should be in stdlib. If it will perform at 90% of UnorderedHash performance, shouldn't it be default Hash? (imho, it should) 80%? 70%? (then choice is not so obvious to me :-) ) |
@funny-falcon that's what I was getting at, of course unordered will beat ordered in the benchmarks the question is how much. I'm not entirely sure how much the compiler uses hashes but it might be a decent candidate for benchmarking (if you disable the LLVM parts and just dump the bc). Otherwise a web application uses hashes for http headers (and params but they're lazily parsed so you'll have to actually access those). Ideally we'd have both micro benchmarks and real applications being benchmarked with changes. |
@sdogruyol, @Eliasjp, @elorest, @drujensen and others: can You help to choose
Please read discussion (there are many questions). |
Rust tries to be system language, and tries to be as efficient as possible. And then it adopted strange choses as SipHash24 and 92% load factor. I proud I convinced them to switch to SipHash13 (but I'm sure, simpler hash function will be enough). And they are ready to reduce load factor (not my proud :-) ). What is Crystal's intention? Is it application language, or system language? Is it "convinient C/C++" or "fast Ruby"? While Crystal certainly could be used for both, (But yeah, C++ std:vunordered_hash is awful :-) ) System language doesn't need bullet-proof super-convinient Hash as default. It needs fast Hash (and hash functions). Application language should not demand programmer to think about such issue: Hash should be convinient and safe. |
Just my opinion: Default - Crystal is not system language. It's application/framework one (Crystal may be embeddable into Ruby applications later, as I have seen on Roadmap - "DSL for writing Ruby extensions"). Also looks like 50-75% load factor is OK. Anyway we can implement |
@funny-falcon what are the tradeoffs between Rust's hash map and what you're proposing for crystal. I'm not sure what you mean by " strange choices". |
New key during iteration could be forbidden just as documentation. It is quite easy to check during iteration, than no rehashing occured due to new keys (as Ruby's st_tables does now). But check for concurrent iteration during each insert (like Ruby's Hash does) could be inefficient a bit. Though, it will be friendlier for unexperienced programmers. |
Found during investigation, that 64bit Crystal has serious issue for serious programming: #4589 |
To protect against Hash DoS, change the way hash value is computed. Class|Struct should define method `def hashme(hasher)` and call `hasher << @ivar` inside. As an option, for speed, and for backward compatibility, `def hash` still could be implemented. It will be used for Hash of matched type. `Thread#hash` and `Enum#hash` is implemented as unseeded cause they are used before `StdHasher @@seed` is initialized. But it is better to implement `def hashme(hasher)`. StdHasher is default hasher that uses `hashme` and it is used as default seeded hasher. It also implements `fasthash` used for fast hash of primitive types, and `unseeded` for `Enums`. Fixes crystal-lang#4578 Prerequisite for crystal-lang#4557
To protect against Hash DoS, change the way hash value is computed. Class|Struct should define method `def hashme(hasher)` and call `hasher << @ivar` inside. As an option, for speed, and for backward compatibility, `def hash` still could be implemented. It will be used for Hash of matched type. `Thread#hash` and `Enum#hash` is implemented as unseeded cause they are used before `StdHasher @@seed` is initialized. But it is better to implement `def hashme(hasher)`. StdHasher is default hasher that uses `hashme` and it is used as default seeded hasher. It also implements `fasthash` used for fast hash of primitive types, and `unseeded` for `Enums`. Fixes crystal-lang#4578 Prerequisite for crystal-lang#4557
To protect against Hash DoS, change the way hash value is computed. Class|Struct should define method `def hash(hasher)` and call `hasher << @ivar` inside. As an option, for speed, and for backward compatibility, `def hash` still could be implemented. It will be used for Hash of matched type. `Thread#hash` and `Signal#hash` is implemented as unseeded cause they are used before `StdHasher @@seed` is initialized. But it is better to implement `def hash(hasher)`. StdHasher is default hasher that uses `hash(hasher)` and it is used as default seeded hasher. It also implements `unseeded` for `Enums`. Also, number normalization for hashing introduced, ie rule 'equality forces hash equality' is forced (`a == b` => `a.hash == b.hash`). Normalization idea is borrowed from Python implementation. Fixes crystal-lang#4578 Prerequisite for crystal-lang#4557 Replaces crystal-lang#4581
To protect against Hash DoS, change the way hash value is computed. Class|Struct should define method `def hash(hasher)` and call `hasher << @ivar` inside. As an option, for speed, and for backward compatibility, `def hash` still could be implemented. It will be used for Hash of matched type. `Thread#hash` and `Signal#hash` is implemented as unseeded cause they are used before `StdHasher @@seed` is initialized. But it is better to implement `def hash(hasher)`. StdHasher is default hasher that uses `hash(hasher)` and it is used as default seeded hasher. It also implements `unseeded` for `Enums`. Also, number normalization for hashing introduced, ie rule 'equality forces hash equality' is forced (`a == b` => `a.hash == b.hash`). Normalization idea is borrowed from Python implementation. (idea by Akzhan Abdulin @akzhan) Fixes crystal-lang#4578 Prerequisite for crystal-lang#4557 Replaces crystal-lang#4581
To protect against Hash DoS, change the way hash value is computed. Class|Struct should define method `def hash(hasher)` and call `hasher << @ivar` inside. As an option, for speed, and for backward compatibility, `def hash` still could be implemented. It will be used for Hash of matched type. `Thread#hash` and `Signal#hash` is implemented as unseeded cause they are used before `StdHasher @@seed` is initialized. Hash::Hasher is default hasher that uses `hash(hasher)` and it is used as default seeded hasher. Also, number normalization for hashing introduced, ie rule 'equality forces hash equality' is forced (`a == b` => `a.hash == b.hash`). Normalization idea is borrowed from Python implementation. It fixes several issues with BigInt and BigFloat on 32bit platform, but not all issues. Fixes crystal-lang#4578 Fixes crystal-lang#3932 Prerequisite for crystal-lang#4557 Replaces crystal-lang#4581 Correlates with crystal-lang#4653
Prepare hash infrastructor to future change of hashing algrorithm to protect against Hash DoS. Class|Struct should define method `def hash(hasher)` and call `hasher << @ivar` inside. As an option, for speed, and for backward compatibility, `def hash` still could be implemented. It will be used for Hash of matched type. `Thread#hash` and `Signal#hash` is implemented as unseeded cause they are used before `StdHasher @@seed` is initialized. Hash::Hasher is default hasher that uses `hash(hasher)` and it is used as default seeded hasher. Also, number normalization for hashing introduced, ie rule 'equality forces hash equality' is forced (`a == b` => `a.hash == b.hash`). Normalization idea is borrowed from Python implementation. It fixes several issues with BigInt and BigFloat on 32bit platform, but not all issues. Fixes crystal-lang#4578 Fixes crystal-lang#3932 Prerequisite for crystal-lang#4557 Replaces crystal-lang#4581 Correlates with crystal-lang#4653
Prepare hash infrastructor to future change of hashing algrorithm to protect against Hash DoS. Class|Struct should define method `def hash(hasher)` and call `hasher << @ivar` inside. As an option, for speed, and for backward compatibility, `def hash` still could be implemented. It will be used for Hash of matched type. `Thread#hash` and `Signal#hash` is implemented as unseeded cause they are used before `StdHasher @@seed` is initialized. Hash::Hasher is default hasher that uses `hash(hasher)` and it is used as default seeded hasher. Also, number normalization for hashing introduced, ie rule 'equality forces hash equality' is forced (`a == b` => `a.hash == b.hash`). Normalization idea is borrowed from Python implementation. It fixes several issues with BigInt and BigFloat on 32bit platform, but not all issues. Fixes crystal-lang#4578 Fixes crystal-lang#3932 Prerequisite for crystal-lang#4557 Replaces crystal-lang#4581 Correlates with crystal-lang#4653
Prepare hash infrastructor to future change of hashing algrorithm to protect against Hash DoS. Class|Struct should define method `def hash(hasher)` and call `hasher << @ivar` inside. As an option, for speed, and for backward compatibility, `def hash` still could be implemented. It will be used for Hash of matched type. `Thread#hash` and `Signal#hash` is implemented as unseeded cause they are used before `StdHasher @@seed` is initialized. Hash::Hasher is default hasher that uses `hash(hasher)` and it is used as default seeded hasher. Also, number normalization for hashing introduced, ie rule 'equality forces hash equality' is forced (`a == b` => `a.hash == b.hash`). Normalization idea is borrowed from Python implementation. It fixes several issues with BigInt and BigFloat on 32bit platform, but not all issues. Fixes crystal-lang#4578 Fixes crystal-lang#3932 Prerequisite for crystal-lang#4557 Replaces crystal-lang#4581 Correlates with crystal-lang#4653
Prepare hash infrastructor to future change of hashing algrorithm to protect against Hash DoS. Class|Struct should define method `def hash(hasher)` and call `hasher << @ivar` inside. As an option, for speed, and for backward compatibility, `def hash` still could be implemented. It will be used for Hash of matched type. `Thread#hash` and `Signal#hash` is implemented as unseeded cause they are used before `StdHasher @@seed` is initialized. Hash::Hasher is default hasher that uses `hash(hasher)` and it is used as default seeded hasher. Also, number normalization for hashing introduced, ie rule 'equality forces hash equality' is forced (`a == b` => `a.hash == b.hash`). Normalization idea is borrowed from Python implementation. It fixes several issues with BigInt and BigFloat on 32bit platform, but not all issues. Fixes crystal-lang#4578 Fixes crystal-lang#3932 Prerequisite for crystal-lang#4557 Replaces crystal-lang#4581 Correlates with crystal-lang#4653
Prepare hash infrastructor to future change of hashing algrorithm to protect against Hash DoS. Class|Struct should define method `def hash(hasher)` and call `hasher << @ivar` inside. As an option, for speed, and for backward compatibility, `def hash` still could be implemented. It will be used for Hash of matched type. `Thread#hash` and `Signal#hash` is implemented as unseeded cause they are used before `StdHasher @@seed` is initialized. Hash::Hasher is default hasher that uses `hash(hasher)` and it is used as default seeded hasher. Also, number normalization for hashing introduced, ie rule 'equality forces hash equality' is forced (`a == b` => `a.hash == b.hash`). Normalization idea is borrowed from Python implementation. It fixes several issues with BigInt and BigFloat on 32bit platform, but not all issues. Fixes crystal-lang#4578 Fixes crystal-lang#3932 Prerequisite for crystal-lang#4557 Replaces crystal-lang#4581 Correlates with crystal-lang#4653
Prepare hash infrastructor to future change of hashing algrorithm to protect against Hash DoS. Class|Struct should define method `def hash(hasher)` and call `hasher << @ivar` inside. As an option, for speed, and for backward compatibility, `def hash` still could be implemented. It will be used for Hash of matched type. `Thread#hash` and `Signal#hash` is implemented as unseeded cause they are used before `StdHasher @@seed` is initialized. Hash::Hasher is default hasher that uses `hash(hasher)` and it is used as default seeded hasher. Also, number normalization for hashing introduced, ie rule 'equality forces hash equality' is forced (`a == b` => `a.hash == b.hash`). Normalization idea is borrowed from Python implementation. It fixes several issues with BigInt and BigFloat on 32bit platform, but not all issues. Fixes crystal-lang#4578 Fixes crystal-lang#3932 Prerequisite for crystal-lang#4557 Replaces crystal-lang#4581 Correlates with crystal-lang#4653
Prepare hash infrastructor to future change of hashing algrorithm to protect against Hash DoS. Class|Struct should define method `def hash(hasher)` and call `hasher << @ivar` inside. As an option, for speed, and for backward compatibility, `def hash` still could be implemented. It will be used for Hash of matched type. `Thread#hash` and `Signal#hash` is implemented as unseeded cause they are used before `StdHasher @@seed` is initialized. Hash::Hasher is default hasher that uses `hash(hasher)` and it is used as default seeded hasher. Also, number normalization for hashing introduced, ie rule 'equality forces hash equality' is forced (`a == b` => `a.hash == b.hash`). Normalization idea is borrowed from Python implementation. It fixes several issues with BigInt and BigFloat on 32bit platform, but not all issues. Fixes crystal-lang#4578 Fixes crystal-lang#3932 Prerequisite for crystal-lang#4557 Replaces crystal-lang#4581 Correlates with crystal-lang#4653
Prepare hash infrastructor to future change of hashing algrorithm to protect against Hash DoS. Class|Struct should define method `def hash(hasher)` and call `hasher << @ivar` inside. As an option, for speed, and for backward compatibility, `def hash` still could be implemented. It will be used for Hash of matched type. `Thread#hash` and `Signal#hash` is implemented as unseeded cause they are used before `StdHasher @@seed` is initialized. Hash::Hasher is default hasher that uses `hash(hasher)` and it is used as default seeded hasher. Also, number normalization for hashing introduced, ie rule 'equality forces hash equality' is forced (`a == b` => `a.hash == b.hash`). Normalization idea is borrowed from Python implementation. It fixes several issues with BigInt and BigFloat on 32bit platform, but not all issues. Fixes crystal-lang#4578 Fixes crystal-lang#3932 Prerequisite for crystal-lang#4557 Replaces crystal-lang#4581 Correlates with crystal-lang#4653
Now it's time to develop new hashes! |
Hmm, maybe I should get mine to compile again. Not because I think it is very fast (I don't - f-f simply have more experience in building hash tables), but because it gives something to benchmark against. And I am quite certain I benched using numbers, and as ysbaddaden showed the hash function for those where abysmal. |
Here is a draft: https://gist.github.com/funny-falcon/084f7ebc159401dfd56bf33241a2ebd6 |
@funny-falcon it should be better in form of [WIP]Pull Request. |
Looks like Hash should be adopted for two different sizes (small (<=6?) and others). |
It would be so cool if someone would retake this issue. Maybe copying in parts the algorithm used by Ruby. Or I think just anything that uses open addressing while retaining insertion order somehow should be a good start and a good improvement because it will avoid having to allocate memory on each hash insertion. |
I'm experimenting with this a bit. I did something really simple (open addressing, linear probing, two arrays: one for indices, another for ordered entries). So far the results are promising, it works faster than the existing Hash and allocates less memory (and should also have better cache locality). I still need to take care of deleting hash entries in a smart way, though... |
Crystal Hash(K, V) should be reimplemented using open addressing and powers of two.
To be read:
Thanks to @RX14, @konovod, @yxhuvud for links.
Refs #2817, #3932, #2331 and #4033 (comment) to continue discussion and implementation.
The text was updated successfully, but these errors were encountered: