Suggestion: make string/slice more efficient with match #39525
Comments
PHF cannot be used because if a perfect hash function cannot be found, compilation will never stop. In that sense it's not any better than just handing the strings off to LLVM to figure out (which I think uses some sort of radix matching).
Even considering the birthday paradox, the probability of hash exhaustion is very low (around 2^16 entries on 32-bit, 2^32 on 64-bit). The compiler would likely run out of memory first when such a big match block is generated. Radix matching is not as fast as hashing (O(len) for hashing vs. O(len log n) for radix matching), so a PHF implementation is preferred. I'm also afraid that LLVM isn't smart enough to have such functionality; a reference would help. LLVM is written in C++ (and probably from the perspective of C++), which only has integral switch blocks.
AFAIR, for PHF-based hash maps hash exhaustion is a non-problem: a number of other problems (e.g. two keys hashing to the same bucket with most of the hasher keys evaluated) arise way before there's any sign of hash exhaustion.
Can you elaborate on this? I haven't heard that PHF algorithms would loop indefinitely. They are considered very fast, and of course hashing is faster than any tree structure in general.
See this for citations on why an extremely high-quality hash function is necessary, for example. If it turns out that the hash function is not good enough for some particular set of strings, the PHF generation algorithm will simply never stop. And you want the compiler to be guaranteed to terminate eventually for any possible input.
That can always be fixed by giving up and falling back on the current implementation.
Regardless, this issue should go in the RFC issue repo (it's not immediately actionable).
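The "give up and fall back" idea can be sketched as a seed search with a hard attempt limit, so the search always terminates. This is only an illustration under stated assumptions: `seeded_hash`, `find_perfect_seed`, `Strategy`, and `choose_strategy` are hypothetical names, the hash is a seeded FNV-1a variant picked for brevity, and the plain "retry with the next seed" loop is not the algorithm rust-phf or the compiler actually uses.

```rust
/// Hypothetical seeded hash (FNV-1a with the seed mixed into the offset basis).
fn seeded_hash(seed: u64, key: &str) -> u64 {
    let mut h = 0xcbf2_9ce4_8422_2325u64 ^ seed;
    for &b in key.as_bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Try at most `max_attempts` seeds. Return the first seed that maps every
/// key to a distinct slot in a table of `keys.len()` slots, or `None`.
fn find_perfect_seed(keys: &[&str], max_attempts: u64) -> Option<u64> {
    let n = keys.len();
    if n == 0 {
        return Some(0);
    }
    'seeds: for seed in 0..max_attempts {
        let mut used = vec![false; n];
        for &key in keys {
            let slot = (seeded_hash(seed, key) % n as u64) as usize;
            if used[slot] {
                continue 'seeds; // collision: give up on this seed
            }
            used[slot] = true;
        }
        return Some(seed);
    }
    None // attempt budget exhausted
}

/// What the code generator would do with the result.
enum Strategy {
    PerfectHash { seed: u64 },
    /// Keep whatever lowering is used today.
    Fallback,
}

fn choose_strategy(keys: &[&str], max_attempts: u64) -> Strategy {
    match find_perfect_seed(keys, max_attempts) {
        Some(seed) => Strategy::PerfectHash { seed },
        None => Strategy::Fallback,
    }
}
```

Because the loop is bounded by `max_attempts`, a pathological key set costs a fixed amount of work and then takes the fallback path instead of hanging compilation.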
@Stebalien good points. However, I doubt that this is an RFC issue; the point is that this change is completely transparent to the user (the syntax does not change). From the user's side, we don't care whether PHF or a radix tree is used; the main point is that it should be fast.
I tried feeding an 86,000-entry, 9 MB JSON file, converted into match syntax, to the compiler; compilation didn't complete with either match or phf_map. We probably need improvements in parsing. Ironically, Python loaded it in milliseconds and was able to give me the length of the whole data 😂
See #7462 for slowness in match checking passes. I would assume that parsing wasn't the problem.
There is no need for a perfect hash function if compilation times are a concern. For example, Java compiles switches over strings down to basically a little inlined hash table with separate chaining to resolve collisions.
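A rough Rust rendering of that lowering, written by hand: switch on an ordinary (non-perfect) hash, then confirm the candidate with an equality check. The hash function `fnv1a`, the constants, and `classify` are all made up for illustration; this is not what rustc or javac emits.

```rust
/// Hypothetical compile-time hash (32-bit FNV-1a), chosen only for brevity.
const fn fnv1a(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5;
    let mut i = 0;
    while i < bytes.len() {
        hash ^= bytes[i] as u32;
        hash = hash.wrapping_mul(0x0100_0193);
        i += 1;
    }
    hash
}

// Hashes of the arm keys, computed at compile time.
const H_GET: u32 = fnv1a(b"GET");
const H_PUT: u32 = fnv1a(b"PUT");
const H_POST: u32 = fnv1a(b"POST");

/// Hand-written equivalent of `match s { "GET" => 0, "PUT" => 1, "POST" => 2, _ => none }`.
fn classify(s: &str) -> Option<u8> {
    // Integer switch on the hash; the guard is the equality check that
    // rejects strings which merely share a hash with a key.
    match fnv1a(s.as_bytes()) {
        H_GET if s == "GET" => Some(0),
        H_PUT if s == "PUT" => Some(1),
        H_POST if s == "POST" => Some(2),
        _ => None,
    }
}
```

If two keys happened to share a hash, their guarded arms would simply be tried in order, which plays the role of the collision chain, so no search for a perfect hash function is needed.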
I wrote some relevant criterion benchmarks for my proc-macro library called
The match clause is already flexible enough to generate a switch in IR for enums and integers. It would be good to see a good implementation for str and [T] too. https://github.com/sfackler/rust-phf is a good candidate, since it's translated to a switch + equality comparison, and the code generation doesn't take much time (0.4 s for 100,000 entries).
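For reference, these are the two kinds of match the suggestion is about: one over string literals and one over slice patterns. The function names and arm values are hypothetical; the sketch only shows the source-level shape, not how the compiler lowers it.

```rust
/// Match over `str`: each arm is a string literal.
fn keyword_id(s: &str) -> Option<u32> {
    match s {
        "fn" => Some(0),
        "let" => Some(1),
        "match" => Some(2),
        _ => None,
    }
}

/// Match over `[T]`: arms are slice patterns.
fn file_kind(bytes: &[u8]) -> &'static str {
    match bytes {
        [] => "empty",
        [0x7f, b'E', b'L', b'F', ..] => "elf",
        [b'P', b'K', ..] => "zip",
        _ => "unknown",
    }
}
```

With only a handful of arms the lowering hardly matters; the discussion concerns match blocks with thousands of such arms.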