Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestion: make string/slice more efficient with match #39525

Open
ishitatsuyuki opened this issue Feb 4, 2017 · 11 comments
Open

Suggestion: make string/slice more efficient with match #39525

ishitatsuyuki opened this issue Feb 4, 2017 · 11 comments
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@ishitatsuyuki
Copy link
Contributor

The match clause is already flexible to generate a switch IR for enum and integers.

It would be good to see a good implementation for str, [T] too. https://github.com/sfackler/rust-phf is a good candiate, since it's translated to a switch+equality comparison, and the code generation doesn't take much time (.4s for 100000 entries).

@nagisa
Copy link
Member

nagisa commented Feb 4, 2017

PHF cannot be used because in case a perfect hash function cannot be found, compilation will never stop. In that sense its not any better than just handing the strings off to LLVM to figure out (which I think uses some sort of radix matching).

@ishitatsuyuki
Copy link
Contributor Author

Even considering birthday paradox, the possibility of hash exhaustion is very low (2^16 on 32bit, 2^32 on 64bit). The compiler would be likely OOM first when such a big match block is generated.

Radix matching is not as fast as hashing (O(len) vs O(len log n)), so a PHF implementation is preferred.

I'm also afraid that LLVM isn't so smart to have such a functionality. A reference helps. LLVM is written in C++ (and probably from the perspective of C++), which only have integral switch blocks.

@nagisa
Copy link
Member

nagisa commented Feb 4, 2017

Even considering birthday paradox, the possibility of hash exhaustion is very low

AFAIR for PHF-based hash maps the hash exhaustion is non-problem – a number of other problems (e.g. two keys hashing to same bucket with most of the hasher keys evaluated) arise way before there’s any sign of hash exhaustion.

@ishitatsuyuki
Copy link
Contributor Author

Can you elaborate on this? I haven't heard that PHF algorithms would loop indefinitely. They are considered very fast, and of course hashing is faster than any tree structure in general.

@nagisa
Copy link
Member

nagisa commented Feb 4, 2017

See this for citations on why extremely high quality hash function is necessary for example. If it turns out that for some particular set of strings the hash function is not good enough, the PHF generation algorithm will simply never stop.

And you want compiler to be guaranteed to terminate eventually for any possible input.

@Stebalien
Copy link
Contributor

the PHF generation algorithm will simply never stop.

That can always be fixed by giving up and falling back on the current implementation.

of course hashing is faster than any tree structure in general

In O terms, yes. In practice, a tree structure will be faster for tiny data sets (and I'd guess that most switches are very small).


Regardless, this issue should go in the RFC issue repo (it's not immediately actionable).

@ishitatsuyuki
Copy link
Contributor Author

@Stebalien good points.

However, I doubt that this is a RFC issue; the point is that this change is completely transparent to user (syntax).

From user's side, we don't care if PHF or radix tree is used; the main point is that it should be fast.

@ishitatsuyuki
Copy link
Contributor Author

I have tried to fed a 86000 entries / 9MB json converted into match syntax into the compiler; the compilation didn't complete with either match or phf_map. We probably need improvements on parsing.

Ironically, Python loaded it in milliseconds, and was able to give me the length of the whole data 😂
Additional note: "eval"ing it took about 500ms. json.loads was instant, of course.

@Mark-Simulacrum
Copy link
Member

I have tried to fed a 86000 entries / 9MB json converted into match syntax into the compiler; the compilation didn't complete with either match or phf_map. We probably need improvements on parsing.

See #7462 for slowness in match checking passes. I would assume that parsing wasn't the problem.

@Mark-Simulacrum Mark-Simulacrum added the I-slow Issue: Problems and improvements with respect to performance of generated code. label May 20, 2017
@Mark-Simulacrum Mark-Simulacrum added C-enhancement Category: An issue proposing an enhancement or a PR with one. and removed C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Jul 26, 2017
@sfackler
Copy link
Member

sfackler commented Mar 1, 2018

There is no need for a perfect hash function if compilation times are a concern. For example, Java compiles switches over strings down to basically a little inlined hash table with separate chaining to resolve collisions.

@ecstatic-morse
Copy link
Contributor

ecstatic-morse commented Oct 15, 2019

I wrote some relevant criterion benchmarks for my proc-macro library called enum-utils. That library derives FromStr for C-like enums using a comparison trie, and the benchmarks compare the trie against a vanilla match statement, phf and gperf. Each benchmark uses a corpus of strings and is variant over the "miss rate" of the input data (what percentage of the input is part of the test corpus).

Rust keywords:
gnuplot_canvas

HTTP/2 header names:
http

250 simple english words:
english

phf is slow enough that I'm worried I've misconfigured it somehow. gperf is quite fast at runtime but takes a long time to compile large corpora (>1000 words) and requires some tuning. In any case, a comparison trie is competitive with hashing techniques, never fails at compile-time and is pretty easy to implement, so I think it should be considered as an option here.

@jonas-schievink jonas-schievink added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Oct 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

7 participants