Suggestion: make string/slice more efficient with match #39525

ishitatsuyuki · 2017-02-04T08:25:26Z

The match clause is already flexible to generate a switch IR for enum and integers.

It would be good to see a good implementation for str, [T] too. https://github.com/sfackler/rust-phf is a good candiate, since it's translated to a switch+equality comparison, and the code generation doesn't take much time (.4s for 100000 entries).

The text was updated successfully, but these errors were encountered:

nagisa · 2017-02-04T09:49:39Z

PHF cannot be used because in case a perfect hash function cannot be found, compilation will never stop. In that sense its not any better than just handing the strings off to LLVM to figure out (which I think uses some sort of radix matching).

ishitatsuyuki · 2017-02-04T10:21:51Z

Even considering birthday paradox, the possibility of hash exhaustion is very low (2^16 on 32bit, 2^32 on 64bit). The compiler would be likely OOM first when such a big match block is generated.

Radix matching is not as fast as hashing (O(len) vs O(len log n)), so a PHF implementation is preferred.

I'm also afraid that LLVM isn't so smart to have such a functionality. A reference helps. LLVM is written in C++ (and probably from the perspective of C++), which only have integral switch blocks.

nagisa · 2017-02-04T10:39:35Z

Even considering birthday paradox, the possibility of hash exhaustion is very low

AFAIR for PHF-based hash maps the hash exhaustion is non-problem – a number of other problems (e.g. two keys hashing to same bucket with most of the hasher keys evaluated) arise way before there’s any sign of hash exhaustion.

ishitatsuyuki · 2017-02-04T14:08:10Z

Can you elaborate on this? I haven't heard that PHF algorithms would loop indefinitely. They are considered very fast, and of course hashing is faster than any tree structure in general.

nagisa · 2017-02-04T21:16:44Z

See this for citations on why extremely high quality hash function is necessary for example. If it turns out that for some particular set of strings the hash function is not good enough, the PHF generation algorithm will simply never stop.

And you want compiler to be guaranteed to terminate eventually for any possible input.

Stebalien · 2017-02-04T23:39:12Z

the PHF generation algorithm will simply never stop.

That can always be fixed by giving up and falling back on the current implementation.

of course hashing is faster than any tree structure in general

In O terms, yes. In practice, a tree structure will be faster for tiny data sets (and I'd guess that most switches are very small).

Regardless, this issue should go in the RFC issue repo (it's not immediately actionable).

ishitatsuyuki · 2017-02-05T02:17:55Z

@Stebalien good points.

However, I doubt that this is a RFC issue; the point is that this change is completely transparent to user (syntax).

From user's side, we don't care if PHF or radix tree is used; the main point is that it should be fast.

ishitatsuyuki · 2017-02-05T02:52:01Z

I have tried to fed a 86000 entries / 9MB json converted into match syntax into the compiler; the compilation didn't complete with either match or phf_map. We probably need improvements on parsing.

Ironically, Python loaded it in milliseconds, and was able to give me the length of the whole data 😂
Additional note: "eval"ing it took about 500ms. json.loads was instant, of course.

Mark-Simulacrum · 2017-05-20T15:18:45Z

I have tried to fed a 86000 entries / 9MB json converted into match syntax into the compiler; the compilation didn't complete with either match or phf_map. We probably need improvements on parsing.

See #7462 for slowness in match checking passes. I would assume that parsing wasn't the problem.

sfackler · 2018-03-01T06:39:54Z

There is no need for a perfect hash function if compilation times are a concern. For example, Java compiles switches over strings down to basically a little inlined hash table with separate chaining to resolve collisions.

ecstatic-morse · 2019-10-15T17:44:12Z

I wrote some relevant criterion benchmarks for my proc-macro library called enum-utils. That library derives FromStr for C-like enums using a comparison trie, and the benchmarks compare the trie against a vanilla match statement, phf and gperf. Each benchmark uses a corpus of strings and is variant over the "miss rate" of the input data (what percentage of the input is part of the test corpus).

Rust keywords:

HTTP/2 header names:

250 simple english words:

phf is slow enough that I'm worried I've misconfigured it somehow. gperf is quite fast at runtime but takes a long time to compile large corpora (>1000 words) and requires some tuning. In any case, a comparison trie is competitive with hashing techniques, never fails at compile-time and is pretty easy to implement, so I think it should be considered as an option here.

ishitatsuyuki mentioned this issue Feb 4, 2017

Integrate into compiler rust-phf/rust-phf#105

Closed

Mark-Simulacrum added the I-slow Issue: Problems and improvements with respect to performance of generated code. label May 20, 2017

Mark-Simulacrum added C-enhancement Category: An issue proposing an enhancement or a PR with one. and removed C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Jul 26, 2017

jonas-schievink added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Oct 15, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suggestion: make string/slice more efficient with match #39525

Suggestion: make string/slice more efficient with match #39525

ishitatsuyuki commented Feb 4, 2017

nagisa commented Feb 4, 2017

ishitatsuyuki commented Feb 4, 2017

nagisa commented Feb 4, 2017 •

edited

Loading

ishitatsuyuki commented Feb 4, 2017

nagisa commented Feb 4, 2017

Stebalien commented Feb 4, 2017

ishitatsuyuki commented Feb 5, 2017

ishitatsuyuki commented Feb 5, 2017

Mark-Simulacrum commented May 20, 2017

sfackler commented Mar 1, 2018

ecstatic-morse commented Oct 15, 2019 •

edited

Loading

Suggestion: make string/slice more efficient with match #39525

Suggestion: make string/slice more efficient with match #39525

Comments

ishitatsuyuki commented Feb 4, 2017

nagisa commented Feb 4, 2017

ishitatsuyuki commented Feb 4, 2017

nagisa commented Feb 4, 2017 • edited Loading

ishitatsuyuki commented Feb 4, 2017

nagisa commented Feb 4, 2017

Stebalien commented Feb 4, 2017

ishitatsuyuki commented Feb 5, 2017

ishitatsuyuki commented Feb 5, 2017

Mark-Simulacrum commented May 20, 2017

sfackler commented Mar 1, 2018

ecstatic-morse commented Oct 15, 2019 • edited Loading

nagisa commented Feb 4, 2017 •

edited

Loading

ecstatic-morse commented Oct 15, 2019 •

edited

Loading