Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(dictgen): Add aho-corasick support #1199

Merged
merged 1 commit into from
Dec 31, 2024
Merged

feat(dictgen): Add aho-corasick support #1199

merged 1 commit into from
Dec 31, 2024

Conversation

epage
Copy link
Collaborator

@epage epage commented Dec 31, 2024

As recommended by @BurntSushi

benches              fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ hit                             │               │               │               │         │
│  ├─ aho_corasick                 │               │               │               │         │
│  │  ╰─ finallizes  34.88 ns      │ 152 ms        │ 36.88 ns      │ 1.52 ms       │ 100     │ 100
│  ├─ cased_map                    │               │               │               │         │
│  │  ╰─ finallizes  31.88 ns      │ 5.156 µs      │ 34.38 ns      │ 85.96 ns      │ 100     │ 100
│  ├─ map                          │               │               │               │         │
│  │  ╰─ finallizes  39.88 ns      │ 3.553 µs      │ 41.88 ns      │ 77.48 ns      │ 100     │ 100
│  ├─ ordered_map                  │               │               │               │         │
│  │  ╰─ finallizes  146.8 ns      │ 3.232 µs      │ 149.8 ns      │ 205.8 ns      │ 100     │ 100
│  ╰─ trie                         │               │               │               │         │
│     ╰─ finallizes  65.88 ns      │ 6.734 µs      │ 68.88 ns      │ 136.3 ns      │ 100     │ 100
├─ miss                            │               │               │               │         │
│  ├─ aho_corasick                 │               │               │               │         │
│  │  ╰─ finalizes   6.548 ns      │ 6.99 ns       │ 6.802 ns      │ 6.781 ns      │ 100     │ 25600
│  ├─ cased_map                    │               │               │               │         │
│  │  ╰─ finalizes   11.95 ns      │ 12.14 ns      │ 12.04 ns      │ 12.04 ns      │ 100     │ 12800
│  ├─ map                          │               │               │               │         │
│  │  ╰─ finalizes   18.38 ns      │ 36.38 ns      │ 18.59 ns      │ 19.3 ns       │ 100     │ 6400
│  ├─ ordered_map                  │               │               │               │         │
│  │  ╰─ finalizes   125.4 ns      │ 198.6 ns      │ 126.8 ns      │ 128.9 ns      │ 100     │ 1600
│  ╰─ trie                         │               │               │               │         │
│     ╰─ finalizes   33.66 ns      │ 42.63 ns      │ 35.44 ns      │ 35.19 ns      │ 100     │ 3200
╰─ new                             │               │               │               │         │
   ╰─ aho_corasick   125.7 ms      │ 149.2 ms      │ 138.5 ms      │ 137.9 ms      │ 100     │ 100

Considering we should have millions of misses with only a couple of hits, this seems promising. However, the construction time is prohibitive.

@coveralls
Copy link

coveralls commented Dec 31, 2024

Pull Request Test Coverage Report for Build 12560549722

Details

  • 0 of 85 (0.0%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.6%) to 20.029%

Changes Missing Coverage Covered Lines Changed/Added Lines %
crates/dictgen/src/gen.rs 0 1 0.0%
crates/typos-dict/tests/codegen.rs 0 21 0.0%
crates/dictgen/src/aho_corasick.rs 0 63 0.0%
Totals Coverage Status
Change from base Build 12559829744: -0.6%
Covered Lines: 553
Relevant Lines: 2761

💛 - Coveralls

crates/dictgen/src/aho_corasick.rs Fixed Show fixed Hide fixed
crates/dictgen/src/aho_corasick.rs Fixed Show fixed Hide fixed
crates/dictgen/src/aho_corasick.rs Fixed Show fixed Hide fixed
crates/dictgen/src/aho_corasick.rs Fixed Show fixed Hide fixed
crates/dictgen/src/aho_corasick.rs Fixed Show fixed Hide fixed
crates/dictgen/src/aho_corasick.rs Fixed Show fixed Hide fixed
crates/typos-dict/benches/benches/main.rs Fixed Show fixed Hide fixed
@epage epage force-pushed the aho branch 3 times, most recently from 74d9cdb to 9e5d85a Compare December 31, 2024 13:59
@epage epage enabled auto-merge December 31, 2024 13:59
@BurntSushi
Copy link

Yeah building a DFA is extremely costly. Did you try using the contiguous NFA? Likely slower search times, but potentially much faster construction time.

(I have high hopes some day for adding support for serializing Aho-Corasick automata and then providing zero-copy deserialization APIs like what regex-automata has, but it's a big effort with some serious downsides.)

@BurntSushi
Copy link

Also, what are you using to produce those benchmark results? It looks nice!

@epage
Copy link
Collaborator Author

epage commented Dec 31, 2024

Yeah building a DFA is extremely costly. Did you try using the contiguous NFA? Likely slower search times, but potentially much faster construction time.

If its too much slower, then I might as well use PHF so unsure if its worth trying.

Also, what are you using to produce those benchmark results? It looks nice!

https://crates.io/crates/divan

It can still be jittery but appreciate the faster run time and the clean API.

@epage epage merged commit 1af4452 into crate-ci:master Dec 31, 2024
21 checks passed
@epage epage deleted the aho branch December 31, 2024 14:23
@szepeviktor
Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants