pre-compile regexp2 with regexp2cg #11

Merged on Feb 12, 2025 (2 commits)

Contributor

@terwey terwey commented Dec 4, 2024

I noticed that https://github.com/dlclark/regexp2cg is available, which can pre-compile the regular expressions used in this library. Based on the benchmarks below (MacBook Pro M2), it cuts time per operation by about 42% on average (geomean), and by more than half on some of the short inputs.

The regexp2cg tooling has some quirks that could matter if the library gains more encodings in the future, but maybe this will generate more interest and get the project picked up more.

I reorganized the tests so I could re-use the data for the benchmarks.

goos: darwin
goarch: arm64
pkg: github.com/tiktoken-go/tokenizer
                                                                            │ bench_orig.txt │          bench_regexpc.txt          │
                                                                            │     sec/op     │   sec/op     vs base                │
Tokenizer/o200k_base:_hello_world-12                                            1343.5n ± 2%   626.8n ± 5%  -53.35% (p=0.000 n=10)
Tokenizer/o200k_base:_hello__world-12                                            2.438µ ± 1%   1.106µ ± 3%  -54.63% (p=0.000 n=10)
Tokenizer/o200k_base:_hello___world-12                                           2.478µ ± 1%   1.111µ ± 3%  -55.19% (p=0.000 n=10)
Tokenizer/o200k_base:_supercalifragilistic-12                                    1.898µ ± 0%   1.350µ ± 2%  -28.87% (p=0.000 n=10)
Tokenizer/o200k_base:_We_know_what_we_are,_but_know_not_what_we_may_be.-12       8.220µ ± 1%   4.419µ ± 4%  -46.25% (p=0.000 n=10)
Tokenizer/cl100k_base:_hello_world-12                                           1065.5n ± 2%   601.5n ± 5%  -43.54% (p=0.000 n=10)
Tokenizer/cl100k_base:_hello__world-12                                          1720.5n ± 0%   901.2n ± 2%  -47.62% (p=0.000 n=10)
Tokenizer/cl100k_base:_hello___world-12                                         1762.0n ± 1%   934.8n ± 2%  -46.95% (p=0.000 n=10)
Tokenizer/cl100k_base:_supercalifragilistic-12                                   1.766µ ± 1%   1.368µ ± 2%  -22.51% (p=0.000 n=10)
Tokenizer/cl100k_base:_We_know_what_we_are,_but_know_not_what_we_may_be.-12      5.819µ ± 1%   3.934µ ± 3%  -32.39% (p=0.000 n=10)
Tokenizer/r50k_base:_hello_world-12                                              955.0n ± 1%   558.1n ± 3%  -41.55% (p=0.000 n=10)
Tokenizer/r50k_base:_hello__world-12                                            1485.5n ± 1%   838.1n ± 4%  -43.58% (p=0.000 n=10)
Tokenizer/r50k_base:_hello___world-12                                           1602.5n ± 0%   954.5n ± 4%  -40.44% (p=0.000 n=10)
Tokenizer/r50k_base:_supercalifragilistic-12                                     1.756µ ± 0%   1.319µ ± 4%  -24.89% (p=0.000 n=10)
Tokenizer/r50k_base:_We_know_what_we_are,_but_know_not_what_we_may_be.-12        5.294µ ± 1%   3.385µ ± 6%  -36.07% (p=0.000 n=10)
Tokenizer/p50k_base:_hello_world-12                                              956.8n ± 0%   527.1n ± 1%  -44.92% (p=0.000 n=10)
Tokenizer/p50k_base:_hello__world-12                                            1489.0n ± 0%   797.4n ± 0%  -46.45% (p=0.000 n=10)
Tokenizer/p50k_base:_hello___world-12                                           1502.5n ± 0%   802.4n ± 1%  -46.60% (p=0.000 n=10)
Tokenizer/p50k_base:_supercalifragilistic-12                                     1.754µ ± 1%   1.304µ ± 0%  -25.66% (p=0.000 n=10)
Tokenizer/p50k_base:_We_know_what_we_are,_but_know_not_what_we_may_be.-12        5.290µ ± 1%   3.373µ ± 1%  -36.23% (p=0.000 n=10)
Tokenizer/p50k_edit:_hello_world-12                                             1015.5n ± 3%   539.1n ± 5%  -46.91% (p=0.000 n=10)
Tokenizer/p50k_edit:_hello__world-12                                            1480.5n ± 1%   797.4n ± 1%  -46.14% (p=0.000 n=10)
Tokenizer/p50k_edit:_hello___world-12                                           1551.0n ± 5%   810.0n ± 1%  -47.78% (p=0.000 n=10)
Tokenizer/p50k_edit:_supercalifragilistic-12                                     1.750µ ± 1%   1.309µ ± 1%  -25.20% (p=0.000 n=10)
Tokenizer/p50k_edit:_We_know_what_we_are,_but_know_not_what_we_may_be.-12        5.393µ ± 2%   3.373µ ± 1%  -37.45% (p=0.000 n=10)
geomean                                                                          2.012µ        1.175µ       -41.57%
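
For readers who prefer speedup factors, the "vs base" deltas above can be converted with `1 / (1 + delta/100)`. The following is a minimal sketch of that arithmetic only; the helper name `speedup` is made up for illustration and is not part of the tokenizer or of benchstat. The three deltas are copied from the table.

```go
package main

import "fmt"

// speedup converts a benchstat "vs base" delta in percent (negative means
// faster) into the ratio old time / new time.
// Hypothetical helper, written only to illustrate the arithmetic.
func speedup(deltaPercent float64) float64 {
	return 1 / (1 + deltaPercent/100)
}

func main() {
	// Deltas copied from the benchmark table: the geomean, the largest
	// improvement, and the smallest improvement.
	rows := []struct {
		name  string
		delta float64
	}{
		{"geomean", -41.57},
		{"o200k_base: hello   world", -55.19},
		{"cl100k_base: supercalifragilistic", -22.51},
	}
	for _, r := range rows {
		fmt.Printf("%-36s %7.2f%% -> %.2fx\n", r.name, r.delta, speedup(r.delta))
	}
}
```

By this conversion the geomean corresponds to roughly a 1.7x speedup, with individual cases ranging from about 1.3x to about 2.2x.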

@bluescreen10
Contributor

I was looking for something like this a while ago. Will test and give you feedback. Thanks!

@bluescreen10
Contributor

Hey,

I can't get the regeneration to work. Can you tell me what command you executed to regenerate the code?

Ideally I'd like to add a comment in some file: //go:generate regexp2cg ...

$ regexp2cg -path codec -package codec  -o codec/regexp.gen.go
2024/12/05 09:21:02 Create regexp for path ./tokenizer/codec, include tests=false
2024/12/05 09:21:02 file ./tokenizer/codec/r50k_base.go imports regexp2
2024/12/05 09:21:02 unknown ast node type for options: *ast.SelectorExpr &{X:regexp2 Sel:None}
2024/12/05 09:21:02 file ./tokenizer/codec/cl100k_base.go imports regexp2
2024/12/05 09:21:02 unknown ast node type for options: *ast.SelectorExpr &{X:regexp2 Sel:None}
2024/12/05 09:21:02 file ./tokenizer/codec/codec.go imports regexp2
2024/12/05 09:21:02 file ./tokenizer/codec/regexp.gen.go imports regexp2
2024/12/05 09:21:02 file ./tokenizer/codec/o200k_base.go imports regexp2
2024/12/05 09:21:02 unknown ast node type for options: *ast.SelectorExpr &{X:regexp2 Sel:None}
2024/12/05 09:21:02 file ./tokenizer/codec/p50k_base.go imports regexp2
2024/12/05 09:21:02 unknown ast node type for options: *ast.SelectorExpr &{X:regexp2 Sel:None}
2024/12/05 09:21:02 file ./tokenizer/codec/p50k_edit.go imports regexp2
2024/12/05 09:21:02 unknown ast node type for options: *ast.SelectorExpr &{X:regexp2 Sel:None}

@terwey
Contributor Author

terwey commented Dec 5, 2024

Please check my PR on the tools repo; you also need my modifications in this repo to get it to work.

regexp2cg doesn't work when MustCompile is called inside a struct declaration.
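
A rough sketch of the kind of restructuring this implies: hoist the compile call out of the struct literal into a package-level variable, then reference that variable when building the struct. Everything here is an assumption made for illustration. The type `Codec`, the constructor `NewExample`, the variable `exampleSplitRegexp`, and the pattern are hypothetical, and the standard library `regexp` package stands in for regexp2 so the snippet runs without extra dependencies. The actual code in this PR may look different.

```go
package main

import (
	"fmt"
	"regexp"
)

// Before (the shape the generator reportedly cannot handle): the pattern is
// compiled inline, inside the struct literal.
//
//	func NewExample() *Codec {
//		return &Codec{
//			name:        "example",
//			splitRegexp: regexp.MustCompile(`...`),
//		}
//	}

// After: the compile call lives in its own package-level variable.
// NOTE: stdlib regexp is a stand-in for regexp2; the pattern is a toy.
var exampleSplitRegexp = regexp.MustCompile(`[A-Za-z]+|[0-9]+|\s+`)

// Codec is a hypothetical stand-in for a tokenizer codec.
type Codec struct {
	name        string
	splitRegexp *regexp.Regexp
}

// NewExample builds a Codec that reuses the package-level pattern.
func NewExample() *Codec {
	return &Codec{name: "example", splitRegexp: exampleSplitRegexp}
}

// Split breaks the input into the pieces matched by the split pattern.
func (c *Codec) Split(s string) []string {
	return c.splitRegexp.FindAllString(s, -1)
}

func main() {
	fmt.Printf("%q\n", NewExample().Split("hello world 42"))
}
```

Keeping each pattern in a standalone variable also gives a code generator a single, predictable place to find the compile call.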

@bluescreen10
Contributor

I checked out this PR, so the regexps are in a separate variable.

What's your PR on the tools repo?

@terwey
Contributor Author

terwey commented Dec 5, 2024

dlclark/regexp2cg#2

I just ran it inside codec after doing a go build from my tools PR, without any parameters.

@bluescreen10
Contributor

Ahh, got it! I'd like your PR on regexp2cg to be merged first before approving this PR.

Before switching the regexp engine, I'd like to make sure this library will be maintained.

That said, I'll try to nudge your PR on dlclark/regexp2cg along.

Thanks for your patience.

@terwey
Contributor Author

terwey commented Dec 5, 2024

Understandable! No rush, just nice to see performance improvements like this are possible.

@terwey
Contributor Author

terwey commented Jan 30, 2025

@bluescreen10 just a small bump: dlclark/regexp2cg#2 has been merged.

@bluescreen10
Contributor

Yes, I saw that. I'll re-test and merge. Sorry it's taking me a little while.

@bluescreen10
Contributor

Unfortunately, even though it was merged, it hasn't been released yet. I've created an issue to track that:

dlclark/regexp2cg#4

@geraldstanje1

@bluescreen10 How can I get it without an official release? How do I build the library? Does it also improve memory usage?

@bluescreen10 bluescreen10 merged commit 03e0e5f into tiktoken-go:main Feb 12, 2025
1 check passed

3 participants