Unicode codepoint flags for custom regexs #7245
Looks like the tokenizer tests are failing on Windows for some reason:
I cannot debug this locally. Is it possible to skip all but the failing test? I have reviewed the previous logs, but that test was not executed, so I think I'm going to start from a clean point and redo all commits until I see the failure. Also, I found that compiling tests with
The problem is the stack size limit on Windows. According to the MSVC /STACK documentation:
afcbcb5 to 6ca6c46
I think I'm done here. Now I have the base to fix tokenizers.
* Replace CODEPOINT_TYPE_* with codepoint_flags
* Update and bugfix brute force random test
* Deterministic brute force random test
* Unicode normalization NFD
* Get rid of BOM
Use flags for each unicode category (`\p{N}`, `\p{L}`, `\p{Z}`, ...) instead of definitions `CODEPOINT_TYPE_*`.
Including helper flags for common regex params like `\s` (only this for now), `\d`, `\w`...
This simplifies writing custom regexs.
All flags are precomputed in `unicode-data.cpp` generated by `gen-unicode-data.py`.
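As a rough illustration of the idea described above, the following Python sketch derives one bit flag per Unicode major category plus a helper whitespace flag for `\s`. It is not the code from this PR: the names `CodepointFlag` and `classify_codepoint`, the bit values, and the use of `str.isspace()` to approximate `\s` are all assumptions made for the example. The real flags live in `unicode-data.cpp` and may be laid out differently.

```python
import unicodedata
from enum import IntFlag


class CodepointFlag(IntFlag):
    """Hypothetical bit layout; the PR's actual values may differ."""
    UNDEFINED = 0x0001
    NUMBER = 0x0002       # \p{N}
    LETTER = 0x0004       # \p{L}
    SEPARATOR = 0x0008    # \p{Z}
    MARK = 0x0010         # \p{M}
    PUNCTUATION = 0x0020  # \p{P}
    SYMBOL = 0x0040       # \p{S}
    CONTROL = 0x0080      # \p{C}
    WHITESPACE = 0x0100   # helper flag for \s


# Map the first letter of the general category (e.g. "Lu" -> "L") to a flag.
_MAJOR_CATEGORY_FLAG = {
    "N": CodepointFlag.NUMBER,
    "L": CodepointFlag.LETTER,
    "Z": CodepointFlag.SEPARATOR,
    "M": CodepointFlag.MARK,
    "P": CodepointFlag.PUNCTUATION,
    "S": CodepointFlag.SYMBOL,
    "C": CodepointFlag.CONTROL,
}


def classify_codepoint(cpt: int) -> CodepointFlag:
    """Return the category flag for a codepoint, plus helper flags."""
    ch = chr(cpt)
    category = unicodedata.category(ch)
    if category == "Cn":  # unassigned codepoint
        return CodepointFlag.UNDEFINED
    flags = _MAJOR_CATEGORY_FLAG[category[0]]
    # Assumption: approximate \s with Python's own notion of whitespace.
    if ch.isspace():
        flags |= CodepointFlag.WHITESPACE
    return flags


def matches(cpt: int, mask: CodepointFlag) -> bool:
    """Test a codepoint against a class mask, e.g. LETTER | NUMBER."""
    return bool(classify_codepoint(cpt) & mask)
```

With flags like these, a custom regex class such as `[\p{L}\p{N}]` reduces to a single mask test, for example `matches(ord("é"), CodepointFlag.LETTER | CodepointFlag.NUMBER)`. A generator script could evaluate such a function once per codepoint and emit the results as a static table.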