-
Notifications
You must be signed in to change notification settings - Fork 546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix null byte \x00
issue in byte level fsm resulting in KeyError
in BetterFSM::FSMInfo
#930
Conversation
8ce451e
to
da2608d
Compare
Should we add some test cases like the one in #833 to verify if this |
Yes exactly
…________________________________
From: Andrew Lapp ***@***.***>
Sent: Saturday, June 1, 2024 3:14:48 AM
To: outlines-dev/outlines ***@***.***>
Cc: Tommy Yang ***@***.***>; Mention ***@***.***>
Subject: Re: [outlines-dev/outlines] Fix null byte `\x00` issue in byte level fsm resulting in `KeyError` in `BetterFSM::FSMInfo` (PR #930)
Should we add some test cases like the one in #833<#833> to verify if this \x00 issue is solved in later updates as well? I am a little concerned whether numba's bug fix will break the current implementation in the future.
You mean a test case to ensure that a unicode_type starting with \x00 isn't broken in numba in the same way it breaks UnicodeCharSeq?
—
Reply to this email directly, view it on GitHub<#930 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ADJEHYLTECZINZ37C5OIELLZFDD2RAVCNFSM6AAAAABIRYPGICVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNBSHA2DOMJTGI>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
I have introduced two new tests to verify the behavior:
|
@@ -9,7 +9,14 @@ | |||
from outlines.fsm.parsing import PartialLark, PartialPythonIndenter | |||
|
|||
|
|||
def test_partial_parsing(): | |||
@pytest.fixture | |||
def cleanup_lark_import(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change isn't directly related to the PR, but fixes an issue - if the tests in test_parsing.py
fail, it cascades and causes a large number of unrelated tests to fail.
Follow-up to #904 by @M0gician -- Closes #904
Fixes #833
Problem 1
@M0gician identified a bug in numba's handling of
UnicodeCharSeq
. if the elements first item is\x00
, the entire element is null. This numba bug resulted in aKeyError
while compiling a patterns index for certain characters.More problem details available in #904
Solution 1
@M0gician implemented the base solution - represent token bytes with
numba.unicode_type
to avoid this bug.4a5ef55
Problem 2
Use of
unicode_type
results in ambiguous representation of hex-bytes / chars.E.g. in main a
UnicodeCharSeq
will keep the hex bytes and characters separate["😇", "9", "F", "9F"]
Because we use unicode_type instead a List[UnicodeCharSeq(2)], the example sequence would be represented as
["😇", "9", "F", "9F"]
->'😇9F9F'
This demonstrates a problem, we have no idea whether consecutive hex characters represent a byte or two separate utf-8 characters.
Solution 2
To resolve this, we prefix a hex-byte with a null byte.
["😇", "9", "F", "9F"]
->'😇9F\x009F'
.83c4d3a
Problem 3
Parsing a token to separate bytes and chars during
_walk_fsm
is inefficient and resulted in index compilation taking ~2.2x as long asmain
Solution 3
Instead of iterating over a string in
_walk_fsm
, we precompute a mapping of token -> transition key seq viaget_tokens_transition_keys()
and iterate over list of integer transition keys for each token in_walk_fsm()
56ea957
Now instead of taking ~2.2x as long as
main
, index compilation takes ~0.65x as long asmain
. However Numba takes 16% longer to compile.Benchmark Suite Results:
Benchmarks that have improved:
Benchmarks that have stayed the same:
Benchmarks that have got worse:
Performance degradation detected!