Toml++ unicode checking invokes undefined behaviour #144
Comments
Thanks for the report! That code is generated by a code-generator, and from a quick glance, it appears to have generated some weird code there (the same boolean expression appears twice, for example). I'll dig into this when I get the chance, hopefully some time soon.
Thinking about it a bit more, it's also possible that it's a false positive, since those 'error' cases above should be covered by tests already.
If this was a static analyzer I'd agree, but UBSan is a runtime analyzer - it's reporting on an actual situation it has encountered during execution. It theoretically shouldn't (modulo bugs in it, of course) have any false positives. Inspecting the code, I think I do see the actual problem. At unicode.h line 139:
c is "m" (or larger) in the failing case, "m" is ASCII 109, and 109 - 0x2Du = 109 - 45 = 64. So that results in a left shift that's equal to or larger than the size of the type in bits. And according to [expr.shift] in the standard:
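The arithmetic in the comment above can be checked with a small sketch. The offset `0x2D` and the 64-bit mask width come from this thread; the helper names `shift_exponent` and `shift_is_defined` are hypothetical and are not part of toml++.

```cpp
#include <cstdint>
#include <limits>

// Width of the mask type in bits (64 for std::uint64_t).
constexpr unsigned kMaskBits = std::numeric_limits<std::uint64_t>::digits;

// Offset subtracted from the code point, as quoted in the thread.
constexpr std::uint32_t kOffset = 0x2Du;

// Hypothetical helper: the shift exponent the lookup would use for `c`.
constexpr std::uint32_t shift_exponent(char32_t c) noexcept
{
    return static_cast<std::uint32_t>(c) - kOffset;
}

// Hypothetical helper: true only when `1ull << shift_exponent(c)` is well defined,
// i.e. `c` is not below the offset and the exponent is smaller than the mask width.
constexpr bool shift_is_defined(char32_t c) noexcept
{
    return c >= kOffset && shift_exponent(c) < kMaskBits;
}
```

With these definitions, `'l'` (108) gives an exponent of 63, which is still valid, while `'m'` (109) gives 64, which matches the UBSan message and the observation that only characters from `m` upward fail.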
Ah, yeah, you're right. Guess I need to dig into the code generator after all. Thanks.
Thanks for looking into it! This is an amazing library, I really hope to use it in a project but that project has to be UBSan clean. I wish I could provide a pull request, but my knowledge of Unicode is really limited, and I can't understand what that line of code is doing, so I don't want to risk breaking something.
Yeah, it's pretty opaque, largely owing to the nature of Unicode. Since Unicode code points are spread out all over the codepoint space (owing to different languages etc.), trying to determine something relatively mundane like "is a character an X" (where X is "letter", "number", "punctuation" etc.) is an annoyingly complex task. Since I didn't want any third-party dependencies, I wrote a Python script that consumes the Unicode database and spits out a bunch of helper functions for this task, which boil down to a series of nested switch statements, bitmasks, and static bitmaps. That particular line would be looking for a value in a 64-bit space (the mask at the end), starting at some known offset (the
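As a rough illustration of the "64-bit space starting at a known offset" idea described above, here is a minimal sketch of a windowed bitmask lookup. It is not the generated toml++ code: `make_mask`, `in_window`, and the example member set are all hypothetical, and the caller of `make_mask` is assumed to pass only members inside `[base, base + 63]`.

```cpp
#include <cstdint>
#include <initializer_list>

// Hypothetical builder: sets bit (m - base) for every member code point.
// Assumption: every member lies in [base, base + 63]; this is not checked here.
constexpr std::uint64_t make_mask(char32_t base, std::initializer_list<char32_t> members)
{
    std::uint64_t mask = 0;
    for (char32_t m : members)
        mask |= std::uint64_t{1} << (m - base);
    return mask;
}

// Hypothetical lookup: is `c` one of the members encoded in `mask`?
// The two range checks run before the shift, so the exponent is always in [0, 63].
constexpr bool in_window(char32_t c, char32_t base, std::uint64_t mask) noexcept
{
    return c >= base
        && (c - base) < 64u
        && ((std::uint64_t{1} << (c - base)) & mask) != 0;
}
```

For example, with `base = U'-'` and members `{U'-', U'0', U'9', U'l'}`, `in_window` returns true for those four code points and false for anything else, including `U'm'`, which falls one past the end of the 64-bit window. Checking the range before shifting is one way to keep such a lookup UBSan clean; the actual fix in the code generator may differ.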
That's a great explanation, thank you!
@kchalmer This should now be fixed in master. Turned out to be a bit less trivial than I thought; the code-generator has a few simplification passes to see if it can avoid emitting a switch statement where the other conditions are sufficiently simple (e.g. a codepoint range where almost all of them match the condition except one), but there was an edge case where some of the lower-level conditions were being hoisted up in the wrong order. Also did my own testing with UBSan and drive-by fixed another case in one of the parser's error-handling paths. Thanks for the report!
Sorry, I was away from the project for a bit, but I can confirm: all clean from my end too. Thank you very much for the quick response and fix!
Environment
toml++ version and/or commit hash: commit 36030ca
Compiler: gcc version 11.1.0 (GCC)
C++ standard mode: gnu17 (gcc 11's default)
Target arch: x86_64
Library configuration overrides: none
Relevant compilation flags: -fsanitize=undefined
Describe the bug
Parsing with toml++ triggers undefined behaviour in the unicode checking routines for certain ASCII characters (see error message in the "Steps to reproduce" section).
Steps to reproduce (or a small repro code sample)
$ cat tomlplusplus_ub_example.cpp
$ g++ -fsanitize=undefined -I../tomlplusplus/include tomlplusplus_ub_example.cpp -o tomlplusplus_ub_example
$ ./tomlplusplus_ub_example
../tomlplusplus/include/toml++/impl/unicode.h:139:13: runtime error: shift exponent 64 is too large for 64-bit type 'long long unsigned int'
Additional information
This only occurs for characters with an ASCII value of 109 or larger. "l=1" (lowercase ell) parses without an error, but "m=1", "n=1", etc. trigger the UB error.