Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Toml++ unicode checking invokes undefined behaviour #144

Closed
kchalmer opened this issue Feb 24, 2022 · 9 comments
Closed

Toml++ unicode checking invokes undefined behaviour #144

kchalmer opened this issue Feb 24, 2022 · 9 comments
Assignees
Labels
bug Something isn't working

Comments

@kchalmer
Copy link

Environment

toml++ version and/or commit hash: commit 36030ca

Compiler: gcc version 11.1.0 (GCC)

C++ standard mode: gnu17 (gcc 11's default)

Target arch: x86_64

Library configuration overrides: none

Relevant compilation flags: -fsanitize=undefined

Describe the bug

Parsing with toml++ triggers undefined behaviour in the unicode checking routines for certain ASCII characters (see error message in the "Steps to reproduce" section).

Steps to reproduce (or a small repro code sample)

$ cat tomlplusplus_ub_example.cpp

#include <toml++/toml.h>

int main()
{
    auto table = toml::parse("m=1");
}

$ g++ -fsanitize=undefined -I../tomlplusplus/include tomlplusplus_ub_example.cpp -o tomlplusplus_ub_example
$ ./tomlplusplus_ub_example
../tomlplusplus/include/toml++/impl/unicode.h:139:13: runtime error: shift exponent 64 is too large for 64-bit type 'long long unsigned int'

Additional information

This only occurs for characters with an ASCII value of 109 or larger. "l=1" (lowercase ell) parses without an error, but "m=1", "n=1", etc. trigger the UB error.

@kchalmer kchalmer added the bug Something isn't working label Feb 24, 2022
@marzer
Copy link
Owner

marzer commented Feb 24, 2022

Thanks for the report! That code is generated by a code-generator, and from a quick glance, it appears to have generated some weird code there (the same boolean expression appears twice, for example). I'll dig into this when I get the chance, hopefully some time soon.

@marzer
Copy link
Owner

marzer commented Feb 24, 2022

Thinking about it a bit more, it's also possible that it's a false-positive, since those 'error' cases above should be covered by tests already.

@kchalmer
Copy link
Author

If this was a static analyzer I'd agree, but UBSan is a runtime analyzer - it's reporting on an actual situation it has encountered during execution. It theoretically shouldn't (modulo bugs in it, of course) have any false positives.

Inspecting the code, I think I do see the actual problem. At unicode.h line 139:

if ((1ull << (static_cast<uint_least64_t>(c) - 0x2Du)) & 0xFFF43FFFFFF01FF9ull)
	return true;

c is "m" (or larger) in the failing case, "m" is ASCII 109, and 109 - 0x2Du = 109 - 45 = 64. So that results in a left shift that's equal to or larger than the size of the type in bits. And according to [expr.shift] in the standard:

The behavior is undefined if the right operand is negative, or greater than or equal to the length in bits of the promoted left operand.

@marzer
Copy link
Owner

marzer commented Feb 25, 2022

Ah, yeah, you're right. Guess I need to dig in to the code generator after all. Thanks.

@kchalmer
Copy link
Author

Thanks for looking into it! This is an amazing library, I really hope to use it in a project but that project has to be UBSan clean. I wish I could provide a pull request, but my knowledge of Unicode is really limited, and I can't understand what that line of code is doing, so I don't want to risk breaking something.

@marzer
Copy link
Owner

marzer commented Feb 25, 2022

I can't understand what that line of code is doing, so I don't want to risk breaking something.

Yeah, it's pretty opaque, largely owing to the nature of unicode. Since unicode code points are spread out all over the codepoint space (owing to different languages etc.), trying to determine something relatively mundane like "is a character an X" (where X is "letter", "number", "punctuation" etc.) is an annoyingly complex task. Since I didn't want any third-party dependencies, I wrote a python script that consumes the unicode database and spits out a bunch of helper functions for this task, that boil down into a series of nested switch statements, bitmasks, and static bitmaps.

That particular line would be looking for a value in a 64-bit space (the mask at the end), starting at some known offset (the - 0x2Du bit). The script allows me to specify what sort of integer I'm using for the masks, and by default it's set to 64 bit - I suspect what's happened here is I've used >= num_bits where I should have used > num_bits (or something equally dumb).

@kchalmer
Copy link
Author

That's a great explanation, thank you!

@marzer marzer closed this as completed in 1c26ce1 Feb 26, 2022
@marzer
Copy link
Owner

marzer commented Feb 26, 2022

@kchalmer This should now be fixed in master. Turned out to be a bit less trivial than I thought; the code-generator has a few simplification passes to see if it can avoid emitting a switch statement where the other conditions are sufficiently simple (e.g. a codepoint range where almost all of the match the condition except one), but there was an edge case where some of the lower-level conditions were being hoisted up in the wrong order.

Also did my own testing with UBsan and drive-by fixed another case on one of the parser's error-handling path.

Thanks for the report!

@kchalmer
Copy link
Author

Sorry, I was away from the project for a bit, but I can confirm: all clean from my end too. Thank you very much for the quick response and fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants