Toml++ unicode checking invokes undefined behaviour #144

kchalmer · 2022-02-24T22:04:52Z

Environment

toml++ version and/or commit hash: commit 36030ca

Compiler: gcc version 11.1.0 (GCC)

C++ standard mode: gnu17 (gcc 11's default)

Target arch: x86_64

Library configuration overrides: none

Relevant compilation flags: -fsanitize=undefined

Describe the bug

Parsing with toml++ triggers undefined behaviour in the unicode checking routines for certain ASCII characters (see error message in the "Steps to reproduce" section).

Steps to reproduce (or a small repro code sample)

$ cat tomlplusplus_ub_example.cpp

#include <toml++/toml.h>

int main()
{
    auto table = toml::parse("m=1");
}

$ g++ -fsanitize=undefined -I../tomlplusplus/include tomlplusplus_ub_example.cpp -o tomlplusplus_ub_example
$ ./tomlplusplus_ub_example
../tomlplusplus/include/toml++/impl/unicode.h:139:13: runtime error: shift exponent 64 is too large for 64-bit type 'long long unsigned int'

Additional information

This only occurs for characters with an ASCII value of 109 or larger. "l=1" (lowercase ell) parses without an error, but "m=1", "n=1", etc. trigger the UB error.

The text was updated successfully, but these errors were encountered:

marzer · 2022-02-24T23:20:44Z

Thanks for the report! That code is generated by a code-generator, and from a quick glance, it appears to have generated some weird code there (the same boolean expression appears twice, for example). I'll dig into this when I get the chance, hopefully some time soon.

marzer · 2022-02-24T23:54:03Z

Thinking about it a bit more, it's also possible that it's a false-positive, since those 'error' cases above should be covered by tests already.

kchalmer · 2022-02-25T15:00:41Z

If this was a static analyzer I'd agree, but UBSan is a runtime analyzer - it's reporting on an actual situation it has encountered during execution. It theoretically shouldn't (modulo bugs in it, of course) have any false positives.

Inspecting the code, I think I do see the actual problem. At unicode.h line 139:

if ((1ull << (static_cast<uint_least64_t>(c) - 0x2Du)) & 0xFFF43FFFFFF01FF9ull)
	return true;

c is "m" (or larger) in the failing case, "m" is ASCII 109, and 109 - 0x2Du = 109 - 45 = 64. So that results in a left shift that's equal to or larger than the size of the type in bits. And according to [expr.shift] in the standard:

The behavior is undefined if the right operand is negative, or greater than or equal to the length in bits of the promoted left operand.

marzer · 2022-02-25T15:02:56Z

Ah, yeah, you're right. Guess I need to dig in to the code generator after all. Thanks.

kchalmer · 2022-02-25T15:05:00Z

Thanks for looking into it! This is an amazing library, I really hope to use it in a project but that project has to be UBSan clean. I wish I could provide a pull request, but my knowledge of Unicode is really limited, and I can't understand what that line of code is doing, so I don't want to risk breaking something.

marzer · 2022-02-25T15:11:32Z

I can't understand what that line of code is doing, so I don't want to risk breaking something.

Yeah, it's pretty opaque, largely owing to the nature of unicode. Since unicode code points are spread out all over the codepoint space (owing to different languages etc.), trying to determine something relatively mundane like "is a character an X" (where X is "letter", "number", "punctuation" etc.) is an annoyingly complex task. Since I didn't want any third-party dependencies, I wrote a python script that consumes the unicode database and spits out a bunch of helper functions for this task, that boil down into a series of nested switch statements, bitmasks, and static bitmaps.

That particular line would be looking for a value in a 64-bit space (the mask at the end), starting at some known offset (the - 0x2Du bit). The script allows me to specify what sort of integer I'm using for the masks, and by default it's set to 64 bit - I suspect what's happened here is I've used >= num_bits where I should have used > num_bits (or something equally dumb).

kchalmer · 2022-02-25T15:12:53Z

That's a great explanation, thank you!

marzer · 2022-02-26T16:11:00Z

@kchalmer This should now be fixed in master. Turned out to be a bit less trivial than I thought; the code-generator has a few simplification passes to see if it can avoid emitting a switch statement where the other conditions are sufficiently simple (e.g. a codepoint range where almost all of the match the condition except one), but there was an edge case where some of the lower-level conditions were being hoisted up in the wrong order.

Also did my own testing with UBsan and drive-by fixed another case on one of the parser's error-handling path.

Thanks for the report!

kchalmer · 2022-03-11T14:14:07Z

Sorry, I was away from the project for a bit, but I can confirm: all clean from my end too. Thank you very much for the quick response and fix!

kchalmer added the bug Something isn't working label Feb 24, 2022

kchalmer assigned marzer Feb 24, 2022

marzer closed this as completed in 1c26ce1 Feb 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Toml++ unicode checking invokes undefined behaviour #144

Toml++ unicode checking invokes undefined behaviour #144

kchalmer commented Feb 24, 2022

marzer commented Feb 24, 2022 •

edited

Loading

marzer commented Feb 24, 2022

kchalmer commented Feb 25, 2022

marzer commented Feb 25, 2022

kchalmer commented Feb 25, 2022

marzer commented Feb 25, 2022 •

edited

Loading

kchalmer commented Feb 25, 2022

marzer commented Feb 26, 2022 •

edited

Loading

kchalmer commented Mar 11, 2022

Toml++ unicode checking invokes undefined behaviour #144

Toml++ unicode checking invokes undefined behaviour #144

Comments

kchalmer commented Feb 24, 2022

Environment

Describe the bug

Steps to reproduce (or a small repro code sample)

Additional information

marzer commented Feb 24, 2022 • edited Loading

marzer commented Feb 24, 2022

kchalmer commented Feb 25, 2022

marzer commented Feb 25, 2022

kchalmer commented Feb 25, 2022

marzer commented Feb 25, 2022 • edited Loading

kchalmer commented Feb 25, 2022

marzer commented Feb 26, 2022 • edited Loading

kchalmer commented Mar 11, 2022

marzer commented Feb 24, 2022 •

edited

Loading

marzer commented Feb 25, 2022 •

edited

Loading

marzer commented Feb 26, 2022 •

edited

Loading