Fix UTF-8 truncation #1390
This does not work if é is |
@foonathan This addresses basic UTF-8 handling only, so that width and precision are handled consistently. It ensures that valid UTF-8 input produces valid UTF-8 output at the code point level, which is not the case without this change. Handling different Unicode normalization forms, or even UTF-8 validation for that matter, doesn't belong in fmt in my opinion, because 99.98% of the text on the web is in NFC form anyway. But even if I'm wrong, fmt should at least be able to preserve existing code points. |
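For illustration, here is a minimal sketch of the failure mode described above. It assumes precision is applied as a raw byte count; it is not fmt's code. The helper name `ends_mid_sequence` is hypothetical, and the check assumes the string was cut from well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if s ends in the middle of a multi-byte
// UTF-8 sequence. Assumes s is a prefix of a well-formed UTF-8 string.
bool ends_mid_sequence(const std::string& s) {
  std::size_t i = s.size();
  std::size_t cont = 0;
  // Walk back over trailing continuation bytes (bit pattern 10xxxxxx).
  while (i > 0 && (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
    --i;
    ++cont;
  }
  if (i == 0) return cont > 0;  // only continuation bytes, no lead byte
  const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
  // Number of continuation bytes the lead byte announces.
  const std::size_t expected = lead < 0x80        ? 0
                               : (lead >> 5) == 0x6 ? 1
                               : (lead >> 4) == 0xE ? 2
                                                    : 3;
  return cont < expected;
}

void truncation_example() {
  const std::string cafe = "caf\xC3\xA9";      // "café" in NFC, 5 bytes
  assert(ends_mid_sequence(cafe.substr(0, 4)));   // cut between C3 and A9
  assert(!ends_mid_sequence(cafe.substr(0, 5)));  // whole string is intact
}
```

Cutting "café" after four bytes leaves a dangling lead byte, which is the kind of invalid output the change is meant to prevent.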
Yes, it's an improvement, but not a complete fix. Even with NFC, characters can be multiple code points. |
Code points are the units of text in u8char_t strings in fmt (see fmt::internal::count_code_points), and the PR is a complete fix under this assumption. Consider the following example, which is not related to trimming:
This will print 10 code points if the strings are in either ISO-8859-1 or UTF-8, as expected. Yet, if I understand your definition of "character" correctly, it writes 9 "characters" if "café" is written in NFD form ('e' and COMBINING ACUTE ACCENT). Please also note that with the PR in place users can provide overloads for their own string types for:
and use whatever meaning of "character" they need. |
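The example from this comment did not survive in the thread, so the following is only a sketch of the arithmetic it describes. It assumes code points are counted by skipping UTF-8 continuation bytes; `pad_right` and `padding_example` are hypothetical names, and `count_code_points` here is a standalone function, not fmt's internal one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points by counting every byte that is not a UTF-8
// continuation byte (10xxxxxx). Assumes well-formed UTF-8 input.
std::size_t count_code_points(const std::string& s) {
  std::size_t n = 0;
  for (unsigned char c : s) {
    if ((c & 0xC0) != 0x80) ++n;
  }
  return n;
}

// Hypothetical helper: pads s on the right with spaces until it holds
// `width` code points.
std::string pad_right(const std::string& s, std::size_t width) {
  const std::size_t n = count_code_points(s);
  return n < width ? s + std::string(width - n, ' ') : s;
}

void padding_example() {
  const std::string nfc = "caf\xC3\xA9";   // é as U+00E9
  const std::string nfd = "cafe\xCC\x81";  // 'e' followed by U+0301
  assert(count_code_points(nfc) == 4);
  assert(count_code_points(nfd) == 5);
  // Both pad to 10 code points, but the NFD form receives one space fewer.
  assert(count_code_points(pad_right(nfc, 10)) == 10);
  assert(count_code_points(pad_right(nfd, 10)) == 10);
}
```

With a width of 10, the NFD string gets five spaces of padding instead of six, so a terminal that combines the accent shows 9 columns even though 10 code points were written.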
I don't disagree with you, @vitaut just needs to decide whether the width should be given in code points or actual columns in the terminal. |
Ok, but then I don't see how your concern is valid in relation to this PR. It just fixes code point calculation with precision in mind, ensuring UTF-8 strings stay valid. I didn't want to go into detail about how counting characters should work, I just wanted it to be consistent. I think it is unfair to discuss changes to the meaning of "character" here because it is a much more complex issue. |
Thanks for the PR. This is an improvement even though in the long term we want higher-level units. However, please reuse Line 430 in d6eede9
|
I don't see a way to reuse
The code point index is always smaller than the byte index after the first non-ASCII code point. Consider the text in my test case for example, Maybe another name would be more appropriate for |
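The relationship between the two indices can be sketched as follows. This is an illustration under assumptions, not the function from the PR: `byte_index_of_code_point` and `truncate_code_points` are hypothetical names, the input is assumed to be well-formed UTF-8, and the sample text "été" is mine rather than the one from the test case.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte offset at which code point number n
// (0-based) starts, or s.size() if s holds n or fewer code points.
std::size_t byte_index_of_code_point(const std::string& s, std::size_t n) {
  for (std::size_t i = 0; i < s.size(); ++i) {
    // A byte that is not a continuation byte starts a new code point.
    if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
      if (n == 0) return i;
      --n;
    }
  }
  return s.size();
}

// Keeps at most n code points without splitting a multi-byte sequence.
std::string truncate_code_points(const std::string& s, std::size_t n) {
  return s.substr(0, byte_index_of_code_point(s, n));
}

void index_example() {
  const std::string text = "\xC3\xA9t\xC3\xA9";  // "été": 3 code points, 5 bytes
  assert(byte_index_of_code_point(text, 0) == 0);
  assert(byte_index_of_code_point(text, 1) == 2);  // byte index runs ahead
  assert(byte_index_of_code_point(text, 2) == 3);
  assert(byte_index_of_code_point(text, 3) == 5);  // past the end: size()
  assert(truncate_code_points(text, 2) == "\xC3\xA9t");
}
```

Once the first two-byte code point has been passed, every later code point starts at a byte offset larger than its code point index, which is why a count of code points cannot be used directly as a byte position.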
I've renamed It follows the simple logic in There are many ways for a UTF-8 string to be invalid, but these two functions are still the minimum necessary building blocks for UTF-8 string handling in fmt, even if the actual implementation and the definition of a character are subject to change. |
Sounds reasonable. I think it should be the caller's responsibility to provide valid UTF-8. Thanks! |
This fixes #1389. The new test in format-test.cc fails in the absence of changes to format.h.