-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Strings are not unicode #560
Comments
Yes, you are correct. My intention for v3 is to make the default string generator still focus on ASCII, but have significant chance of generating any unicode character as well. For v2, it should be pretty easy to add a |
|
Thanks for releasing this. I've been using it for a day or so and it works well. One thing that affected me was that it wasn't clear what a UnicodeString is supposed to be. At the moment, the UnicodeString is implemented as a sequence of UnicodeChars. This means that it sometimes generates strings that are invalid unicode, or that are not normalized. It's actually pretty valuable to have access to both invalid and non-normalized unicode strings, as it allows me to check edge cases I hadn't considered. However, it was also surprising, and required some code to filter and normalize strings, as appropriate for my use case. I think future users might benefit from one or more of:
Thanks again! |
After using this some more, I've found that UnicodeString overwhelmingly produces invalid unicode, making it not super useful. Given how easy it is to product invalid unicode if I wanted, i think it would make more sense for to produce valid unicode, preferably normalized. |
Sounds fair. I have very limited time atm sorry for the lazy question - is that achievable via |
AFAICT, most of my fuzzing time is now spent generating and filtering invalid unicode strings, so a better way to do this would probably be desirable. |
My read of the code is that strings are generated from lists of chars. This means that all generated strings only consider ascii (and similar charsets).
This means that random strings generated only touch a small portion of the space of possible strings, and not the full range of Unicode characters, including Emoji, RTL, CJK, combining characters, etc.
The text was updated successfully, but these errors were encountered: