Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strings are not unicode #560

Closed
pbiggar opened this issue Apr 12, 2021 · 6 comments
Closed

Strings are not unicode #560

pbiggar opened this issue Apr 12, 2021 · 6 comments

Comments

@pbiggar
Copy link

pbiggar commented Apr 12, 2021

My read of the code is that strings are generated from lists of chars. This means that all generated strings only consider ascii (and similar charsets).

This means that random strings generated only touch a small portion of the space of possible strings, and not the full range of Unicode characters, including Emoji, RTL, CJK, combining characters, etc.

@kurtschelfthout
Copy link
Member

kurtschelfthout commented Apr 12, 2021

Yes, you are correct. My intention for v3 is to make the default string generator still focus on ASCII, but have significant chance of generating any unicode character as well.

For v2, it should be pretty easy to add a UnicodeString type which does the same thing.

@kurtschelfthout
Copy link
Member

UnicodeString & UnicodeChar released in 2.15.2.

@pbiggar
Copy link
Author

pbiggar commented Jun 17, 2021

Thanks for releasing this. I've been using it for a day or so and it works well.

One thing that affected me was that it wasn't clear what a UnicodeString is supposed to be. At the moment, the UnicodeString is implemented as a sequence of UnicodeChars. This means that it sometimes generates strings that are invalid unicode, or that are not normalized.

It's actually pretty valuable to have access to both invalid and non-normalized unicode strings, as it allows me to check edge cases I hadn't considered. However, it was also surprising, and required some code to filter and normalize strings, as appropriate for my use case.

I think future users might benefit from one or more of:

  • discussion/documentation on how they should think about a UnicodeString
  • perhaps different types/generators for invalid/valid, normalized/non-normalized UnicodeStrings
  • sample code to handle non-normalized/invalid unicode.

Thanks again!

@pbiggar
Copy link
Author

pbiggar commented Jul 9, 2021

After using this some more, I've found that UnicodeString overwhelmingly produces invalid unicode, making it not super useful. Given how easy it is to product invalid unicode if I wanted, i think it would make more sense for to produce valid unicode, preferably normalized.

@kurtschelfthout
Copy link
Member

Sounds fair. I have very limited time atm sorry for the lazy question - is that achievable via string.Normalize? And does the normalisation form matter?

@pbiggar
Copy link
Author

pbiggar commented Jul 13, 2021

String.Normalize will not fix invalid unicode, but it will throw an exception on it. Here's the string generator I use:

  let string () =
    let isValid (s : string) : bool =
      try
        let (_ : string) = s.Normalize()
        true
      with e ->
        // debuG
        //   "Failed to normalize :"
        //   $"{e}\n '{s}': (len {s.Length}, {System.BitConverter.ToString(toBytes s)})"
        false

    Arb.generate<UnicodeString>
    |> Gen.map (fun (UnicodeString s) -> s)
    |> Gen.filter isValid
    // Now that we know it can be normalized, actually normalize it
    |> Gen.map (fun s -> s.Normalize())

AFAICT, most of my fuzzing time is now spent generating and filtering invalid unicode strings, so a better way to do this would probably be desirable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants