Strings are not unicode #560

pbiggar · 2021-04-12T18:08:13Z

My read of the code is that strings are generated from lists of chars. This means that all generated strings only consider ascii (and similar charsets).

This means that random strings generated only touch a small portion of the space of possible strings, and not the full range of Unicode characters, including Emoji, RTL, CJK, combining characters, etc.

kurtschelfthout · 2021-04-12T22:16:39Z

Yes, you are correct. My intention for v3 is to make the default string generator still focus on ASCII, but have significant chance of generating any unicode character as well.

For v2, it should be pretty easy to add a UnicodeString type which does the same thing.

kurtschelfthout · 2021-04-15T21:12:32Z

UnicodeString & UnicodeChar released in 2.15.2.

pbiggar · 2021-06-17T19:01:14Z

Thanks for releasing this. I've been using it for a day or so and it works well.

One thing that affected me was that it wasn't clear what a UnicodeString is supposed to be. At the moment, the UnicodeString is implemented as a sequence of UnicodeChars. This means that it sometimes generates strings that are invalid unicode, or that are not normalized.

It's actually pretty valuable to have access to both invalid and non-normalized unicode strings, as it allows me to check edge cases I hadn't considered. However, it was also surprising, and required some code to filter and normalize strings, as appropriate for my use case.

I think future users might benefit from one or more of:

discussion/documentation on how they should think about a UnicodeString
perhaps different types/generators for invalid/valid, normalized/non-normalized UnicodeStrings
sample code to handle non-normalized/invalid unicode.

Thanks again!

pbiggar · 2021-07-09T18:27:53Z

After using this some more, I've found that UnicodeString overwhelmingly produces invalid unicode, making it not super useful. Given how easy it is to product invalid unicode if I wanted, i think it would make more sense for to produce valid unicode, preferably normalized.

kurtschelfthout · 2021-07-13T06:22:37Z

Sounds fair. I have very limited time atm sorry for the lazy question - is that achievable via string.Normalize? And does the normalisation form matter?

pbiggar · 2021-07-13T14:49:10Z

String.Normalize will not fix invalid unicode, but it will throw an exception on it. Here's the string generator I use:

  let string () =
    let isValid (s : string) : bool =
      try
        let (_ : string) = s.Normalize()
        true
      with e ->
        // debuG
        //   "Failed to normalize :"
        //   $"{e}\n '{s}': (len {s.Length}, {System.BitConverter.ToString(toBytes s)})"
        false

    Arb.generate<UnicodeString>
    |> Gen.map (fun (UnicodeString s) -> s)
    |> Gen.filter isValid
    // Now that we know it can be normalized, actually normalize it
    |> Gen.map (fun s -> s.Normalize())

AFAICT, most of my fuzzing time is now spent generating and filtering invalid unicode strings, so a better way to do this would probably be desirable.

kurtschelfthout added the enhancement label Apr 12, 2021

kurtschelfthout mentioned this issue Apr 14, 2021

Add UnicodeChar and UnicodeString. #562

Merged

kurtschelfthout closed this as completed Apr 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Strings are not unicode #560

Strings are not unicode #560

pbiggar commented Apr 12, 2021

kurtschelfthout commented Apr 12, 2021 •

edited

Loading

kurtschelfthout commented Apr 15, 2021

pbiggar commented Jun 17, 2021

pbiggar commented Jul 9, 2021

kurtschelfthout commented Jul 13, 2021

pbiggar commented Jul 13, 2021 •

edited

Loading

Strings are not unicode #560

Strings are not unicode #560

Comments

pbiggar commented Apr 12, 2021

kurtschelfthout commented Apr 12, 2021 • edited Loading

kurtschelfthout commented Apr 15, 2021

pbiggar commented Jun 17, 2021

pbiggar commented Jul 9, 2021

kurtschelfthout commented Jul 13, 2021

pbiggar commented Jul 13, 2021 • edited Loading

kurtschelfthout commented Apr 12, 2021 •

edited

Loading

pbiggar commented Jul 13, 2021 •

edited

Loading