`of_utf8`: add `Uchar.is_valid` to check the input #51

Lucccyo · 2023-02-13T14:02:27Z

According to the documentation (https://ocaml.org/p/zed/2.0.6/doc/Zed_string/index.html#val-of_utf8), Zed_string.of_utf8 should not raise Stdlib.Invalid_argument if the input is not valid. Uchar.is_valid returns false if the value is greater than Uchar.max (U+10FFFF) or belongs to the surrogate range (U+D800 to U+DFFF), which is not valid in UTF-8. When Uchar.is_valid returns false, Zed_string.of_utf8 now raises Zed_utf8.Invalid (input, "not a Unicode scalar value") instead.
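
As a rough illustration of the check described above, the following sketch calls the standard library's `Uchar.is_valid` on a few boundary values. The helper name `is_scalar` is hypothetical and is not part of Zed.

```ocaml
(* Hypothetical wrapper, for illustration only: Uchar.is_valid takes an int
   and reports whether it is a Unicode scalar value. *)
let is_scalar (n : int) : bool = Uchar.is_valid n

let () =
  (* Boundary values around the surrogate range and around Uchar.max. *)
  List.iter
    (fun n -> Printf.printf "U+%04X valid: %b\n" n (is_scalar n))
    [ 0x0041; 0xD7FF; 0xD800; 0xDFFF; 0xE000; 0x10FFFF; 0x110000 ]
```

Only U+D800, U+DFFF, and 0x110000 are rejected here; U+10FFFF itself is still a valid scalar value.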

tmattio · 2023-03-06T23:16:27Z

IIUC, the problem is that the ranges '\xe0' .. '\xef' and '\xf0' .. '\xf7' can start sequences that decode to invalid characters (namely values above U+10FFFF and U+D800 to U+DFFF from the description of the PR).

Instead of converting the characters in these ranges and checking their validity, is it possible to change the ranges in the pattern match to extract the invalid characters into separate match cases?

Lucccyo · 2023-03-07T08:28:23Z

> IIUC, the problem is that the ranges '\xe0' .. '\xef' and '\xf0' .. '\xf7' can start sequences that decode to invalid characters (namely values above U+10FFFF and U+D800 to U+DFFF from the description of the PR).
>
> Instead of converting the characters in these ranges and checking their validity, is it possible to change the ranges in the pattern match to extract the invalid characters into separate match cases?

I think it is. Is it better than using Uchar.is_valid?

Lucccyo · 2023-03-07T10:36:01Z

We currently have four lead-byte ranges (\x00-\x7f, \xc0-\xdf, \xe0-\xef, and \xf0-\xf7) representing the four UTF-8 encoding lengths (respectively 1 byte, 2 bytes, 3 bytes, and 4 bytes).
Wouldn't adding new ranges to avoid the use of Uchar.is_valid make Zed_string.of_utf8 harder to understand?
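
For readers following along, here is a minimal sketch of what matching on those four lead-byte ranges could look like. It is not Zed's actual code; the function name `sequence_length` is an assumption made for this example.

```ocaml
(* Hypothetical helper, not taken from Zed: map a lead byte to the length
   of the UTF-8 sequence it announces, using the four ranges listed above. *)
let sequence_length (c : char) : int option =
  match c with
  | '\x00' .. '\x7f' -> Some 1
  | '\xc0' .. '\xdf' -> Some 2
  | '\xe0' .. '\xef' -> Some 3
  | '\xf0' .. '\xf7' -> Some 4
  | _ -> None (* continuation byte or byte that cannot start a sequence *)
```

Note that a lead byte such as '\xed' starts both valid sequences (up to U+D7FF) and surrogate encodings (U+D800 and above), so the lead byte alone cannot decide validity.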

tmattio · 2023-03-07T14:45:47Z

> I think it is. Is it better than using Uchar.is_valid?

I'm not sure. Having a pattern match on the range and an additional validity check feels a bit strange: I would expect to see one or the other, but why check for the range if, in the end, the range contains invalid characters?

On the other hand, having a final check will be more robust and prevent similar issues in the future.

Unless someone else has an opinion on this (cc @emillon), I'll let you decide what's better here.

emillon · 2023-06-20T12:38:58Z

I think that there's something that's causing confusion but is not immediately clear from reading the code: on the one hand, there are byte ranges (what we're pattern matching on, the char type); on the other hand, there are codepoint ranges (based on Uchar.t). The set of valid Uchar.t values is 2 disjoint ranges (or, put differently, a range with a range-shaped hole in it). To convert from one to the other, some bit operations are necessary, so it would be awkward to do that at the byte level.
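
To make the byte-level versus codepoint-level distinction concrete, here is a small sketch of the bit operations involved. The helpers `decode3` and `decode4` are hypothetical, skip continuation-byte checks for brevity, and are not how Zed is implemented.

```ocaml
(* Hypothetical helpers: assemble a code point (as an int) from a 3-byte or
   4-byte UTF-8 sequence. Continuation bytes are not validated here. *)
let decode3 (s : string) : int =
  let b i = Char.code s.[i] in
  ((b 0 land 0x0f) lsl 12) lor ((b 1 land 0x3f) lsl 6) lor (b 2 land 0x3f)

let decode4 (s : string) : int =
  let b i = Char.code s.[i] in
  ((b 0 land 0x07) lsl 18)
  lor ((b 1 land 0x3f) lsl 12)
  lor ((b 2 land 0x3f) lsl 6)
  lor (b 3 land 0x3f)

let () =
  (* Same lead byte 0xED, different validity once decoded. *)
  assert (Uchar.is_valid (decode3 "\xed\x9f\xbf"));       (* U+D7FF *)
  assert (not (Uchar.is_valid (decode3 "\xed\xa0\x80"))); (* U+D800 *)
  (* Same lead byte 0xF4, on either side of Uchar.max. *)
  assert (Uchar.is_valid (decode4 "\xf4\x8f\xbf\xbf"));       (* U+10FFFF *)
  assert (not (Uchar.is_valid (decode4 "\xf4\x90\x80\x80")))  (* 0x110000 *)
```

The validity boundary falls in the second byte of each sequence, which is why expressing it purely as lead-byte ranges would be awkward.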

To make an analogy, decimal IPv4 addresses look like `\d+\.\d+\.\d+\.\d+`. You can somehow restrict the regexp to accept at most 3 digits per component, or restrict the leading digit to 1 or 2, but an explicit step that converts each component to an int and checks its range is easier to reason about.
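
The analogy can be sketched as follows: split on dots, convert each component to an int, then check the range explicitly. The function `parse_ipv4` is a hypothetical illustration, and it is deliberately loose (`int_of_string_opt` also accepts OCaml literal forms such as `0x10`).

```ocaml
(* Hypothetical illustration of "parse first, then validate as ints".
   Not a strict IPv4 parser: int_of_string_opt accepts forms like "0x10". *)
let parse_ipv4 (s : string) : (int * int * int * int) option =
  match List.map int_of_string_opt (String.split_on_char '.' s) with
  | [ Some a; Some b; Some c; Some d ]
    when List.for_all (fun n -> n >= 0 && n <= 255) [ a; b; c; d ] ->
      Some (a, b, c, d)
  | _ -> None
```

The range check lives in one obvious place, much like the final `Uchar.is_valid` check after decoding.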

emillon · 2023-06-20T12:39:23Z

So yes I think these changes are good.

@Lucccyo

CHANGES:

* `Zed_utf8.next_error`: raise `Zed_utf8.Out_of_bounds` in case of invalid offset (@Lucccyo, ocaml-community/zed#52)
* `kill_next_word` should not raise `Out_of_bound` (@Lucccyo, ocaml-community/zed#55)
* `of_utf8`: add `Uchar.is_valid` to check the input (@Lucccyo, ocaml-community/zed#51)

Lucccyo force-pushed the of_utf8_exception_fix branch from 37b9c5d to 124f9c5 on February 14, 2023 12:59

tmattio mentioned this pull request Mar 6, 2023

kill_next_word should not raise Out_of_bound #55

Merged

Lucccyo and others added 2 commits June 20, 2023 14:58

unsafe_extract exception fix

1de6c2d

Format

3d33071

tmattio force-pushed the of_utf8_exception_fix branch from 124f9c5 to 3d33071 on June 20, 2023 13:01

tmattio merged commit faf3017 into ocaml-community:master Jun 20, 2023

tmattio mentioned this pull request Jun 20, 2023

[new release] zed (3.2.2) ocaml/opam-repository#23968

Merged
