Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

of_utf8: add Uchar.is_valid to check the input #51

Merged
merged 2 commits into from
Jun 20, 2023

Conversation

Lucccyo
Copy link
Contributor

@Lucccyo Lucccyo commented Feb 13, 2023

According to the documentation (https://ocaml.org/p/zed/2.0.6/doc/Zed_string/index.html#val-of_utf8), Zed_string.of_utf8 should not raise Stdlib.Invalid_argument if the input is not valid. Uchar.is_valid returns false if the value is bigger than Uchar.max (U+10FFFF) or belongs to the invalid Unicode range (from U+D800 to U+DFFF). In this case, Zed_string.of_utf8 raises Zed_utf8.Invalid (input, "not a Unicode scalar value").

@Lucccyo Lucccyo force-pushed the of_utf8_exception_fix branch from 37b9c5d to 124f9c5 Compare February 14, 2023 12:59
@tmattio
Copy link
Collaborator

tmattio commented Mar 6, 2023

IIUC, the problem is that the ranges '\xe0' .. '\xef' and '\xf0' .. '\xf7' contain invalid characters (namely U+10FFFF and U+D800 to U+DFFF from the description of the PR).

Instead of converting the characters in these ranges and checking their validity, is it possible to change the ranges in the pattern match to extract the invalid characters into separate match cases?

@Lucccyo
Copy link
Contributor Author

Lucccyo commented Mar 7, 2023

IIUC, the problem is that the ranges '\xe0' .. '\xef' and '\xf0' .. '\xf7' contain invalid characters (namely U+10FFFF and U+D800 to U+DFFF from the description of the PR).

Instead of converting the characters in these ranges and checking their validity, is it possible to change the ranges in the pattern match to extract the invalid characters into separate match cases?

I think it is. Is it better than using Uchar.is_valid?

@Lucccyo
Copy link
Contributor Author

Lucccyo commented Mar 7, 2023

We currently have four ranges (\x00-\x7f, \xc0-\xdf,\xe0-\xef and \xf0-\xf7) representing the four UTF8 encoding lengths (respectively 1 byte, 2 bytes, 3 bytes, and 4 bytes).
May adding new ranges to avoid the use of Uchar.is_valid make Zed_string.of_utf8 less understandable?

@tmattio
Copy link
Collaborator

tmattio commented Mar 7, 2023

I think it is. Is it better than using Uchar.is_valid?

I'm not sure. Having a pattern match on the range and an additional validity check feels a bit strange: I would expect to see one or the other, but why check for the range if, in the end, the range contains invalid characters?

On the other hand, having a final check will be more robust and prevent similar issues in the future.

Unless someone else has an opinion on this (cc @emillon), I'll let you decide what's better here

@emillon
Copy link
Contributor

emillon commented Jun 20, 2023

I think that there's something that's causing confusion but is not immediately clear from reading the code: on the one hand, there are byte ranges (what we're pattern matching on, the char type); on the other hand, there are codepoint ranges (based on Uchar.t). The set of valid Uchar.t values is 2 disjoint ranges (or, put differently, a range with a range-shaped hole in it). To convert from one to the other, some bit operations are necessary, so it would be awkward to do that at the byte level.

To make an analogy, decimal IPv4 addresses look like \d+\.\d+.\d+.\d+. You can somehow restrict the regexp to accept a maximum of 3 ints, or restrict the leading int to {1,2} but an explicit step with int is easier to reason about.

@emillon
Copy link
Contributor

emillon commented Jun 20, 2023

So yes I think these changes are good.

@tmattio tmattio force-pushed the of_utf8_exception_fix branch from 124f9c5 to 3d33071 Compare June 20, 2023 13:01
@tmattio tmattio merged commit faf3017 into ocaml-community:master Jun 20, 2023
tmattio added a commit to tmattio/opam-repository that referenced this pull request Jun 20, 2023
CHANGES:

* `Zed_utf8.next_error`: raise `Zed_utf8.Out_of_bounds` in case of invalid offset (@Lucccyo, ocaml-community/zed#52)
* `kill_next_word` should not raise `Out_of_bound` (@Lucccyo, ocaml-community/zed#55)
* `of_utf8`: add `Uchar.is_valid` to check the input (@Lucccyo, ocaml-community/zed#51)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants