-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Editorial: Introduce abstract ops UTF16Encode + UTF16DecodeString #1552
Conversation
I wish we could use UTF-16 less for this and instead have StringToCodePoints, CodePointsToString, and CodePointToCodeUnits or some such. |
I was thinking of calling them SourceTextToString and StringToSourceText. One thing that stopped me was that those names suggest there's only a single way to get from each type to the other, but that's not the case. E.g., RegExpInitialize shows a different way to get from a String to a SourceText. (On the other hand, that's a very isolated case.) So yeah, if the editors want different names, I'm okay with that. |
I kinda wanna object to spreading the misuse of UTF-16 further, but overall I guess this is preferable to what was there before... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the differentiation this introduces between *String variables containing sequences of code units and *Text variables containing sequences of code points.
Force-pushed to:
|
(force-pushed to resolve new conflicts and do small cleanup) |
(force-pushed to resolve new merge conflicts) |
(force-pushed to resolve merge conflicts) |
spec.html
Outdated
@@ -10942,7 +10965,9 @@ <h1>Static Semantics: StringValue</h1> | |||
IdentifierName IdentifierPart | |||
</emu-grammar> | |||
<emu-alg> | |||
1. Return the String value consisting of the sequence of code units corresponding to |IdentifierName|. In determining the sequence any occurrences of `\\` |UnicodeEscapeSequence| are first replaced with the code point represented by the |UnicodeEscapeSequence| and then the code points of the entire |IdentifierName| are converted to code units by UTF16Encoding each code point. | |||
1. Let _idText_ be the source text matched by |IdentifierName|. | |||
1. Let _idTextUnescaped_ be the result of replacing any occurrences of `\\` |UnicodeEscapeSequence| in _idText_ with the code point represented by the |UnicodeEscapeSequence|. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code point represented by the |UnicodeEscapeSequence|
This language is a little less precise than I would prefer. Could we define a syntax-directed operation like
The CPV of UnicodeEscapeSequence :: u Hex4Digits is the code point whose value is (0x1000ℝ times the MV of the first HexDigit) plus (0x100ℝ times the MV of the second HexDigit) plus (0x10ℝ times the MV of the third HexDigit) plus the MV of the fourth HexDigit.
The CPV of UnicodeEscapeSequence :: u { CodePoint } is the code point whose value is the MV of CodePoint.
We could also extract that Hex4Digits bit into MV so we don't do the same maths in two places (I mostly copied it out of SV).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the suggestion, and I'm willing to work on it (have already started), but it's independent of this PR, and it seems to me that doing it properly will be a larger change. So I think it should be in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I'm fine with it being done in a separate PR since this PR isn't making anything worse than it already is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. (BTW, in my preliminary work, I discovered that the spec was already assuming the existence of MV for Hex4Digits
, thus this addition to PR #1301.)
spec.html
Outdated
@@ -30865,7 +30890,9 @@ <h1>Static Semantics: StringValue</h1> | |||
RegExpIdentifierName[?U] RegExpIdentifierPart[?U] | |||
</emu-grammar> | |||
<emu-alg> | |||
1. Return the String value consisting of the sequence of code units corresponding to |RegExpIdentifierName|. In determining the sequence any occurrences of `\\` |RegExpUnicodeEscapeSequence| are first replaced with the code point represented by the |RegExpUnicodeEscapeSequence| and then the code points of the entire |RegExpIdentifierName| are converted to code units by UTF16Encoding each code point. | |||
1. Let _idText_ be the source text matched by |RegExpIdentifierName|. | |||
1. Let _idTextUnescaped_ be the result of replacing any occurrences of `\\` |RegExpUnicodeEscapeSequence| in _idText_ with the code point represented by the |RegExpUnicodeEscapeSequence|. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, we could define CPV for each of the alternatives in RegExpUnicodeEscapeSequence. I would much prefer that to the current text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM otherwise. This is a great change.
@jmdyck If you don't have time, and would be okay giving me permissions to push to your branch, I can do this for you. |
@michaelficarra since you have write on this repo, you should already have those permissions by default. |
force-pushed to resolve merge conflicts (in the 4th commit, from PR 1479) |
spec.html
Outdated
<emu-clause id="sec-utf16decode" aoid="UTF16Decode"> | ||
<h1>Static Semantics: UTF16Decode ( _lead_, _trail_ )</h1> | ||
<emu-clause id="sec-utf16encode" aoid="UTF16Encode"> | ||
<h1>Static Semantics: UTF16Encode ( _text_ )</h1> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel somewhat negatively about having both UTF16Encoding
and UTF16Encode
(because the names are too similar). Unfortunately I don't have an obviously superior name to suggest. UTF16EncodeText
, perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM other than the naming thing, which I don't want to block this on: I would prefer we just land this and perhaps later improve the name, unless someone thinks there's an obviously correct fix.
... because the former name is misleadingly general.
Note: There are a few other places in the spec where UTF-16 decoding is apparently involved (e.g., EscapeRegExpPattern), but it's not clear to me how to use UTF16DecodeString there.
... to reserve that form for aliases whose value is a source text (sequence of Unicode code points).
(It's only meaningful to parse something as an ES |Script| if that thing is ES source text (sequence of Unicode code points), not an ES String value.)
I.e., first have a step where we determine the text that we're going to parse (which, for the _BMP_ = *true* case, takes a couple sentences to describe), and *then* have the step where we parse it.
The main commits are the 1 and 3, which define UTF16Encode and UTF16Decode respectively. Everything else is related editorial changes.
(Yes, there's already an op called UTF16Decode, but commit 2 renames it to something more appropriate.
So this changes the signature/behavior of UTF16Decode, but Web IDL and HTML don't reference it.)