Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Editorial: Introduce abstract ops UTF16Encode + UTF16DecodeString #1552

Merged
merged 6 commits into from
Feb 1, 2020

Conversation

jmdyck
Copy link
Collaborator

@jmdyck jmdyck commented May 27, 2019

The main commits are the 1 and 3, which define UTF16Encode and UTF16Decode respectively. Everything else is related editorial changes.

(Yes, there's already an op called UTF16Decode, but commit 2 renames it to something more appropriate.
So this changes the signature/behavior of UTF16Decode, but Web IDL and HTML don't reference it.)

@annevk
Copy link
Member

annevk commented May 27, 2019

I wish we could use UTF-16 less for this and instead have StringToCodePoints, CodePointsToString, and CodePointToCodeUnits or some such.

@jmdyck
Copy link
Collaborator Author

jmdyck commented May 27, 2019

I was thinking of calling them SourceTextToString and StringToSourceText. One thing that stopped me was that those names suggest there's only a single way to get from each type to the other, but that's not the case. E.g., RegExpInitialize shows a different way to get from a String to a SourceText. (On the other hand, that's a very isolated case.)

So yeah, if the editors want different names, I'm okay with that.

spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
@annevk
Copy link
Member

annevk commented May 28, 2019

I kinda wanna object to spreading the misuse of UTF-16 further, but overall I guess this is preferable to what was there before...

Copy link
Contributor

@gibson042 gibson042 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the differentiation this introduces between *String variables containing sequences of code units and *Text variables containing sequences of code points.

spec.html Outdated Show resolved Hide resolved
@jmdyck
Copy link
Collaborator Author

jmdyck commented Jun 17, 2019

Force-pushed to:

@jmdyck
Copy link
Collaborator Author

jmdyck commented Jun 17, 2019

(force-pushed to resolve new conflicts and do small cleanup)

@jmdyck
Copy link
Collaborator Author

jmdyck commented Jun 20, 2019

(force-pushed to resolve new merge conflicts)

@jmdyck
Copy link
Collaborator Author

jmdyck commented Oct 24, 2019

(force-pushed to resolve merge conflicts)

@ljharb ljharb requested review from syg, michaelficarra and bakkot and removed request for zenparsing October 24, 2019 22:24
spec.html Outdated
@@ -10942,7 +10965,9 @@ <h1>Static Semantics: StringValue</h1>
IdentifierName IdentifierPart
</emu-grammar>
<emu-alg>
1. Return the String value consisting of the sequence of code units corresponding to |IdentifierName|. In determining the sequence any occurrences of `\\` |UnicodeEscapeSequence| are first replaced with the code point represented by the |UnicodeEscapeSequence| and then the code points of the entire |IdentifierName| are converted to code units by UTF16Encoding each code point.
1. Let _idText_ be the source text matched by |IdentifierName|.
1. Let _idTextUnescaped_ be the result of replacing any occurrences of `\\` |UnicodeEscapeSequence| in _idText_ with the code point represented by the |UnicodeEscapeSequence|.
Copy link
Member

@michaelficarra michaelficarra Oct 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code point represented by the |UnicodeEscapeSequence|

This language is a little less precise than I would prefer. Could we define a syntax-directed operation like

The CPV of UnicodeEscapeSequence :: u Hex4Digits is the code point whose value is (0x1000ℝ times the MV of the first HexDigit) plus (0x100ℝ times the MV of the second HexDigit) plus (0x10ℝ times the MV of the third HexDigit) plus the MV of the fourth HexDigit.
The CPV of UnicodeEscapeSequence :: u { CodePoint } is the code point whose value is the MV of CodePoint.

We could also extract that Hex4Digits bit into MV so we don't do the same maths in two places (I mostly copied it out of SV).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the suggestion, and I'm willing to work on it (have already started), but it's independent of this PR, and it seems to me that doing it properly will be a larger change. So I think it should be in a separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I'm fine with it being done in a separate PR since this PR isn't making anything worse than it already is.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. (BTW, in my preliminary work, I discovered that the spec was already assuming the existence of MV for Hex4Digits, thus this addition to PR #1301.)

spec.html Outdated
@@ -30865,7 +30890,9 @@ <h1>Static Semantics: StringValue</h1>
RegExpIdentifierName[?U] RegExpIdentifierPart[?U]
</emu-grammar>
<emu-alg>
1. Return the String value consisting of the sequence of code units corresponding to |RegExpIdentifierName|. In determining the sequence any occurrences of `\\` |RegExpUnicodeEscapeSequence| are first replaced with the code point represented by the |RegExpUnicodeEscapeSequence| and then the code points of the entire |RegExpIdentifierName| are converted to code units by UTF16Encoding each code point.
1. Let _idText_ be the source text matched by |RegExpIdentifierName|.
1. Let _idTextUnescaped_ be the result of replacing any occurrences of `\\` |RegExpUnicodeEscapeSequence| in _idText_ with the code point represented by the |RegExpUnicodeEscapeSequence|.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, we could define CPV for each of the alternatives in RegExpUnicodeEscapeSequence. I would much prefer that to the current text.

Copy link
Member

@michaelficarra michaelficarra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise. This is a great change.

@michaelficarra
Copy link
Member

@jmdyck If you don't have time, and would be okay giving me permissions to push to your branch, I can do this for you.

@ljharb
Copy link
Member

ljharb commented Nov 2, 2019

@michaelficarra since you have write on this repo, you should already have those permissions by default.

@michaelficarra michaelficarra dismissed their stale review November 4, 2019 17:10

waiting on follow-up PR

@ljharb ljharb removed the request for review from a team November 9, 2019 04:09
@ljharb ljharb self-assigned this Nov 9, 2019
@jmdyck
Copy link
Collaborator Author

jmdyck commented Nov 15, 2019

force-pushed to resolve merge conflicts (in the 4th commit, from PR 1479)

@michaelficarra
Copy link
Member

Ping @syg @bakkot please review. I'd prefer to get this in before #1547.

@jmdyck jmdyck changed the title Editorial: Introduce abstract ops UTF16Encode + UTF16Decode Editorial: Introduce abstract ops UTF16Encode + UTF16DecodeString Jan 30, 2020
spec.html Outdated Show resolved Hide resolved
spec.html Outdated
<emu-clause id="sec-utf16decode" aoid="UTF16Decode">
<h1>Static Semantics: UTF16Decode ( _lead_, _trail_ )</h1>
<emu-clause id="sec-utf16encode" aoid="UTF16Encode">
<h1>Static Semantics: UTF16Encode ( _text_ )</h1>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel somewhat negatively about having both UTF16Encoding and UTF16Encode (because the names are too similar). Unfortunately I don't have an obviously superior name to suggest. UTF16EncodeText, perhaps?

Copy link
Contributor

@bakkot bakkot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than the naming thing, which I don't want to block this on: I would prefer we just land this and perhaps later improve the name, unless someone thinks there's an obviously correct fix.

... because the former name is misleadingly general.
Note: There are a few other places in the spec where
UTF-16 decoding is apparently involved (e.g., EscapeRegExpPattern),
but it's not clear to me how to use UTF16DecodeString there.
... to reserve that form for aliases whose value is
a source text (sequence of Unicode code points).
(It's only meaningful to parse something as an ES |Script|
if that thing is ES source text (sequence of Unicode code points),
not an ES String value.)
I.e., first have a step where we determine the text that we're going to parse
(which, for the _BMP_ = *true* case, takes a couple sentences to describe),
and *then* have the step where we parse it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants