Editorial: Introduce abstract ops UTF16Encode + UTF16DecodeString #1552

jmdyck · 2019-05-27T04:12:13Z

The main commits are 1 and 3, which define UTF16Encode and UTF16Decode respectively. Everything else is related editorial cleanup.

(Yes, there's already an op called UTF16Decode, but commit 2 renames it to something more appropriate.
So this changes the signature/behavior of UTF16Decode, but Web IDL and HTML don't reference it.)
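For readers who want the encoding direction spelled out, here is a minimal sketch, not the spec text. It assumes source text is modelled as an array of code point numbers; the names `utf16Encoding` and `utf16Encode` are hypothetical stand-ins for the per-code-point op (UTF16Encoding) and the new whole-text op (UTF16Encode).

```typescript
// Hypothetical model: a code point is an integer in 0..0x10FFFF,
// a code unit is an integer in 0..0xFFFF.
type CodePoint = number;
type CodeUnit = number;

// Per-code-point step, standing in for the spec's UTF16Encoding.
function utf16Encoding(cp: CodePoint): CodeUnit[] {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff) {
    throw new RangeError("not a code point: " + cp);
  }
  if (cp <= 0xffff) return [cp];
  // Supplementary code point: split into a leading and a trailing surrogate.
  const offset = cp - 0x10000;
  const lead = 0xd800 + (offset >> 10);
  const trail = 0xdc00 + (offset & 0x3ff);
  return [lead, trail];
}

// Whole-text step, standing in for UTF16Encode: concatenate the
// code units produced for each code point, in order.
function utf16Encode(text: CodePoint[]): string {
  let out = "";
  for (const cp of text) {
    out += String.fromCharCode(...utf16Encoding(cp));
  }
  return out;
}
```

For example, `utf16Encode([0x61, 0x1f600])` yields a three-code-unit String: `a` followed by the surrogate pair for U+1F600.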

annevk · 2019-05-27T08:33:02Z

I wish we could use UTF-16 less for this and instead have StringToCodePoints, CodePointsToString, and CodePointToCodeUnits or some such.

jmdyck · 2019-05-27T14:20:42Z

I was thinking of calling them SourceTextToString and StringToSourceText. One thing that stopped me was that those names suggest there's only a single way to get from each type to the other, but that's not the case. E.g., RegExpInitialize shows a different way to get from a String to a SourceText. (On the other hand, that's a very isolated case.)

So yeah, if the editors want different names, I'm okay with that.
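To illustrate the "more than one way" point, here is a rough sketch of the two String-to-text conversions as I read RegExpInitialize: with the `u` flag the pattern is UTF-16 decoded, and without it each code unit is taken as a code point on its own. The helper name `patternText` and the boolean parameter are assumptions for illustration, not spec names.

```typescript
// Hypothetical helper: choose the sequence of code points to parse
// as a pattern, given the pattern String and whether "u" is present.
function patternText(pattern: string, unicodeFlag: boolean): number[] {
  const out: number[] = [];
  if (unicodeFlag) {
    // UTF-16 decoding: a surrogate pair becomes one code point.
    let i = 0;
    while (i < pattern.length) {
      const cp = pattern.codePointAt(i)!;
      out.push(cp);
      i += cp > 0xffff ? 2 : 1;
    }
    return out;
  }
  // No "u" flag: every code unit is interpreted as a code point,
  // so surrogate pairs stay split into two elements.
  for (let i = 0; i < pattern.length; i++) {
    out.push(pattern.charCodeAt(i));
  }
  return out;
}
```

The same String therefore maps to different texts: a single astral character gives one element with the flag and two without it.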

spec.html

annevk · 2019-05-28T07:25:43Z

I kinda wanna object to spreading the misuse of UTF-16 further, but overall I guess this is preferable to what was there before...

gibson042

I like the differentiation this introduces between *String variables containing sequences of code units and *Text variables containing sequences of code points.

spec.html

jmdyck · 2019-06-17T03:26:33Z

Force-pushed to:

- resolve merge conflicts,
- put "!" before each call to UTF16Encode (as requested),
- rename the (new) UTF16Decode to UTF16DecodeString (as requested), and
- define UTF16DecodeString algorithmically using CodePointAt (now that PR #1532, "Editorial: Reference leading/trailing surrogate definitions more", has been merged).
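The last item can be sketched roughly as follows. This is an illustration under assumptions, not the PR's algorithm text: `codePointAt` here is a hypothetical free function modelled on my reading of the spec's CodePointAt (it returns the code point, how many code units it consumed, and whether it was an unpaired surrogate), and `utf16DecodeString` simply walks the String with it.

```typescript
// Hypothetical record shape, modelled on the spec's CodePointAt result.
interface CodePointRecord {
  codePoint: number;
  codeUnitCount: number;
  isUnpairedSurrogate: boolean;
}

function codePointAt(s: string, position: number): CodePointRecord {
  const first = s.charCodeAt(position);
  const isLead = first >= 0xd800 && first <= 0xdbff;
  const isTrail = first >= 0xdc00 && first <= 0xdfff;
  if (!isLead && !isTrail) {
    return { codePoint: first, codeUnitCount: 1, isUnpairedSurrogate: false };
  }
  // A trailing surrogate on its own, or a leading surrogate at the end.
  if (isTrail || position + 1 === s.length) {
    return { codePoint: first, codeUnitCount: 1, isUnpairedSurrogate: true };
  }
  const second = s.charCodeAt(position + 1);
  if (second < 0xdc00 || second > 0xdfff) {
    return { codePoint: first, codeUnitCount: 1, isUnpairedSurrogate: true };
  }
  // Well-formed pair: combine lead and trail into one code point.
  const cp = (first - 0xd800) * 0x400 + (second - 0xdc00) + 0x10000;
  return { codePoint: cp, codeUnitCount: 2, isUnpairedSurrogate: false };
}

// Walk the String, collecting one code point per step.
function utf16DecodeString(s: string): number[] {
  const codePoints: number[] = [];
  let position = 0;
  while (position < s.length) {
    const r = codePointAt(s, position);
    codePoints.push(r.codePoint);
    position += r.codeUnitCount;
  }
  return codePoints;
}
```

Note that in this sketch an unpaired surrogate is passed through as a code point with the same numeric value rather than rejected.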

jmdyck · 2019-06-17T17:29:04Z

(force-pushed to resolve new conflicts and do small cleanup)

jmdyck · 2019-06-20T03:23:13Z

(force-pushed to resolve new merge conflicts)

jmdyck · 2019-10-24T00:47:22Z

(force-pushed to resolve merge conflicts)

michaelficarra · 2019-10-25T01:33:00Z

spec.html

@@ -10942,7 +10965,9 @@ <h1>Static Semantics: StringValue</h1>
            IdentifierName IdentifierPart
        </emu-grammar>
        <emu-alg>
-          1. Return the String value consisting of the sequence of code units corresponding to |IdentifierName|. In determining the sequence any occurrences of `\\` |UnicodeEscapeSequence| are first replaced with the code point represented by the |UnicodeEscapeSequence| and then the code points of the entire |IdentifierName| are converted to code units by UTF16Encoding each code point.
+          1. Let _idText_ be the source text matched by |IdentifierName|.
+          1. Let _idTextUnescaped_ be the result of replacing any occurrences of `\\` |UnicodeEscapeSequence| in _idText_ with the code point represented by the |UnicodeEscapeSequence|.


the code point represented by the |UnicodeEscapeSequence|

This language is a little less precise than I would prefer. Could we define a syntax-directed operation like

The CPV of UnicodeEscapeSequence :: u Hex4Digits is the code point whose value is (0x1000ℝ times the MV of the first HexDigit) plus (0x100ℝ times the MV of the second HexDigit) plus (0x10ℝ times the MV of the third HexDigit) plus the MV of the fourth HexDigit. The CPV of UnicodeEscapeSequence :: u { CodePoint } is the code point whose value is the MV of CodePoint.

We could also extract that Hex4Digits bit into MV so we don't do the same maths in two places (I mostly copied it out of SV).

jmdyck · 2019-11-04

I like the suggestion, and I'm willing to work on it (have already started), but it's independent of this PR, and it seems to me that doing it properly will be a larger change. So I think it should be in a separate PR.

michaelficarra · 2019-11-04

Okay, I'm fine with it being done in a separate PR, since this PR isn't making anything worse than it already is.

jmdyck · 2019-11-04

Thanks. (BTW, in my preliminary work, I discovered that the spec was already assuming the existence of MV for Hex4Digits, thus this addition to PR #1301.)
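As a concrete reading of the CPV definition proposed in this thread, here is a small sketch. The function name `cpvOfUnicodeEscape` and the choice to pass the escape body as a string (the text after the backslash) are assumptions for illustration; the arithmetic follows the suggested wording.

```typescript
// Hypothetical helper: compute the CPV of a UnicodeEscapeSequence,
// given its text without the leading backslash, e.g. "u0041" or "u{1F600}".
function cpvOfUnicodeEscape(body: string): number {
  // UnicodeEscapeSequence :: u Hex4Digits
  const four = /^u([0-9a-fA-F]{4})$/.exec(body);
  if (four) {
    // MV of each HexDigit, combined exactly as in the proposed definition.
    const [a, b, c, d] = four[1].split("").map((ch) => parseInt(ch, 16));
    return 0x1000 * a + 0x100 * b + 0x10 * c + d;
  }
  // UnicodeEscapeSequence :: u { CodePoint }
  const braced = /^u\{([0-9a-fA-F]+)\}$/.exec(body);
  if (braced) {
    const mv = parseInt(braced[1], 16);
    if (mv > 0x10ffff) {
      throw new SyntaxError("code point out of range: " + body);
    }
    return mv;
  }
  throw new SyntaxError("not a UnicodeEscapeSequence: " + body);
}
```

With this, `cpvOfUnicodeEscape("u0041")` is 0x41 and `cpvOfUnicodeEscape("u{1F600}")` is 0x1F600.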

michaelficarra · 2019-10-25T01:37:12Z

spec.html

@@ -30865,7 +30890,9 @@ <h1>Static Semantics: StringValue</h1>
            RegExpIdentifierName[?U] RegExpIdentifierPart[?U]
        </emu-grammar>
        <emu-alg>
-          1. Return the String value consisting of the sequence of code units corresponding to |RegExpIdentifierName|. In determining the sequence any occurrences of `\\` |RegExpUnicodeEscapeSequence| are first replaced with the code point represented by the |RegExpUnicodeEscapeSequence| and then the code points of the entire |RegExpIdentifierName| are converted to code units by UTF16Encoding each code point.
+          1. Let _idText_ be the source text matched by |RegExpIdentifierName|.
+          1. Let _idTextUnescaped_ be the result of replacing any occurrences of `\\` |RegExpUnicodeEscapeSequence| in _idText_ with the code point represented by the |RegExpUnicodeEscapeSequence|.


Same here, we could define CPV for each of the alternatives in RegExpUnicodeEscapeSequence. I would much prefer that to the current text.

michaelficarra

LGTM otherwise. This is a great change.

michaelficarra · 2019-11-01T19:27:37Z

@jmdyck If you don't have time, and would be okay giving me permissions to push to your branch, I can do this for you.

ljharb · 2019-11-02T04:55:38Z

@michaelficarra since you have write on this repo, you should already have those permissions by default.

waiting on follow-up PR

jmdyck · 2019-11-15T03:19:12Z

force-pushed to resolve merge conflicts (in the 4th commit, from PR 1479)

michaelficarra · 2020-01-30T02:43:58Z

Ping @syg @bakkot please review. I'd prefer to get this in before #1547.

spec.html

bakkot · 2020-02-01T06:47:21Z

spec.html

-    <emu-clause id="sec-utf16decode" aoid="UTF16Decode">
-      <h1>Static Semantics: UTF16Decode ( _lead_, _trail_ )</h1>
+    <emu-clause id="sec-utf16encode" aoid="UTF16Encode">
+      <h1>Static Semantics: UTF16Encode ( _text_ )</h1>


I feel somewhat negatively about having both UTF16Encoding and UTF16Encode (because the names are too similar). Unfortunately I don't have an obviously superior name to suggest. UTF16EncodeText, perhaps?

bakkot

LGTM other than the naming thing, which I don't want to block this on: I would prefer we just land this and perhaps later improve the name, unless someone thinks there's an obviously correct fix.

ljharb reviewed May 27, 2019

ljharb added the editorial change label May 27, 2019

gibson042 reviewed May 29, 2019

jmdyck force-pushed the UTF16 branch from 2df8b27 to 29e11df Compare June 17, 2019 03:22

jmdyck force-pushed the UTF16 branch from 29e11df to 29bdf3f Compare June 17, 2019 17:28

jmdyck force-pushed the UTF16 branch from 29bdf3f to 524277f Compare June 20, 2019 03:22

ljharb approved these changes Aug 23, 2019

ljharb requested review from zenparsing, mathiasbynens and a team August 23, 2019 04:08

jmdyck mentioned this pull request Sep 14, 2019

Editorial: fix inconsistency re type of [[SourceText]] #1547

Merged

jmdyck force-pushed the UTF16 branch from 524277f to c5b7bda Compare October 24, 2019 00:46

ljharb requested review from syg, michaelficarra and bakkot and removed request for zenparsing October 24, 2019 22:24

michaelficarra reviewed Oct 25, 2019

michaelficarra previously requested changes Oct 25, 2019

michaelficarra approved these changes Nov 8, 2019

ljharb removed the request for review from a team November 9, 2019 04:09

ljharb self-assigned this Nov 9, 2019

jmdyck force-pushed the UTF16 branch from c5b7bda to bd4f8f2 Compare November 15, 2019 03:17

jmdyck mentioned this pull request Nov 15, 2019

Editorial: correctly determine flags for static RegExp parsing #1464

Merged

jmdyck changed the title ~~Editorial: Introduce abstract ops UTF16Encode + UTF16Decode~~ Editorial: Introduce abstract ops UTF16Encode + UTF16DecodeString Jan 30, 2020

bakkot reviewed Feb 1, 2020

bakkot reviewed Feb 1, 2020

bakkot approved these changes Feb 1, 2020

jmdyck added 6 commits January 31, 2020 23:06

Editorial: Define + use abstract op UTF16Encode (tc39#1552)

0aec1df

Editorial: Rename UTF16Decode to UTF16DecodeSurrogatePair (tc39#1552)

662f099

... because the former name is misleadingly general.

Editorial: Define + use abstract op UTF16DecodeString (tc39#1552)

335dec3

Note: There are a few other places in the spec where UTF-16 decoding is apparently involved (e.g., EscapeRegExpPattern), but it's not clear to me how to use UTF16DecodeString there.

Editorial: Rename some _fooText_ aliases (tc39#1552)

f9682ae

... to reserve that form for aliases whose value is a source text (sequence of Unicode code points).

Editorial: JSON.parse: insert missing UTF16DecodeString call (tc39#1552)

0012b7a

(It's only meaningful to parse something as an ES |Script| if that thing is ES source text (sequence of Unicode code points), not an ES String value.)

Editorial: factor out _pText_ in RegExpInitialize (tc39#1552)

1a4fa5f

I.e., first have a step where we determine the text that we're going to parse (which, for the _BMP_ = *true* case, takes a couple sentences to describe), and *then* have the step where we parse it.

ljharb force-pushed the UTF16 branch from bd4f8f2 to 1a4fa5f Compare February 1, 2020 07:07

ljharb merged commit 1a4fa5f into tc39:master Feb 1, 2020

bakkot mentioned this pull request Feb 1, 2020

editorial nit: having both UTF16Encode and UTF16Encoding is weird #1863

Closed

jmdyck deleted the UTF16 branch February 6, 2020 21:12

jmdyck mentioned this pull request Apr 24, 2021

Editorial: Improve specification of [RegExp]IdentifierNames #2392

Merged
