Editorial: Reference leading/trailing surrogate definitions more #1532

gibson042 · 2019-05-09T04:06:13Z

No description provided.

ljharb · 2019-05-10T19:34:42Z

spec.html

-              1. Let _second_ be the numeric value of the code unit at index _position_ + 1 within the String _s_.
-              1. If _second_ &lt; 0xDC00 or _second_ &gt; 0xDFFF, let _resultString_ be the String value consisting of the single code unit _first_.
+              1. Let _second_ be the code unit at index _position_ + 1 within the String _s_.
+              1. If _second_ is not a <emu-xref href="#trailing-surrogate"></emu-xref>, let _resultString_ be the String value consisting of the single code unit _first_.
              1. Else, let _resultString_ be the string-concatenation of the code unit _first_ and the code unit _second_.


It kind of seems like we could benefit from an abstract operation that takes s and position and returns either the concatenation or a List of first, second. Then we could use it both here and in codePointAt (and potentially other places).

It’d be nice to get that in this PR, if possible ;-)
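For illustration only, here is a minimal JavaScript sketch of the kind of helper being suggested, in the variant that returns the concatenation. The name `codeUnitsAt` and the two predicate helpers are hypothetical and not part of the spec or this PR; the sketch assumes `0 <= position < s.length`.

```javascript
// Hypothetical helpers; the names are illustrative, not spec text.
function isLeadingSurrogate(codeUnit) {
  return codeUnit >= 0xD800 && codeUnit <= 0xDBFF;
}

function isTrailingSurrogate(codeUnit) {
  return codeUnit >= 0xDC00 && codeUnit <= 0xDFFF;
}

// Returns the one or two code units that make up the code point
// starting at `position`, as a String.
function codeUnitsAt(s, position) {
  const first = s.charCodeAt(position);
  // Anything that is not a leading surrogate stands alone, and so does
  // a leading surrogate in the last position of the string.
  if (!isLeadingSurrogate(first) || position + 1 === s.length) {
    return s[position];
  }
  const second = s.charCodeAt(position + 1);
  // A leading surrogate not followed by a trailing surrogate is unpaired.
  if (!isTrailingSurrogate(second)) {
    return s[position];
  }
  return s[position] + s[position + 1];
}
```

A caller that only needs to advance an index can use the `length` of the returned String.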

Added in 9105de5, though I'm not sure whether the new operation belongs with UTF16Encoding and UTF16Decode under Section 10.1: Source Text or somewhere else, or even whether all three should move to something like a new "Code Point Operations" section under Section 7: Abstract Operations. Please share your thoughts.

Thanks, this is great. I don’t have an opinion on where it lives; alongside the UTF-16 ops seems fine for now.

mathiasbynens · 2019-05-12T17:33:57Z

spec.html

+
+    <emu-clause id="sec-codepointat" aoid="CodePointAt">
+      <h1>Static Semantics: CodePointAt ( _string_, _position_ )</h1>
+      <p>The abstract operation CodePointAt interprets a String _string_ as a sequence of UTF-16 encoded code points, as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, and reads from it a single code point starting with the code unit at index _position_. When called, the following steps are performed:</p>


Nit: UTF-16 encoded → UTF-16-encoded

Suggested change

- <p>The abstract operation CodePointAt interprets a String _string_ as a sequence of UTF-16 encoded code points, as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, and reads from it a single code point starting with the code unit at index _position_. When called, the following steps are performed:</p>
+ <p>The abstract operation CodePointAt interprets a String _string_ as a sequence of UTF-16-encoded code points, as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, and reads from it a single code point starting with the code unit at index _position_. When called, the following steps are performed:</p>

This one I will leave for now, because the "UTF-16 encoded" text is common to several other parts of the spec (e.g., ToNumber and many subsections of String objects).

jmdyck · 2019-05-13T22:42:27Z

I generally like the refactoring here, as it increases the encapsulation of ECMAScript's version of UTF-16 decoding. However, the lines that compare the numeric value of _cp_ to 0xFFFF are a bit of a leak in the abstraction.

Also, it seems a bit odd that, because CodePointAt calls UTF16Decode, %StringIteratorPrototype%.next has to then immediately call UTF16Encoding because it's only interested in the code units.

I like @ljharb's original suggestion where the abstract op returned the concatenation or a List of the contributing code units. (Not sure what you'd call it, maybe UTF16CodeUnitsAt.) We could still have CodePointAt().
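To make the round trip concrete, here is a rough JavaScript sketch of the arithmetic behind UTF16Decode and UTF16Encoding. The function names `utf16Decode` and `utf16Encoding` are illustrative stand-ins for the spec operations, and the bodies use the standard UTF-16 surrogate formulas rather than quoting spec text.

```javascript
// Combines a leading and a trailing surrogate into a code point.
// Assumes `lead` is in 0xD800..0xDBFF and `trail` is in 0xDC00..0xDFFF.
function utf16Decode(lead, trail) {
  return (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000;
}

// Splits a code point back into one or two code units.
function utf16Encoding(cp) {
  if (cp <= 0xFFFF) {
    return [cp];
  }
  const offset = cp - 0x10000;
  const cu1 = Math.floor(offset / 0x400) + 0xD800;
  const cu2 = (offset % 0x400) + 0xDC00;
  return [cu1, cu2];
}

// Decoding a pair and encoding the result yields the original code units,
// which is the redundant work described above for a caller that only
// wants the code units.
const cp = utf16Decode(0xD83D, 0xDE00); // U+1F600
const units = utf16Encoding(cp);        // [0xD83D, 0xDE00]
```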

gibson042 · 2019-05-14T20:20:23Z

The operations have similar but distinct requirements.

- %StringIteratorPrototype%.next needs the count of code units to slice.
- AdvanceStringIndex needs the count for index advancement.
- String.prototype.codePointAt needs a numeric code point value.
- Encode needs everything (the count for index advancing and a numeric value for unpaired surrogate checking and UTF-8 encoding).

I see four options:

1. The CodePointAt operation I've just added, which reports the code point and lets the operations that need a count infer it by comparison to the maximum single-code-unit value. That is a bit messy, but IMO no worse than option 2.
2. $hardToName returning a List of code units, letting the operations that need a code unit count get it from the number of elements and the operations that need a code point get it from UTF16Decode.
3. CodePointAt returning a Record with both kinds of data.
4. $hardToName2 returning just a count of code units.

I like option 3, but it seems like overkill to me so I have fallen back on 1. If there is consensus on another option, though, I'll make the change.
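As a rough illustration of the difference between options 1 and 3, here is a JavaScript sketch of an index-advancing caller under each. The built-in `String.prototype.codePointAt` stands in for the abstract operation of option 1; `codePointRecordAt`, its field names, and the two `advance...` functions are hypothetical names made up for this sketch.

```javascript
// Option 1: the operation reports only a code point, so every caller that
// needs a code unit count repeats the comparison against 0xFFFF.
function advanceWithOption1(s, index) {
  const cp = s.codePointAt(index);
  return index + (cp > 0xFFFF ? 2 : 1);
}

// Option 3: the operation reports both kinds of data in one record, so the
// 0xFFFF comparison lives in a single place.
function codePointRecordAt(s, index) {
  const cp = s.codePointAt(index);
  return { codePoint: cp, codeUnitCount: cp > 0xFFFF ? 2 : 1 };
}

function advanceWithOption3(s, index) {
  return index + codePointRecordAt(s, index).codeUnitCount;
}
```

Both callers advance by the same amount; the difference is only in where the knowledge about the maximum single-code-unit value is kept.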

ljharb · 2019-05-14T20:28:30Z

Option 3 does sound nice.

jmdyck · 2019-05-15T01:48:54Z

> Encode needs everything (the count for index advancing and a numeric value for unpaired surrogate checking and UTF-8 encoding).

Yeah, Encode is the real problem, because it's the only one that needs information at multiple "levels".

> I like option 3 [returning a Record], but it seems like overkill to me

I suspect it wouldn't be that bad.

You could also fold in Encode's lone surrogate checking (another leak in the abstraction): a field of the record could indicate whether a properly-encoded code point was found. That would almost completely encapsulate ECMAScript's version of UTF-16 decoding.

gibson042 · 2019-05-18T15:25:37Z

OK, updated CodePointAt to return a Record: 21e0abb...f2f6a60

The logic now uses multi-statement conditional blocks to avoid duplicating the unpaired surrogate return value, but could also be re-linearized upon request:

1. If first is a trailing surrogate or at the end of the string, return $unpaired.
1. Read second.
1. If second is not a trailing surrogate, return $unpaired.
1. Let cp be UTF16Decode(first, second).
1. Return $surrogatePair.
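The linearized steps above can be sketched in JavaScript roughly as follows. This is an illustration under stated assumptions, not the spec text: the field name `codeUnitCount` mirrors [[CodeUnitCount]] from this thread, while `isUnpairedSurrogate`, the helper names, and the early return for non-surrogate code units (which the five steps take as already handled) are assumptions of this sketch. It also assumes `0 <= position < string.length`.

```javascript
// Illustrative helpers; names are not from the spec.
function isLead(codeUnit) {
  return codeUnit >= 0xD800 && codeUnit <= 0xDBFF;
}

function isTrail(codeUnit) {
  return codeUnit >= 0xDC00 && codeUnit <= 0xDFFF;
}

function codePointAt(string, position) {
  const first = string.charCodeAt(position);
  // Assumed preliminary step: a non-surrogate code unit is its own code point.
  if (!isLead(first) && !isTrail(first)) {
    return { codePoint: first, codeUnitCount: 1, isUnpairedSurrogate: false };
  }
  // Step 1: a trailing surrogate, or a surrogate at the end of the string.
  if (isTrail(first) || position + 1 === string.length) {
    return { codePoint: first, codeUnitCount: 1, isUnpairedSurrogate: true };
  }
  // Step 2: read second.
  const second = string.charCodeAt(position + 1);
  // Step 3: a leading surrogate not followed by a trailing surrogate.
  if (!isTrail(second)) {
    return { codePoint: first, codeUnitCount: 1, isUnpairedSurrogate: true };
  }
  // Step 4: the UTF16Decode arithmetic.
  const cp = (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000;
  // Step 5: the surrogate pair result.
  return { codePoint: cp, codeUnitCount: 2, isUnpairedSurrogate: false };
}
```

With a record of this shape, a caller that slices or advances reads `codeUnitCount`, a caller that wants a number reads `codePoint`, and an Encode-like caller can check `isUnpairedSurrogate` without comparing against surrogate ranges itself.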

jmdyck · 2019-05-21T13:55:14Z

> OK, updated CodePointAt to return a Record:

Yeah, I like this approach.

> The logic now uses multi-statement conditional blocks to avoid duplicating the unpaired surrogate return value, but could also be re-linearized upon request:

Personally, I think the suggested linearization would make the operation easier to understand. Duplicating the unpaired return doesn't bother me.

One thing that I think might increase readability would be to put each 'return' on its own (sub)step, because then:

- depending on window width, you're more likely to get each Record constructor on a single line, and
- the Record constructors will (mostly) line up vertically, making it easier to compare them.

jmdyck · 2019-05-21T23:51:35Z

> Rename [[CodeUnitCount]] to [[Length]] for better small-window rendering

I've got mixed feelings about this, because CodeUnitCount was really clear about its meaning. Could you add a sentence to the preamble for CodePointAt, defining [[Length]]?

ljharb · 2019-05-22T00:34:27Z

I’m not super concerned about small window rendering, but I’m very concerned about self-documenting names - maybe we could flip that one back?

gibson042 · 2019-05-22T01:30:10Z

OK, fair enough. Reverted.

Notes:
- Once PR tc39#1532 is merged, UTF16Decode can be made more precise using CodePointAt.
- There are a few other places in the spec where UTF-16 decoding is apparently involved (e.g., EscapeRegExpPattern), but it's not clear to me how to use UTF16Decode there.


gibson042 added the editorial change label May 9, 2019

ljharb approved these changes May 10, 2019

ljharb requested review from zenparsing, a team and mathiasbynens May 10, 2019 19:35

mathiasbynens approved these changes May 12, 2019

mathiasbynens reviewed May 12, 2019

gibson042 force-pushed the 2019-05-surrogate-refs branch from af74864 to 9105de5 Compare May 12, 2019 05:53

mathiasbynens reviewed May 12, 2019

michaelficarra reviewed May 15, 2019

michaelficarra reviewed May 15, 2019

michaelficarra mentioned this pull request May 17, 2019

Editorial: remove usages of increase/increment and decrease/decrement #1542

Merged

ljharb reviewed May 18, 2019

jmdyck reviewed May 21, 2019

zenparsing approved these changes May 29, 2019

ljharb self-assigned this May 29, 2019

gibson042 added 2 commits June 1, 2019 21:45

Editorial: Reference leading/trailing surrogate definitions more (tc3…

7deeb91

…9#1532)

Editorial: Add CodePointAt abstract operation (tc39#1532)

659fb6e

- Make use of CodePointAt elsewhere in the spec

ljharb force-pushed the 2019-05-surrogate-refs branch from 5bd2e7a to 659fb6e Compare June 2, 2019 04:45

ljharb merged commit 659fb6e into tc39:master Jun 2, 2019

jmdyck mentioned this pull request Jun 17, 2019

Editorial: Introduce abstract ops UTF16Encode + UTF16DecodeString #1552

Merged
