Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Editorial: Reference leading/trailing surrogate definitions more #1532

Merged
merged 2 commits into from
Jun 2, 2019

Conversation

gibson042
Copy link
Contributor

No description provided.

spec.html Outdated
1. Let _second_ be the numeric value of the code unit at index _position_ + 1 within the String _s_.
1. If _second_ < 0xDC00 or _second_ > 0xDFFF, let _resultString_ be the String value consisting of the single code unit _first_.
1. Let _second_ be the code unit at index _position_ + 1 within the String _s_.
1. If _second_ is not a <emu-xref href="#trailing-surrogate"></emu-xref>, let _resultString_ be the String value consisting of the single code unit _first_.
1. Else, let _resultString_ be the string-concatenation of the code unit _first_ and the code unit _second_.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it kind of seems like we could benefit from an abstract operation that takes s, and position, and returns either the concatenation, or a List of first, second - then we could use it both here, and in codePointAt (and potentially other places)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’d be nice to get that in this PR, if possible ;-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 9105de5, though I'm not sure whether the new operation belongs with UTF16Encoding and UTF16Decode under Section 10.1: Source Text or somewhere else, or even whether all three should move to something like a new "Code Point Operations" section under Section 7: Abstract Operations. Please share your thoughts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is great. i don’t have an opinion on where it lives; alongside the utf16 ops seems fine for now.

@ljharb ljharb requested review from zenparsing, a team and mathiasbynens May 10, 2019 19:35
@gibson042 gibson042 force-pushed the 2019-05-surrogate-refs branch from af74864 to 9105de5 Compare May 12, 2019 05:53

<emu-clause id="sec-codepointat" aoid="CodePointAt">
<h1>Static Semantics: CodePointAt ( _string_, _position_ )</h1>
<p>The abstract operation CodePointAt interprets a String _string_ as a sequence of UTF-16 encoded code points, as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, and reads from it a single code point starting with the code unit at index _position_. When called, the following steps are performed:</p>
Copy link
Member

@mathiasbynens mathiasbynens May 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: UTF-16 encodedUTF-16-encoded

Suggested change
<p>The abstract operation CodePointAt interprets a String _string_ as a sequence of UTF-16 encoded code points, as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, and reads from it a single code point starting with the code unit at index _position_. When called, the following steps are performed:</p>
<p>The abstract operation CodePointAt interprets a String _string_ as a sequence of UTF-16-encoded code points, as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, and reads from it a single code point starting with the code unit at index _position_. When called, the following steps are performed:</p>

Copy link
Contributor Author

@gibson042 gibson042 May 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one I will leave for now, because the "UTF-16 encoded" text is common to several other parts of the spec (e.g., ToNumber and many subsections of String objects).

@jmdyck
Copy link
Collaborator

jmdyck commented May 13, 2019

I generally like the refactoring here, as it increases the encapsulation of ECMAScript's version of UTF-16 decoding. However, the lines that compare the numeric value of _cp_ to 0xFFFF are a bit of a leak in the abstraction.

Also, it seems a bit odd that, because CodePointAt calls UTF16Decode, %StringIteratorPrototype%.next has to then immediately call UTF16Encoding because it's only interested in the code units.

I like @ljharb's original suggestion where the abstract op returned the concatenation or a List of the contributing code units. (Not sure what you'd call it, maybe UTF16CodeUnitsAt.) We could still have CodePointAt().

@gibson042
Copy link
Contributor Author

gibson042 commented May 14, 2019

The operations have similar but distinct requirements.

  • %StringIteratorPrototype%.next needs the count of code units to slice
  • AdvanceStringIndex needs the count for index advancement
  • String.prototype.codePointAt needs a numeric code point value
  • Encode needs everything (the count for index advancing and a numeric value for unpaired surrogate checking and UTF-8 encoding).

The 1) CodePointAt operation I've just added reports the code point and lets the operations that need count infer it by comparison to the maximum single-code-unit value, which is a bit messy but IMO no worse than 2) $hardToName returning a List of code units and letting the operations that need code unit count get it from the number of elements and the operations that need a code point get it from UTF16Decode. I suppose the other alternatives would be 3) CodePointAt returning a Record with both kinds of data, or 4) $hardToName2 returning just a count of code units.

I like option 3, but it seems like overkill to me so I have fallen back on 1. If there is consensus on another option, though, I'll make the change.

@ljharb
Copy link
Member

ljharb commented May 14, 2019

Option 3 does sound nice.

@jmdyck
Copy link
Collaborator

jmdyck commented May 15, 2019

Encode needs everything (the count for index advancing and a numeric value for unpaired surrogate checking and UTF-8 encoding).

Yeah, Encode is the real problem, because it's the only one that needs information at multiple "levels".

I like option 3 [returning a Record], but it seems like overkill to me

I suspect it wouldn't be that bad.

You could also fold in Encode's lone surrogate checking (another leak in the abstraction): a field of the record could indicate whether a properly-encoded code point was found. That would almost completely encapsulate ECMAScript's version of UTF-16 decoding.

@gibson042
Copy link
Contributor Author

OK, updated CodePointAt to return a Record: 21e0abb...f2f6a60

The logic now uses multi-statement conditional blocks to avoid duplicating the unpaired surrogate return value, but could also be re-linearized upon request:

  1. If first is a trailing surrogate or at the end of the string, return $unpaired.
  2. Read second.
  3. If second is not a trailing surrogate, return $unpaired.
  4. Let cp be UTF16Decode(first, second).
  5. Return $surrogatePair.

@jmdyck
Copy link
Collaborator

jmdyck commented May 21, 2019

OK, updated CodePointAt to return a Record:

Yeah, I like this approach.

The logic now uses multi-statement conditional blocks to avoid duplicating the unpaired surrogate return value, but could also be re-linearized upon request:

Personally, I think the suggested linearization would make the operation easier to understand. Duplicating the unpaired return doesn't bother me.

One thing that I think might increase readability would be to put each 'return' on its own (sub)step, because then:

  • depending on window width, you're more likely to get each Record constructor on a single line, and
  • the Record constructors will (mostly) line up vertically, making it easier to compare them.

@jmdyck
Copy link
Collaborator

jmdyck commented May 21, 2019

Rename [[CodeUnitCount]] to [[Length]] for better small-window rendering

I've got mixed feelings about this, because CodeUnitCount was really clear about its meaning. Could you add a sentence to the preamble for CodePointAt, defining [[Length]]?

@ljharb
Copy link
Member

ljharb commented May 22, 2019

I’m not super concerned about small window rendering, but I’m very concerned about self-documenting names - maybe we could flip that one back?

@gibson042
Copy link
Contributor Author

OK, fair enough. Reverted.

jmdyck added a commit to jmdyck/ecma262 that referenced this pull request May 27, 2019
Notes:

 - Once PR tc39#1532 is merged, UTF16Decode can be made
   more precide using CodePointAt.

 - There are a few other places in the spec where
   UTF-16 decoding is apparently involved (e.g., EscapeRegExpPattern),
   but it's not clear to me how to use UTF16Decode there.
@ljharb ljharb self-assigned this May 29, 2019
@ljharb ljharb force-pushed the 2019-05-surrogate-refs branch from 5bd2e7a to 659fb6e Compare June 2, 2019 04:45
@ljharb ljharb merged commit 659fb6e into tc39:master Jun 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants