Editorial: Reference leading/trailing surrogate definitions more #1532
spec.html
Outdated
1. Let _second_ be the numeric value of the code unit at index _position_ + 1 within the String _s_.
1. If _second_ < 0xDC00 or _second_ > 0xDFFF, let _resultString_ be the String value consisting of the single code unit _first_.
1. Let _second_ be the code unit at index _position_ + 1 within the String _s_.
1. If _second_ is not a <emu-xref href="#trailing-surrogate"></emu-xref>, let _resultString_ be the String value consisting of the single code unit _first_.
1. Else, let _resultString_ be the string-concatenation of the code unit _first_ and the code unit _second_.
It kind of seems like we could benefit from an abstract operation that takes s and position, and returns either the concatenation or a List of first, second. Then we could use it both here and in codePointAt (and potentially other places).
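For illustration only, the steps in the diff above could be sketched in JavaScript as follows. This is a reading of those steps, not spec text: the name `readStringAt` is hypothetical, `position` is assumed to be a valid index, and the leading-surrogate and end-of-string checks are assumed from the usual surrogate ranges, because the steps that precede _second_ are not shown in the excerpt.

```javascript
// Hypothetical helpers; the names are illustrative and do not come from the spec.
const isLeadingSurrogate = (codeUnit) => codeUnit >= 0xD800 && codeUnit <= 0xDBFF;
const isTrailingSurrogate = (codeUnit) => codeUnit >= 0xDC00 && codeUnit <= 0xDFFF;

// Returns the one- or two-code-unit string that starts at `position` in `s`.
// Assumes 0 <= position < s.length.
function readStringAt(s, position) {
  const first = s.charCodeAt(position);
  // Assumed earlier steps: a non-leading code unit, or a leading surrogate
  // at the very end of the string, stands alone.
  if (!isLeadingSurrogate(first) || position + 1 === s.length) {
    return String.fromCharCode(first);
  }
  const second = s.charCodeAt(position + 1);
  // "If second is not a trailing surrogate, resultString is the single code unit first."
  if (!isTrailingSurrogate(second)) {
    return String.fromCharCode(first);
  }
  // "Else, resultString is the string-concatenation of first and second."
  return String.fromCharCode(first, second);
}
```

A helper along these lines would let both call sites share the surrogate check instead of repeating the numeric range comparisons.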
It’d be nice to get that in this PR, if possible ;-)
Added in 9105de5, though I'm not sure whether the new operation belongs with UTF16Encoding and UTF16Decode under Section 10.1: Source Text or somewhere else, or even whether all three should move to something like a new "Code Point Operations" section under Section 7: Abstract Operations. Please share your thoughts.
Thanks, this is great. I don’t have an opinion on where it lives; alongside the UTF-16 ops seems fine for now.
af74864 to 9105de5
<emu-clause id="sec-codepointat" aoid="CodePointAt">
<h1>Static Semantics: CodePointAt ( _string_, _position_ )</h1>
<p>The abstract operation CodePointAt interprets a String _string_ as a sequence of UTF-16 encoded code points, as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, and reads from it a single code point starting with the code unit at index _position_. When called, the following steps are performed:</p>
Nit: UTF-16 encoded → UTF-16-encoded
<p>The abstract operation CodePointAt interprets a String _string_ as a sequence of UTF-16 encoded code points, as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, and reads from it a single code point starting with the code unit at index _position_. When called, the following steps are performed:</p>
<p>The abstract operation CodePointAt interprets a String _string_ as a sequence of UTF-16-encoded code points, as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, and reads from it a single code point starting with the code unit at index _position_. When called, the following steps are performed:</p>
This one I will leave for now, because the "UTF-16 encoded" text is common to several other parts of the spec (e.g., ToNumber and many subsections of String objects).
I generally like the refactoring here, as it increases the encapsulation of ECMAScript's version of UTF-16 decoding. However, the lines that compare Also, it seems a bit odd that, because I like @ljharb's original suggestion where the abstract op returned the concatenation or a List of the contributing code units. (Not sure what you'd call it, maybe |
The operations have similar but distinct requirements.
The 1) CodePointAt operation I've just added reports the code point and lets the operations that need a code unit count infer it by comparison to the maximum single-code-unit value, which is a bit messy but IMO no worse than 2) $hardToName returning a List of code units, letting the operations that need a code unit count get it from the number of elements and the operations that need a code point get it from UTF16Decode. I suppose the other alternatives would be 3) CodePointAt returning a Record with both kinds of data, or 4) $hardToName2 returning just a count of code units. I like option 3, but it seems like overkill to me, so I have fallen back on 1. If there is consensus on another option, though, I'll make the change.
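To make option 1 concrete, here is a rough JavaScript sketch under stated assumptions. The names `codePointAtOption1` and `inferCodeUnitCount` are hypothetical, `position` is assumed to be a valid index, and the decoding arithmetic is the standard UTF-16 surrogate-pair formula rather than anything quoted from the PR.

```javascript
// Option 1 sketch: the operation reports only a code point.
// Assumes 0 <= position < string.length.
function codePointAtOption1(string, position) {
  const first = string.charCodeAt(position);
  // Anything other than a leading surrogate with a following code unit stands alone.
  if (first < 0xD800 || first > 0xDBFF || position + 1 === string.length) {
    return first;
  }
  const second = string.charCodeAt(position + 1);
  // A leading surrogate not followed by a trailing surrogate also stands alone.
  if (second < 0xDC00 || second > 0xDFFF) {
    return first;
  }
  // Standard UTF-16 decoding of a surrogate pair.
  return (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000;
}

// Callers that need a code unit count infer it by comparing against
// the maximum single-code-unit value, 0xFFFF.
function inferCodeUnitCount(codePoint) {
  return codePoint > 0xFFFF ? 2 : 1;
}
```

The comparison in `inferCodeUnitCount` is the "messy" part: every caller that advances through a string has to repeat it, which is what options 2 through 4 try to avoid.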
Option 3 does sound nice.
Yeah, Encode is the real problem, because it's the only one that needs information at multiple "levels".
I suspect it wouldn't be that bad. You could also fold in Encode's lone surrogate checking (another leak in the abstraction): a field of the record could indicate whether a properly-encoded code point was found. That would almost completely encapsulate ECMAScript's version of UTF-16 decoding.
OK, updated CodePointAt to return a Record: 21e0abb...f2f6a60 The logic now uses multi-statement conditional blocks to avoid duplicating the unpaired surrogate return value, but could also be re-linearized upon request:
|
Yeah, I like this approach.
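As a rough sketch of the Record-returning shape discussed above, written in JavaScript in a linearized style with one return per outcome: the function name `codePointAtRecord` and the field names `codePoint`, `codeUnitCount`, and `isUnpairedSurrogate` are assumptions chosen for illustration, not quoted from the commit, and `position` is assumed to be a valid index.

```javascript
// Option 3 sketch: return a record carrying the code point, the number of
// code units consumed, and whether an unpaired surrogate was encountered.
// Assumes 0 <= position < string.length.
function codePointAtRecord(string, position) {
  const first = string.charCodeAt(position);
  const isLeading = first >= 0xD800 && first <= 0xDBFF;
  const isTrailing = first >= 0xDC00 && first <= 0xDFFF;
  if (!isLeading && !isTrailing) {
    return { codePoint: first, codeUnitCount: 1, isUnpairedSurrogate: false };
  }
  if (isTrailing || position + 1 === string.length) {
    return { codePoint: first, codeUnitCount: 1, isUnpairedSurrogate: true };
  }
  const second = string.charCodeAt(position + 1);
  if (second < 0xDC00 || second > 0xDFFF) {
    return { codePoint: first, codeUnitCount: 1, isUnpairedSurrogate: true };
  }
  // Standard UTF-16 decoding of a surrogate pair.
  const codePoint = (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000;
  return { codePoint, codeUnitCount: 2, isUnpairedSurrogate: false };
}

// Usage: walk a string by code point, collecting lone surrogates on the way,
// roughly what an Encode-style caller would need.
function findUnpairedSurrogates(string) {
  const indices = [];
  let k = 0;
  while (k < string.length) {
    const record = codePointAtRecord(string, k);
    if (record.isUnpairedSurrogate) indices.push(k);
    k += record.codeUnitCount;
  }
  return indices;
}
```

With this shape, a caller needs neither the 0xFFFF comparison nor its own surrogate range checks. The unpaired-surrogate return is duplicated across two branches, which is the trade-off the linearized form accepts.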
Personally, I think the suggested linearization would make the operation easier to understand. Duplicating the unpaired return doesn't bother me. One thing that I think might increase readability would be to put each 'return' on its own (sub)step, because then:
|
I've got mixed feelings about this, because |
I’m not super concerned about small window rendering, but I’m very concerned about self-documenting names - maybe we could flip that one back?
OK, fair enough. Reverted.
Notes: - Once PR tc39#1532 is merged, UTF16Decode can be made more precise using CodePointAt. - There are a few other places in the spec where UTF-16 decoding is apparently involved (e.g., EscapeRegExpPattern), but it's not clear to me how to use UTF16Decode there.
- Make use of CodePointAt elsewhere in the spec
5bd2e7a to 659fb6e