[Normative] Add RegExp Unicode property escapes #1041

Merged
merged 1 commit into from
Jan 25, 2018
156 changes: 152 additions & 4 deletions spec.html
@@ -23,6 +23,15 @@
#ecma-logo {
width: 500px;
}
.unicode-property-table {
table-layout: fixed;
width: 100%;
font-size: 80%;
}
.unicode-property-table ul {
padding-left: 0;
list-style: none;
}
</style>
<script>
if (location.hostname === 'tc39.github.io' && location.protocol !== 'https:') {
@@ -29204,8 +29213,51 @@ <h2>Syntax</h2>
DecimalEscape ::
NonZeroDigit DecimalDigits? [lookahead &lt;! DecimalDigit]

CharacterClassEscape :: one of
`d` `D` `s` `S` `w` `W`
CharacterClassEscape[U] ::
`d`
`D`
`s`
`S`
`w`
`W`
[+U] `p{` UnicodePropertyValueExpression `}`
[+U] `P{` UnicodePropertyValueExpression `}`

UnicodePropertyValueExpression ::
UnicodePropertyName `=` UnicodePropertyValue
LoneUnicodePropertyNameOrValue

UnicodePropertyNameCharacter ::
ControlLetter
`_`

UnicodePropertyNameCharacters ::
UnicodePropertyNameCharacter UnicodePropertyNameCharacters?

UnicodePropertyName ::
UnicodePropertyNameCharacters

UnicodePropertyValueCharacter ::
UnicodePropertyNameCharacter
`0`
`1`
`2`
`3`
`4`
`5`
`6`
`7`
`8`
`9`

UnicodePropertyValueCharacters ::
UnicodePropertyValueCharacter UnicodePropertyValueCharacters?

UnicodePropertyValue ::
UnicodePropertyValueCharacters

LoneUnicodePropertyNameOrValue ::
UnicodePropertyValueCharacters

CharacterClass[U] ::
`[` [lookahead &lt;! {`^`}] ClassRanges[?U] `]`
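Reviewer sketch, not part of the diff: the new productions only constrain which characters may appear between `p{` and `}`. Names allow ASCII letters (ControlLetter) and `_`; values additionally allow decimal digits. The helper name `parsePropertyExpression` and its return shape are assumptions made for illustration. It checks the grammar only, not the early errors that validate names against the tables.

```javascript
// UnicodePropertyNameCharacters: one or more of ControlLetter or `_`.
const NAME_CHARS = /^[A-Za-z_]+$/;
// UnicodePropertyValueCharacters: name characters plus the digits 0-9.
const VALUE_CHARS = /^[A-Za-z_0-9]+$/;

// Hypothetical helper: splits the text between `p{` and `}` according to
// UnicodePropertyValueExpression. Grammar check only, no table lookup.
function parsePropertyExpression(body) {
  const eq = body.indexOf('=');
  if (eq === -1) {
    // LoneUnicodePropertyNameOrValue :: UnicodePropertyValueCharacters
    if (!VALUE_CHARS.test(body)) {
      throw new SyntaxError(`Invalid property expression: ${body}`);
    }
    return { kind: 'lone', nameOrValue: body };
  }
  // UnicodePropertyName `=` UnicodePropertyValue
  const name = body.slice(0, eq);
  const value = body.slice(eq + 1);
  if (!NAME_CHARS.test(name) || !VALUE_CHARS.test(value)) {
    throw new SyntaxError(`Invalid property expression: ${body}`);
  }
  return { kind: 'pair', name, value };
}

console.log(parsePropertyExpression('Script=Greek'));
console.log(parsePropertyExpression('Lu'));
```

A body such as `Old Persian` is rejected at this stage because a space is neither a name character nor a value character.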
@@ -29300,6 +29352,21 @@ <h1>Static Semantics: Early Errors</h1>
It is a Syntax Error if SV(|RegExpUnicodeEscapeSequence|) is none of `"$"`, or `"_"`, or the UTF16Encoding of either &lt;ZWNJ&gt; or &lt;ZWJ&gt;, or the UTF16Encoding of a Unicode code point that would be matched by the |UnicodeIDContinue| lexical grammar production.
</li>
</ul>
<emu-grammar>UnicodePropertyValueExpression :: UnicodePropertyName `=` UnicodePropertyValue</emu-grammar>
<ul>
<li>
It is a Syntax Error if the List of Unicode code points that is SourceText of <emu-nt>UnicodePropertyName</emu-nt> is not identical to a List of Unicode code points that is a Unicode property name or property alias listed in the “Property name and aliases” column of <emu-xref href="#table-nonbinary-unicode-properties"></emu-xref>.
</li>
<li>
It is a Syntax Error if the List of Unicode code points that is SourceText of <emu-nt>UnicodePropertyValue</emu-nt> is not identical to a List of Unicode code points that is a value or value alias for the Unicode property or property alias given by SourceText of <emu-nt>UnicodePropertyName</emu-nt> listed in the “Property value and aliases” column of the corresponding tables <emu-xref href="#table-unicode-general-category-values"></emu-xref> or <emu-xref href="#table-unicode-script-values"></emu-xref>.
</li>
</ul>
<emu-grammar>UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue</emu-grammar>
<ul>
<li>
It is a Syntax Error if the List of Unicode code points that is SourceText of <emu-nt>LoneUnicodePropertyNameOrValue</emu-nt> is not identical to a List of Unicode code points that is a Unicode general category or general category alias listed in the “Property value and aliases” column of <emu-xref href="#table-unicode-general-category-values"></emu-xref>, nor a binary property or binary property alias listed in the “Property name and aliases” column of <emu-xref href="#table-binary-unicode-properties"></emu-xref>.
</li>
</ul>
</emu-clause>

<emu-clause id="sec-patterns-static-semantics-capturing-group-number">
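Reviewer sketch, not part of the diff: the early errors above should be observable from script as a `SyntaxError` when a pattern is compiled with the `u` flag. The snippet assumes an engine that already implements this proposal (ES2018 or later); the helper name `compilesWithUFlag` is hypothetical.

```javascript
// Returns true when `source` compiles as a Unicode-mode pattern,
// false when the engine reports an early SyntaxError.
function compilesWithUFlag(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// Listed names, aliases, and values are accepted.
const accepted = [
  '\\p{Script_Extensions=Greek}',
  '\\p{scx=Greek}',
  '\\p{Script=Xpeo}',
  '\\p{Lu}',          // lone General_Category value
  '\\p{Alphabetic}',  // lone binary property
];

// Matching is exact: wrong case, spaces, the `Is` prefix, and a lone
// Script value are all rejected.
const rejected = [
  '\\p{script_extensions=Greek}',
  '\\p{Scx=Greek}',
  '\\p{Script=xpeo}',
  '\\p{Script=Old Persian}',
  '\\p{Greek}',
  '\\p{IsLu}',
];

for (const source of [...accepted, ...rejected]) {
  console.log(source, compilesWithUFlag(source));
}
```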
@@ -29537,6 +29604,15 @@ <h1>Static Semantics: CharacterValue</h1>
1. Return the code point value of _ch_.
</emu-alg>
</emu-clause>

<emu-clause id="sec-static-semantics-sourcetext">
<h1>Static Semantics: SourceText</h1>
<emu-grammar>UnicodePropertyNameCharacters :: UnicodePropertyNameCharacter UnicodePropertyNameCharacters?</emu-grammar>
<emu-grammar>UnicodePropertyValueCharacters :: UnicodePropertyValueCharacter UnicodePropertyValueCharacters?</emu-grammar>
<emu-alg>
1. Return the List, in source text order, of Unicode code points in the source text matched by this production.
</emu-alg>
</emu-clause>
</emu-clause>

<!-- es6num="21.2.2" -->
@@ -30241,6 +30317,49 @@ <h1>Runtime Semantics: Canonicalize ( _ch_ )</h1>
<p>In case-insignificant matches when _Unicode_ is *true*, all characters are implicitly case-folded using the simple mapping provided by the Unicode standard immediately before they are compared. The simple mapping always maps to a single code point, so it does not map, for example, `"&szlig;"` (U+00DF) to `"SS"`. It may however map a code point outside the Basic Latin range to a character within, for example, `"&#x17f;"` (U+017F) to `"s"`. Such characters are not mapped if _Unicode_ is *false*. This prevents Unicode code points such as U+017F and U+212A from matching regular expressions such as `/[a-z]/i`, but they will match `/[a-z]/ui`.</p>
</emu-note>
</emu-clause>
<emu-clause id="sec-runtime-semantics-unicodematchproperty-p" aoid="UnicodeMatchProperty">
<h1>Runtime Semantics: UnicodeMatchProperty ( _p_ )</h1>
<p>The algorithm uses values from the following tables, which associate supported Unicode property names and property aliases with their canonical property names.</p>
<p>Implementations must support the following non-binary Unicode properties and their property aliases:</p>
<emu-import href="table-nonbinary-unicode-properties.html"></emu-import>
<p>Additionally, implementations must support the following binary Unicode properties and their property aliases:</p>
<emu-import href="table-binary-unicode-properties.html"></emu-import>
<p>The abstract operation UnicodeMatchProperty takes a parameter _p_ that is a List of Unicode code points and performs the following steps:</p>
<emu-alg>
1. Assert: _p_ is a List of Unicode code points that is identical to a List of Unicode code points that is a Unicode property name or property alias listed in the “Property name and aliases” column of <emu-xref href="#table-nonbinary-unicode-properties"></emu-xref> or <emu-xref href="#table-binary-unicode-properties"></emu-xref>.
1. Let _c_ be the canonical property name of _p_ as given in the “Canonical property name” column of the corresponding row.
1. Return the List of Unicode code points of _c_.
</emu-alg>
<p>To ensure interoperability, implementations must not extend Unicode property support to the remaining properties.</p>
<p>Implementations must only recognize the property aliases listed in <emu-xref href="#table-nonbinary-unicode-properties"></emu-xref> and <emu-xref href="#table-binary-unicode-properties"></emu-xref>.</p>
<p>Implementations must only recognize the property value aliases and canonical property value names listed in <emu-xref href="#table-unicode-general-category-values"></emu-xref> and <emu-xref href="#table-unicode-script-values"></emu-xref>.</p>
<emu-note>
<p>For example, `Script_Extensions` (property name) and `scx` (property alias) are valid, but `script_extensions` or `Scx` aren’t.</p>
</emu-note>
<emu-note>
<p>The listed properties form a superset of what <a href="https://unicode.org/reports/tr18/#RL1.2">UTS18 RL1.2</a> requires.</p>
</emu-note>
</emu-clause>
<emu-clause id="sec-runtime-semantics-unicodematchpropertyvalue-p-v" aoid="UnicodeMatchPropertyValue">
<h1>Runtime Semantics: UnicodeMatchPropertyValue ( _p_, _v_ )</h1>
<p>The algorithm uses values from the following tables, which associate canonical Unicode property names with their supported values and value aliases:</p>
<emu-import href="table-unicode-general-category-values.html"></emu-import>
<emu-import href="table-unicode-script-values.html"></emu-import>
<p>The abstract operation UnicodeMatchPropertyValue takes two parameters _p_ and _v_, each of which is a List of Unicode code points, and performs the following steps:</p>
<emu-alg>
1. Assert: _p_ is a List of Unicode code points that is identical to a List of Unicode code points that is a canonical, unaliased Unicode property name listed in the “Canonical property name” column of <emu-xref href="#table-nonbinary-unicode-properties"></emu-xref>.
1. Assert: _v_ is a List of Unicode code points that is identical to a List of Unicode code points that is a property value or property value alias for Unicode property _p_ listed in the “Property value and aliases” column of <emu-xref href="#table-unicode-general-category-values"></emu-xref> or <emu-xref href="#table-unicode-script-values"></emu-xref>.
1. Let _value_ be the canonical property value of _v_ as given in the “Canonical property value” column of the corresponding row.
1. Return the List of Unicode code points of _value_.
</emu-alg>
<p>Implementations must recognize only the canonical property values and property value aliases listed in <emu-xref href="#table-unicode-general-category-values"></emu-xref> and <emu-xref href="#table-unicode-script-values"></emu-xref>.</p>
<emu-note>
<p>For example, `Xpeo` and `Old_Persian` are valid `Script_Extensions` values, but `xpeo` and `Old Persian` aren’t.</p>
</emu-note>
<emu-note>
<p>This algorithm differs from <a href="https://unicode.org/reports/tr44/#Matching_Symbolic">the matching rules for symbolic values listed in UAX44</a>: case, <emu-xref href="#sec-white-space">white space</emu-xref>, U+002D (HYPHEN-MINUS), and U+005F (LOW LINE) are not ignored, and the `Is` prefix is not supported.</p>
</emu-note>
</emu-clause>
</emu-clause>

<!-- es6num="21.2.2.9" -->
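Reviewer sketch, not part of the diff: UnicodeMatchProperty and UnicodeMatchPropertyValue amount to exact, case-sensitive table lookups that map a name or alias to its canonical form. The imported table files are not shown in this diff, so the entries below are a small assumed excerpt, and the sketch uses strings where the spec uses Lists of code points. Where the spec asserts, the sketch throws.

```javascript
// Assumed excerpt of the property-name tables (alias -> canonical name).
const PROPERTY_ALIASES = new Map([
  ['General_Category', 'General_Category'], ['gc', 'General_Category'],
  ['Script', 'Script'], ['sc', 'Script'],
  ['Script_Extensions', 'Script_Extensions'], ['scx', 'Script_Extensions'],
]);

// Assumed excerpt of the property-value tables (alias -> canonical value).
const SCRIPT_VALUES = new Map([
  ['Old_Persian', 'Old_Persian'], ['Xpeo', 'Old_Persian'],
]);
const VALUE_ALIASES = new Map([
  ['General_Category', new Map([
    ['Uppercase_Letter', 'Uppercase_Letter'], ['Lu', 'Uppercase_Letter'],
  ])],
  ['Script', SCRIPT_VALUES],
  ['Script_Extensions', SCRIPT_VALUES], // shares the script values table
]);

// Sketch of UnicodeMatchProperty ( p ): exact lookup, no case folding.
function unicodeMatchProperty(p) {
  const canonical = PROPERTY_ALIASES.get(p);
  if (canonical === undefined) throw new SyntaxError(`Unknown property: ${p}`);
  return canonical;
}

// Sketch of UnicodeMatchPropertyValue ( p, v ): p must already be canonical.
function unicodeMatchPropertyValue(p, v) {
  const values = VALUE_ALIASES.get(p);
  const canonical = values === undefined ? undefined : values.get(v);
  if (canonical === undefined) throw new SyntaxError(`Unknown value: ${p}=${v}`);
  return canonical;
}

console.log(unicodeMatchProperty('scx'));
console.log(unicodeMatchPropertyValue('Script_Extensions', 'Xpeo'));
```

Because the lookup is exact, `Scx` and `xpeo` fail even though UAX44 loose matching would accept them.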
@@ -30334,6 +30453,23 @@ <h1>CharacterClassEscape</h1>
<emu-alg>
1. Return the set of all characters not included in the set returned by <emu-grammar>CharacterClassEscape :: `w`</emu-grammar> .
</emu-alg>
<p>The production <emu-grammar>CharacterClassEscape :: `p{` UnicodePropertyValueExpression `}`</emu-grammar> evaluates by returning the CharSet containing all Unicode code points included in the CharSet returned by <emu-nt>UnicodePropertyValueExpression</emu-nt>.</p>
<p>The production <emu-grammar>CharacterClassEscape :: `P{` UnicodePropertyValueExpression `}`</emu-grammar> evaluates by returning the CharSet containing all Unicode code points not included in the CharSet returned by <emu-nt>UnicodePropertyValueExpression</emu-nt>.</p>
<p>The production <emu-grammar>UnicodePropertyValueExpression :: UnicodePropertyName `=` UnicodePropertyValue</emu-grammar> evaluates as follows:</p>
<emu-alg>
1. Let _p_ be ! UnicodeMatchProperty(_UnicodePropertyName_).
1. Assert: _p_ is a Unicode property name or property alias listed in the “Property name and aliases” column of <emu-xref href="#table-nonbinary-unicode-properties"></emu-xref>.
1. Let _v_ be ! UnicodeMatchPropertyValue(_p_, _UnicodePropertyValue_).
1. Return the CharSet containing all Unicode code points whose character database definition includes the property _p_ with value _v_.
</emu-alg>
<p>The production <emu-grammar>UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue</emu-grammar> evaluates as follows:</p>
<emu-alg>
1. If ! UnicodeMatchPropertyValue(`"General_Category"`, _LoneUnicodePropertyNameOrValue_) is identical to a List of Unicode code points that is the name of a Unicode general category or general category alias listed in the “Property value and aliases” column of <emu-xref href="#table-unicode-general-category-values"></emu-xref>, then
1. Return the CharSet containing all Unicode code points whose character database definition includes the property `General_Category` with value _LoneUnicodePropertyNameOrValue_.
1. Let _p_ be ! UnicodeMatchProperty(_LoneUnicodePropertyNameOrValue_).
1. Assert: _p_ is a binary Unicode property or binary property alias listed in the “Property name and aliases” column of <emu-xref href="#table-binary-unicode-properties"></emu-xref>.
1. Return the CharSet containing all Unicode code points whose character database definition includes the property _p_ with value |True|.
</emu-alg>
</emu-clause>

<!-- es6num="21.2.2.13" -->
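Reviewer sketch, not part of the diff: the evaluation rules above can be checked from script. `\p{…}` yields the set of code points with the property, `\P{…}` yields its complement, and a lone name is tried first as a `General_Category` value and then as a binary property. The snippet assumes an engine that implements this proposal (ES2018 or later).

```javascript
// `\p{Lu}` and `\P{Lu}` are complements of each other.
const upper = /^\p{Lu}$/u;
const notUpper = /^\P{Lu}$/u;
console.log(upper.test('A'), notUpper.test('A'));
console.log(upper.test('a'), notUpper.test('a'));

// A lone General_Category value is shorthand for the explicit forms.
const explicitLong = /^\p{General_Category=Uppercase_Letter}$/u;
const explicitShort = /^\p{gc=Lu}$/u;
console.log(explicitLong.test('A'), explicitShort.test('A'));

// A lone binary property selects code points whose value is True.
const alphabetic = /^\p{Alphabetic}$/u;
console.log(alphabetic.test('\u00E9')); // U+00E9 LATIN SMALL LETTER E WITH ACUTE

// Non-binary properties other than General_Category need `name=value`.
const greek = /^\p{Script=Greek}+$/u;
console.log(greek.test('\u03B1\u03B2\u03B3')); // Greek alpha, beta, gamma
```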
@@ -40493,17 +40629,29 @@ <h1>Bibliography</h1>
<li>
<i>The Unicode Standard</i>, available at &lt;<a href="https://unicode.org/versions/latest">https://unicode.org/versions/latest</a>&gt;
</li>
<li>
<i>Unicode Technical Note #5: Canonical Equivalence in Applications</i>, available at &lt;<a href="https://unicode.org/notes/tn5/">https://unicode.org/notes/tn5/</a>&gt;
</li>
<li>
<i>Unicode Technical Standard #10: Unicode Collation Algorithm</i>, available at &lt;<a href="https://unicode.org/reports/tr10/">https://unicode.org/reports/tr10/</a>&gt;
</li>
<li>
<i>Unicode Standard Annex #15, Unicode Normalization Forms</i>, available at &lt;<a href="https://unicode.org/reports/tr15/">https://unicode.org/reports/tr15/</a>&gt;
</li>
<li>
<i>Unicode Technical Standard #18: Unicode Regular Expressions</i>, available at &lt;<a href="https://unicode.org/reports/tr18/">https://unicode.org/reports/tr18/</a>&gt;
</li>
<li>
<i>Unicode Standard Annex #24: Unicode `Script` Property</i>, available at &lt;<a href="https://unicode.org/reports/tr24/">https://unicode.org/reports/tr24/</a>&gt;
</li>
<li>
<i>Unicode Standard Annex #31, Unicode Identifiers and Pattern Syntax</i>, available at &lt;<a href="https://unicode.org/reports/tr31/">https://unicode.org/reports/tr31/</a>&gt;
</li>
<li>
<i>Unicode Standard Annex #44: Unicode Character Database</i>, available at &lt;<a href="https://unicode.org/reports/tr44/">https://unicode.org/reports/tr44/</a>&gt;
</li>
<li>
<i>Unicode Technical Standard #51: Unicode Emoji</i>, available at &lt;<a href="https://unicode.org/reports/tr51/">https://unicode.org/reports/tr51/</a>&gt;
</li>
<li>
<i>IANA Time Zone Database</i>, available at &lt;<a href="https://www.iana.org/time-zones">https://www.iana.org/time-zones</a>&gt;