-
Notifications
You must be signed in to change notification settings - Fork 795
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(unicode) detect supplementary private use area A #2102
Conversation
@@ -69,6 +69,13 @@ describe('text.hasUnicode', function() { | |||
}); | |||
assert.isTrue(actual); | |||
}); | |||
|
|||
it('returns true for a srting with characters in the Supplementary Private Use Area-A', function() { | |||
var actual = axe.commons.text.hasUnicode('\uDB80\uDC19', { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The string here is the same as the unicode escape sequence '\u{F0019}'
. However, ES5 doesn't support those, and the tests are parsed by eslint as ES5. Instead this is the corresponding surrogate pair.
lib/commons/text/unicode.js
Outdated
* | ||
* Note: plane '\u2000-\u206F' used for General punctuation is excluded as it is handled in -> getPunctuationRegExp | ||
*/ | ||
|
||
return /[\u1D00-\u1D7F\u1D80-\u1DBF\u1DC0-\u1DFF\u20A0-\u20CF\u20D0-\u20FF\u2100-\u214F\u2150-\u218F\u2190-\u21FF\u2200-\u22FF\u2300-\u23FF\u2400-\u243F\u2440-\u245F\u2460-\u24FF\u2500-\u257F\u2580-\u259F\u25A0-\u25FF\u2600-\u26FF\u2700-\u27BF\uE000-\uF8FF]/g; | ||
return /[\u1D00-\u1D7F\u1D80-\u1DBF\u1DC0-\u1DFF\u20A0-\u20CF\u20D0-\u20FF\u2100-\u214F\u2150-\u218F\u2190-\u21FF\u2200-\u22FF\u2300-\u23FF\u2400-\u243F\u2440-\u245F\u2460-\u24FF\u2500-\u257F\u2580-\u259F\u25A0-\u25FF\u2600-\u26FF\u2700-\u27BF\uE000-\uF8FF\u{F0000}-\u{FFFFD}]/gu; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note the u
flag added to the regex. This enables using unicode escape sequences in the pattern.
One side-effect of this is that the build step actually compiles the regex to this in the final output:
function getUnicodeNonBmpRegExp() {
return /(?:[\u1D00-\u1DFF\u20A0-\u27BF\uE000-\uF8FF]|[\uDB80-\uDBBE][\uDC00-\uDFFF]|\uDBBF[\uDC00-\uDFFD])/g;
}
The u
flag is an ES6 feature, and I'm assuming that preset-env is compiling to a regex that will work in older ES versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add this range to the getSupplementaryPrivateUseRegExp
function instead of the BMP function. I also think the text.hasUnicode
should getSupplementaryPrivateUseRegExp
for the nonBMP
flag (currently it's missing and that's probably a bug since removeUnicode
with nonBMP
uses it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pr! The changes are great but I think we just need to move where they happen. I'll take a look at why the test_examples is failing.
lib/commons/text/unicode.js
Outdated
* | ||
* Note: plane '\u2000-\u206F' used for General punctuation is excluded as it is handled in -> getPunctuationRegExp | ||
*/ | ||
|
||
return /[\u1D00-\u1D7F\u1D80-\u1DBF\u1DC0-\u1DFF\u20A0-\u20CF\u20D0-\u20FF\u2100-\u214F\u2150-\u218F\u2190-\u21FF\u2200-\u22FF\u2300-\u23FF\u2400-\u243F\u2440-\u245F\u2460-\u24FF\u2500-\u257F\u2580-\u259F\u25A0-\u25FF\u2600-\u26FF\u2700-\u27BF\uE000-\uF8FF]/g; | ||
return /[\u1D00-\u1D7F\u1D80-\u1DBF\u1DC0-\u1DFF\u20A0-\u20CF\u20D0-\u20FF\u2100-\u214F\u2150-\u218F\u2190-\u21FF\u2200-\u22FF\u2300-\u23FF\u2400-\u243F\u2440-\u245F\u2460-\u24FF\u2500-\u257F\u2580-\u259F\u25A0-\u25FF\u2600-\u26FF\u2700-\u27BF\uE000-\uF8FF\u{F0000}-\u{FFFFD}]/gu; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add this range to the getSupplementaryPrivateUseRegExp
function instead of the BMP function. I also think the text.hasUnicode
should getSupplementaryPrivateUseRegExp
for the nonBMP
flag (currently it's missing and that's probably a bug since removeUnicode
with nonBMP
uses it).
Resolves #2101. Characters in this range are not in the Basic Multilingual Plane [1]. 1. https://en.wikipedia.org/wiki/Private_Use_Areas
The tests are parsed with eslint as ES5, which does not have unicode escape sequences. Therefore, "\u{F0019}" is not a valid string. Instead, we will have to use its corresponding surrogate pair, which is "\uDB80\uDC19".
…ementaryPrivateUseRegExp` Instead of `getUnicodeNonBmpRegExp`. For this to work, I had to use `getSupplementaryPrivateUseRegExp` in `hasUnicode`. This seems appropriate, and its absence may have been a bug.
Made the changes, and the build looks green @straker . I noticed yesterday that test_examples was green on one build, and then red on the next (where I only changed a comment). Anyway, let me know if there's anything else we need to do here. Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome. Looking much better. One final thing I can see
lib/commons/text/unicode.js
Outdated
*/ | ||
return /[\uDB80-\uDBBF][\uDC00-\uDFFD]/g; | ||
return /[\uDB80-\uDBBF][\uDC00-\uDFFD]|[\u{F0000}-\u{FFFFD}]/gu; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the u
flag might pose a problem in some situations in IE11 as it's not supported. We've had quite the issue with these regexs before in IE11, so probably best to change this to use the surrogate pair codes as the [\uDB80-\uDBBF][\uDC00-\uDFFD]
is doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Babel's preset-env I believe is compiling this to a regex without the u
flag, and we should be good with regards to IE11 support.
The compiled axe.js has:
// The compiled axe.js
function getSupplementaryPrivateUseRegExp() {
return /(?:[\uDB80-\uDBBF](?![\uDC00-\uDFFF]))(?:(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFD])|(?:[\uDB80-\uDBBE][\uDC00-\uDFFF]|\uDBBF[\uDC00-\uDFFD])/g;
}
That's obviously a little bit gnarlier than it was before (especially with the added non-capturing groups).
Manually writing the surrogate pair I think would give us something like:
// unicode.js
function getSupplementaryPrivateUseRegExp() {
return /[\uDB80-\uDBBF][\uDC00-\uDFFD]|(?:[\uDB80-\uDBBE][\uDC00-\uDFFF]|\uDBBF[\uDC00-\uDFFD])/g;
}
Because the regex gets compiled to a regex compatible with ES5, it may be preferable to keep the unicode escape sequences in the regex, since it's easier to see what the actual range is.
Do the windows tests run in IE11, or on Edge?
Or, if you still think it's better manually writing the surrogate pair (or if I'm missing something), I'll change it. Let me know! Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I'll just take your suggestion and update it to
/[\uDB80-\uDBBF][\uDC00-\uDFFD]|(?:[\udb80-\udbbe][\udc00-\udfff]|\udbbf[\udc00-\udffe])/g
to be safe. The u
flag does get compiled to a compatible regex, but with the surrogate pairs there's no ambiguity as to what's happening.
…ogate pairs This will be more compatible. IE11 and ES5 environments do not support the `u` flag on regexes.
…refer to which unicode range
// 3. Supplementary private use area A (https://www.unicode.org/charts/PDF/UF0000.pdf) | ||
// | ||
// 1 2 3 | ||
// ┏━━━━━━┻━━━━━━┓┏━━━━━━┻━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got a little clever documenting which parts of the regex refer to which unicode sequence. Say the word and I'll remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I'm a big fan of fancy comments
Thanks! |
I originally added an additional range to the regex in #2102 while working on #2101. The purpose of it was to match unicode characters in Supplementary Private Use Area A, and I got the regex itself from https://apps.timwhitlock.info/js/regex. However, upon further review, it's clear that - for the most part - the added regex is the same as what was previously there. The existing part was: [\uDB80-\uDBBF][\uDC00-\uDFFD] The part I added (with an "or") was: [\uDB80-\uDBBE][\uDC00-\uDFFF]|\uDBBF[\uDC00-\uDFFD] Note the overlap in the ranges. In fact, as far as I can tell, they're equivalent, except the previous part excluded anything where the low surrogate was DFFE or DFFF. In this commit I've removed the unnecessary added portion after the "or", and fixed the original regex so that it correctly captures all characters in the range.
This PR adds unicode characters in Supplementary Private Use Area-A to the
getUnicodeNonBmpRegExp
regex.Closes issue: #2101
Reviewer checks
Required fields, to be filled out by PR reviewer(s)