Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(unicode) detect supplementary private use area A #2102

Merged
merged 5 commits into from
Mar 19, 2020
Merged

fix(unicode) detect supplementary private use area A #2102

merged 5 commits into from
Mar 19, 2020

Conversation

ahuth
Copy link
Contributor

@ahuth ahuth commented Mar 19, 2020

This PR adds unicode characters in Supplementary Private Use Area-A to the getUnicodeNonBmpRegExp regex.

Closes issue: #2101

Reviewer checks

Required fields, to be filled out by PR reviewer(s)

  • Follows the commit message policy, appropriate for next version
  • Code is reviewed for security

@ahuth ahuth requested a review from a team as a code owner March 19, 2020 01:23
@CLAassistant
Copy link

CLAassistant commented Mar 19, 2020

CLA assistant check
All committers have signed the CLA.

@@ -69,6 +69,13 @@ describe('text.hasUnicode', function() {
});
assert.isTrue(actual);
});

it('returns true for a srting with characters in the Supplementary Private Use Area-A', function() {
var actual = axe.commons.text.hasUnicode('\uDB80\uDC19', {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string here is the same as the unicode escape sequence '\u{F0019}'. However, ES5 doesn't support those, and the tests are parsed by eslint as ES5. Instead this is the corresponding surrogate pair.

*
* Note: plane '\u2000-\u206F' used for General punctuation is excluded as it is handled in -> getPunctuationRegExp
*/

return /[\u1D00-\u1D7F\u1D80-\u1DBF\u1DC0-\u1DFF\u20A0-\u20CF\u20D0-\u20FF\u2100-\u214F\u2150-\u218F\u2190-\u21FF\u2200-\u22FF\u2300-\u23FF\u2400-\u243F\u2440-\u245F\u2460-\u24FF\u2500-\u257F\u2580-\u259F\u25A0-\u25FF\u2600-\u26FF\u2700-\u27BF\uE000-\uF8FF]/g;
return /[\u1D00-\u1D7F\u1D80-\u1DBF\u1DC0-\u1DFF\u20A0-\u20CF\u20D0-\u20FF\u2100-\u214F\u2150-\u218F\u2190-\u21FF\u2200-\u22FF\u2300-\u23FF\u2400-\u243F\u2440-\u245F\u2460-\u24FF\u2500-\u257F\u2580-\u259F\u25A0-\u25FF\u2600-\u26FF\u2700-\u27BF\uE000-\uF8FF\u{F0000}-\u{FFFFD}]/gu;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note the u flag added to the regex. This enables using unicode escape sequences in the pattern.

One side-effect of this is that the build step actually compiles the regex to this in the final output:

function getUnicodeNonBmpRegExp() {
  return /(?:[\u1D00-\u1DFF\u20A0-\u27BF\uE000-\uF8FF]|[\uDB80-\uDBBE][\uDC00-\uDFFF]|\uDBBF[\uDC00-\uDFFD])/g;
}

The u flag is an ES6 feature, and I'm assuming that preset-env is compiling to a regex that will work in older ES versions.

Copy link
Contributor

@straker straker Mar 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add this range to the getSupplementaryPrivateUseRegExp function instead of the BMP function. I also think the text.hasUnicode should getSupplementaryPrivateUseRegExp for the nonBMP flag (currently it's missing and that's probably a bug since removeUnicode with nonBMP uses it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense 👍

Copy link
Contributor

@straker straker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pr! The changes are great but I think we just need to move where they happen. I'll take a look at why the test_examples is failing.

*
* Note: plane '\u2000-\u206F' used for General punctuation is excluded as it is handled in -> getPunctuationRegExp
*/

return /[\u1D00-\u1D7F\u1D80-\u1DBF\u1DC0-\u1DFF\u20A0-\u20CF\u20D0-\u20FF\u2100-\u214F\u2150-\u218F\u2190-\u21FF\u2200-\u22FF\u2300-\u23FF\u2400-\u243F\u2440-\u245F\u2460-\u24FF\u2500-\u257F\u2580-\u259F\u25A0-\u25FF\u2600-\u26FF\u2700-\u27BF\uE000-\uF8FF]/g;
return /[\u1D00-\u1D7F\u1D80-\u1DBF\u1DC0-\u1DFF\u20A0-\u20CF\u20D0-\u20FF\u2100-\u214F\u2150-\u218F\u2190-\u21FF\u2200-\u22FF\u2300-\u23FF\u2400-\u243F\u2440-\u245F\u2460-\u24FF\u2500-\u257F\u2580-\u259F\u25A0-\u25FF\u2600-\u26FF\u2700-\u27BF\uE000-\uF8FF\u{F0000}-\u{FFFFD}]/gu;
Copy link
Contributor

@straker straker Mar 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add this range to the getSupplementaryPrivateUseRegExp function instead of the BMP function. I also think the text.hasUnicode should getSupplementaryPrivateUseRegExp for the nonBMP flag (currently it's missing and that's probably a bug since removeUnicode with nonBMP uses it).

ahuth added 3 commits March 19, 2020 09:23
Resolves #2101.

Characters in this range are not in the Basic Multilingual Plane [1].

1. https://en.wikipedia.org/wiki/Private_Use_Areas
The tests are parsed with eslint as ES5, which does not have unicode escape sequences. Therefore, "\u{F0019}" is not a valid string. Instead, we will have to use its corresponding surrogate pair, which is "\uDB80\uDC19".
…ementaryPrivateUseRegExp`

Instead of `getUnicodeNonBmpRegExp`. For this to work, I had to use `getSupplementaryPrivateUseRegExp` in `hasUnicode`. This seems appropriate, and its absence may have been a bug.
@ahuth
Copy link
Contributor Author

ahuth commented Mar 19, 2020

Made the changes, and the build looks green @straker .

I noticed yesterday that test_examples was green on one build, and then red on the next (where I only changed a comment).

Anyway, let me know if there's anything else we need to do here. Thanks

Copy link
Contributor

@straker straker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome. Looking much better. One final thing I can see

*/
return /[\uDB80-\uDBBF][\uDC00-\uDFFD]/g;
return /[\uDB80-\uDBBF][\uDC00-\uDFFD]|[\u{F0000}-\u{FFFFD}]/gu;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the u flag might pose a problem in some situations in IE11 as it's not supported. We've had quite the issue with these regexs before in IE11, so probably best to change this to use the surrogate pair codes as the [\uDB80-\uDBBF][\uDC00-\uDFFD] is doing.

Copy link
Contributor Author

@ahuth ahuth Mar 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Babel's preset-env I believe is compiling this to a regex without the u flag, and we should be good with regards to IE11 support.

The compiled axe.js has:

// The compiled axe.js
function getSupplementaryPrivateUseRegExp() {
    return /(?:[\uDB80-\uDBBF](?![\uDC00-\uDFFF]))(?:(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFD])|(?:[\uDB80-\uDBBE][\uDC00-\uDFFF]|\uDBBF[\uDC00-\uDFFD])/g;
}

That's obviously a little bit gnarlier than it was before (especially with the added non-capturing groups).

Manually writing the surrogate pair I think would give us something like:

// unicode.js
function getSupplementaryPrivateUseRegExp() {
    return /[\uDB80-\uDBBF][\uDC00-\uDFFD]|(?:[\uDB80-\uDBBE][\uDC00-\uDFFF]|\uDBBF[\uDC00-\uDFFD])/g;
}

Because the regex gets compiled to a regex compatible with ES5, it may be preferable to keep the unicode escape sequences in the regex, since it's easier to see what the actual range is.

Do the windows tests run in IE11, or on Edge?

Or, if you still think it's better manually writing the surrogate pair (or if I'm missing something), I'll change it. Let me know! Thanks

Copy link
Contributor Author

@ahuth ahuth Mar 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'll just take your suggestion and update it to

/[\uDB80-\uDBBF][\uDC00-\uDFFD]|(?:[\udb80-\udbbe][\udc00-\udfff]|\udbbf[\udc00-\udffe])/g

to be safe. The u flag does get compiled to a compatible regex, but with the surrogate pairs there's no ambiguity as to what's happening.

ahuth added 2 commits March 19, 2020 11:54
…ogate pairs

This will be more compatible. IE11 and ES5 environments do not support the `u` flag on regexes.
// 3. Supplementary private use area A (https://www.unicode.org/charts/PDF/UF0000.pdf)
//
// 1 2 3
// ┏━━━━━━┻━━━━━━┓┏━━━━━━┻━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
Copy link
Contributor Author

@ahuth ahuth Mar 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got a little clever documenting which parts of the regex refer to which unicode sequence. Say the word and I'll remove this.

Copy link
Contributor

@straker straker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'm a big fan of fancy comments

@straker straker changed the title Detect more unicode characters that are non-BMP fix(unicode) detect supplementary private use area A Mar 19, 2020
@straker straker merged commit f1739c2 into dequelabs:develop Mar 19, 2020
@ahuth ahuth deleted the unicode-private-use-area-A branch March 19, 2020 20:35
@ahuth
Copy link
Contributor Author

ahuth commented Mar 19, 2020

Thanks!

straker pushed a commit that referenced this pull request Mar 31, 2020
I originally added an additional range to the regex in #2102 while working on #2101. The purpose of it was to match unicode characters in Supplementary Private Use Area A, and I got the regex itself from https://apps.timwhitlock.info/js/regex.

However, upon further review, it's clear that - for the most part - the added regex is the same as what was previously there.

The existing part was:

  [\uDB80-\uDBBF][\uDC00-\uDFFD]

The part I added (with an "or") was:

  [\uDB80-\uDBBE][\uDC00-\uDFFF]|\uDBBF[\uDC00-\uDFFD]

Note the overlap in the ranges. In fact, as far as I can tell, they're equivalent, except the previous part excluded anything where the low surrogate was DFFE or DFFF.

In this commit I've removed the unnecessary added portion after the "or", and fixed the original regex so that it correctly captures all characters in the range.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants