fix(unicode) detect supplementary private use area A #2102

ahuth · 2020-03-19T01:23:03Z

This PR adds unicode characters in Supplementary Private Use Area-A to the getUnicodeNonBmpRegExp regex.

Closes issue: #2101

Reviewer checks

Required fields, to be filled out by PR reviewer(s)

Follows the commit message policy, appropriate for next version
Code is reviewed for security

CLAassistant · 2020-03-19T01:23:08Z

All committers have signed the CLA.

ahuth · 2020-03-19T05:25:20Z

test/commons/text/unicode.js

@@ -69,6 +69,13 @@ describe('text.hasUnicode', function() {
 			});
 			assert.isTrue(actual);
 		});
+
+		it('returns true for a srting with characters in the Supplementary Private Use Area-A', function() {
+			var actual = axe.commons.text.hasUnicode('\uDB80\uDC19', {


The string here is the same as the unicode escape sequence '\u{F0019}'. However, ES5 doesn't support those, and the tests are parsed by eslint as ES5. Instead this is the corresponding surrogate pair.

ahuth · 2020-03-19T05:46:24Z

lib/commons/text/unicode.js

 	 *
 	 * Note: plane '\u2000-\u206F' used for General punctuation is excluded as it is handled in -> getPunctuationRegExp
 	 */

-	return /[\u1D00-\u1D7F\u1D80-\u1DBF\u1DC0-\u1DFF\u20A0-\u20CF\u20D0-\u20FF\u2100-\u214F\u2150-\u218F\u2190-\u21FF\u2200-\u22FF\u2300-\u23FF\u2400-\u243F\u2440-\u245F\u2460-\u24FF\u2500-\u257F\u2580-\u259F\u25A0-\u25FF\u2600-\u26FF\u2700-\u27BF\uE000-\uF8FF]/g;
+	return /[\u1D00-\u1D7F\u1D80-\u1DBF\u1DC0-\u1DFF\u20A0-\u20CF\u20D0-\u20FF\u2100-\u214F\u2150-\u218F\u2190-\u21FF\u2200-\u22FF\u2300-\u23FF\u2400-\u243F\u2440-\u245F\u2460-\u24FF\u2500-\u257F\u2580-\u259F\u25A0-\u25FF\u2600-\u26FF\u2700-\u27BF\uE000-\uF8FF\u{F0000}-\u{FFFFD}]/gu;


Note the u flag added to the regex. This enables using unicode escape sequences in the pattern.

One side-effect of this is that the build step actually compiles the regex to this in the final output:

function getUnicodeNonBmpRegExp() { return /(?:[\u1D00-\u1DFF\u20A0-\u27BF\uE000-\uF8FF]|[\uDB80-\uDBBE][\uDC00-\uDFFF]|\uDBBF[\uDC00-\uDFFD])/g; }

The u flag is an ES6 feature, and I'm assuming that preset-env is compiling to a regex that will work in older ES versions.

I think we should add this range to the getSupplementaryPrivateUseRegExp function instead of the BMP function. I also think the text.hasUnicode should getSupplementaryPrivateUseRegExp for the nonBMP flag (currently it's missing and that's probably a bug since removeUnicode with nonBMP uses it).

Makes sense 👍

straker

Thanks for the pr! The changes are great but I think we just need to move where they happen. I'll take a look at why the test_examples is failing.

straker · 2020-03-19T14:40:51Z

lib/commons/text/unicode.js

 	 *
 	 * Note: plane '\u2000-\u206F' used for General punctuation is excluded as it is handled in -> getPunctuationRegExp
 	 */

-	return /[\u1D00-\u1D7F\u1D80-\u1DBF\u1DC0-\u1DFF\u20A0-\u20CF\u20D0-\u20FF\u2100-\u214F\u2150-\u218F\u2190-\u21FF\u2200-\u22FF\u2300-\u23FF\u2400-\u243F\u2440-\u245F\u2460-\u24FF\u2500-\u257F\u2580-\u259F\u25A0-\u25FF\u2600-\u26FF\u2700-\u27BF\uE000-\uF8FF]/g;
+	return /[\u1D00-\u1D7F\u1D80-\u1DBF\u1DC0-\u1DFF\u20A0-\u20CF\u20D0-\u20FF\u2100-\u214F\u2150-\u218F\u2190-\u21FF\u2200-\u22FF\u2300-\u23FF\u2400-\u243F\u2440-\u245F\u2460-\u24FF\u2500-\u257F\u2580-\u259F\u25A0-\u25FF\u2600-\u26FF\u2700-\u27BF\uE000-\uF8FF\u{F0000}-\u{FFFFD}]/gu;


I think we should add this range to the getSupplementaryPrivateUseRegExp function instead of the BMP function. I also think the text.hasUnicode should getSupplementaryPrivateUseRegExp for the nonBMP flag (currently it's missing and that's probably a bug since removeUnicode with nonBMP uses it).

Resolves #2101. Characters in this range are not in the Basic Multilingual Plane [1]. 1. https://en.wikipedia.org/wiki/Private_Use_Areas

The tests are parsed with eslint as ES5, which does not have unicode escape sequences. Therefore, "\u{F0019}" is not a valid string. Instead, we will have to use its corresponding surrogate pair, which is "\uDB80\uDC19".

…ementaryPrivateUseRegExp` Instead of `getUnicodeNonBmpRegExp`. For this to work, I had to use `getSupplementaryPrivateUseRegExp` in `hasUnicode`. This seems appropriate, and its absence may have been a bug.

ahuth · 2020-03-19T16:45:27Z

Made the changes, and the build looks green @straker .

I noticed yesterday that test_examples was green on one build, and then red on the next (where I only changed a comment).

Anyway, let me know if there's anything else we need to do here. Thanks

straker

Awesome. Looking much better. One final thing I can see

straker · 2020-03-19T17:26:19Z

lib/commons/text/unicode.js

 	 */
-	return /[\uDB80-\uDBBF][\uDC00-\uDFFD]/g;
+	return /[\uDB80-\uDBBF][\uDC00-\uDFFD]|[\u{F0000}-\u{FFFFD}]/gu;


So the u flag might pose a problem in some situations in IE11 as it's not supported. We've had quite the issue with these regexs before in IE11, so probably best to change this to use the surrogate pair codes as the [\uDB80-\uDBBF][\uDC00-\uDFFD] is doing.

Babel's preset-env I believe is compiling this to a regex without the u flag, and we should be good with regards to IE11 support.

The compiled axe.js has:

// The compiled axe.js function getSupplementaryPrivateUseRegExp() { return /(?:[\uDB80-\uDBBF](?![\uDC00-\uDFFF]))(?:(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFD])|(?:[\uDB80-\uDBBE][\uDC00-\uDFFF]|\uDBBF[\uDC00-\uDFFD])/g; }

That's obviously a little bit gnarlier than it was before (especially with the added non-capturing groups).

Manually writing the surrogate pair I think would give us something like:

// unicode.js function getSupplementaryPrivateUseRegExp() { return /[\uDB80-\uDBBF][\uDC00-\uDFFD]|(?:[\uDB80-\uDBBE][\uDC00-\uDFFF]|\uDBBF[\uDC00-\uDFFD])/g; }

Because the regex gets compiled to a regex compatible with ES5, it may be preferable to keep the unicode escape sequences in the regex, since it's easier to see what the actual range is.

Do the windows tests run in IE11, or on Edge?

Or, if you still think it's better manually writing the surrogate pair (or if I'm missing something), I'll change it. Let me know! Thanks

Actually, I'll just take your suggestion and update it to

/[\uDB80-\uDBBF][\uDC00-\uDFFD]|(?:[\udb80-\udbbe][\udc00-\udfff]|\udbbf[\udc00-\udffe])/g

to be safe. The u flag does get compiled to a compatible regex, but with the surrogate pairs there's no ambiguity as to what's happening.

…ogate pairs This will be more compatible. IE11 and ES5 environments do not support the `u` flag on regexes.

…refer to which unicode range

ahuth · 2020-03-19T19:00:26Z

lib/commons/text/unicode.js

+	// 3. Supplementary private use area A (https://www.unicode.org/charts/PDF/UF0000.pdf)
+	//
+	//             1              2                                  3
+	//      ┏━━━━━━┻━━━━━━┓┏━━━━━━┻━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓


I got a little clever documenting which parts of the regex refer to which unicode sequence. Say the word and I'll remove this.

straker

Thanks. I'm a big fan of fancy comments

ahuth · 2020-03-19T20:35:30Z

Thanks!

I originally added an additional range to the regex in #2102 while working on #2101. The purpose of it was to match unicode characters in Supplementary Private Use Area A, and I got the regex itself from https://apps.timwhitlock.info/js/regex. However, upon further review, it's clear that - for the most part - the added regex is the same as what was previously there. The existing part was: [\uDB80-\uDBBF][\uDC00-\uDFFD] The part I added (with an "or") was: [\uDB80-\uDBBE][\uDC00-\uDFFF]|\uDBBF[\uDC00-\uDFFD] Note the overlap in the ranges. In fact, as far as I can tell, they're equivalent, except the previous part excluded anything where the low surrogate was DFFE or DFFF. In this commit I've removed the unnecessary added portion after the "or", and fixed the original regex so that it correctly captures all characters in the range.

ahuth requested a review from a team as a code owner March 19, 2020 01:23

ahuth mentioned this pull request Mar 19, 2020

Color contrast checking of unicode private use characters #2101

Closed

ahuth commented Mar 19, 2020

View reviewed changes

straker requested changes Mar 19, 2020

View reviewed changes

ahuth added 3 commits March 19, 2020 09:23

Detect unicode in the Supplementary Private Use Area-A

647463f

Resolves #2101. Characters in this range are not in the Basic Multilingual Plane [1]. 1. https://en.wikipedia.org/wiki/Private_Use_Areas

Use surrogate pair representation for unicode sequence

20c5df8

The tests are parsed with eslint as ES5, which does not have unicode escape sequences. Therefore, "\u{F0019}" is not a valid string. Instead, we will have to use its corresponding surrogate pair, which is "\uDB80\uDC19".

Move detection of certain private use unicode characters to `getSuppl…

30aef40

…ementaryPrivateUseRegExp` Instead of `getUnicodeNonBmpRegExp`. For this to work, I had to use `getSupplementaryPrivateUseRegExp` in `hasUnicode`. This seems appropriate, and its absence may have been a bug.

straker requested changes Mar 19, 2020

View reviewed changes

ahuth added 2 commits March 19, 2020 11:54

Replace unicode escape sequences in regex with the corresponding surr…

0e3777a

…ogate pairs This will be more compatible. IE11 and ES5 environments do not support the `u` flag on regexes.

Document which parts of the getSupplementaryPrivateUseRegExp regex …

6a90379

…refer to which unicode range

ahuth commented Mar 19, 2020

View reviewed changes

straker approved these changes Mar 19, 2020

View reviewed changes

straker changed the title ~~Detect more unicode characters that are non-BMP~~ fix(unicode) detect supplementary private use area A Mar 19, 2020

straker merged commit f1739c2 into dequelabs:develop Mar 19, 2020

ahuth deleted the unicode-private-use-area-A branch March 19, 2020 20:35

ahuth mentioned this pull request Mar 22, 2020

chore: simplify the regex getting supplementary private use characters #2106

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(unicode) detect supplementary private use area A #2102

fix(unicode) detect supplementary private use area A #2102

ahuth commented Mar 19, 2020 •

edited by straker

Loading

CLAassistant commented Mar 19, 2020 •

edited

Loading

ahuth Mar 19, 2020

ahuth Mar 19, 2020

straker Mar 19, 2020 •

edited

Loading

ahuth Mar 19, 2020

straker left a comment

straker Mar 19, 2020 •

edited

Loading

ahuth commented Mar 19, 2020 •

edited

Loading

straker left a comment

straker Mar 19, 2020

ahuth Mar 19, 2020 •

edited

Loading

ahuth Mar 19, 2020 •

edited

Loading

ahuth Mar 19, 2020 •

edited

Loading

straker left a comment

ahuth commented Mar 19, 2020

fix(unicode) detect supplementary private use area A #2102

fix(unicode) detect supplementary private use area A #2102

Conversation

ahuth commented Mar 19, 2020 • edited by straker Loading

Reviewer checks

CLAassistant commented Mar 19, 2020 • edited Loading

ahuth Mar 19, 2020

Choose a reason for hiding this comment

ahuth Mar 19, 2020

Choose a reason for hiding this comment

straker Mar 19, 2020 • edited Loading

Choose a reason for hiding this comment

ahuth Mar 19, 2020

Choose a reason for hiding this comment

straker left a comment

Choose a reason for hiding this comment

straker Mar 19, 2020 • edited Loading

Choose a reason for hiding this comment

ahuth commented Mar 19, 2020 • edited Loading

straker left a comment

Choose a reason for hiding this comment

straker Mar 19, 2020

Choose a reason for hiding this comment

ahuth Mar 19, 2020 • edited Loading

Choose a reason for hiding this comment

ahuth Mar 19, 2020 • edited Loading

Choose a reason for hiding this comment

ahuth Mar 19, 2020 • edited Loading

Choose a reason for hiding this comment

straker left a comment

Choose a reason for hiding this comment

ahuth commented Mar 19, 2020

ahuth commented Mar 19, 2020 •

edited by straker

Loading

CLAassistant commented Mar 19, 2020 •

edited

Loading

straker Mar 19, 2020 •

edited

Loading

straker Mar 19, 2020 •

edited

Loading

ahuth commented Mar 19, 2020 •

edited

Loading

ahuth Mar 19, 2020 •

edited

Loading

ahuth Mar 19, 2020 •

edited

Loading

ahuth Mar 19, 2020 •

edited

Loading