"0x10000 does not fit in UTF-16" #25

srl295 · 2025-01-02T20:47:38Z

Hi, thanks for this tool.

Trying to implement NMTOKEN - https://www.w3.org/TR/REC-xml/#d0e804 which includes a range "x10000-#xEFFFF" or in abnf %x10000-EFFFF and got the error:

Error: 0x10000 does not fit in UTF-16
    at Range.escape (file:///Users/srl295/.npm/_npx/f593a35889f0a96c/node_modules/abnf/lib/ast.js:329:15)
    at Range.toFormat (file:///Users/srl295/.npm/_npx/f593a35889f0a96c/node_modules/abnf/lib/ast.js:340:25)
    at fromArray (file:///Users/srl295/.npm/_npx/f593a35889f0a96c/node_modules/abnf/lib/ast.js:60:12)
    at file:///Users/srl295/.npm/_npx/f593a35889f0a96c/node_modules/abnf/lib/ast.js:58:23
    at Array.map (<anonymous>)
    at fromArray (file:///Users/srl295/.npm/_npx/f593a35889f0a96c/node_modules/abnf/lib/ast.js:58:14)
    at Alternation.toFormat (file:///Users/srl295/.npm/_npx/f593a35889f0a96c/node_modules/abnf/lib/ast.js:189:14)
    at fromArray (file:///Users/srl295/.npm/_npx/f593a35889f0a96c/node_modules/abnf/lib/ast.js:60:12)
    at Rule.toFormat (file:///Users/srl295/.npm/_npx/f593a35889f0a96c/node_modules/abnf/lib/ast.js:369:43)
    at Rules.toFormat (file:///Users/srl295/.npm/_npx/f593a35889f0a96c/node_modules/abnf/lib/ast.js:440:21)

Of course U+10000 does fit in UTF-16 as two code units. Anyway, looking at the code perhaps this is a Peggy limitation?

Edit: yes, of course, https://peggyjs.org/documentation.html#locations and peggyjs/peggy#15 already discuss this, Peggy is BMP only.

The text was updated successfully, but these errors were encountered:

hildjj · 2025-01-02T20:51:42Z

First, see https://github.com/peggyjs/peggy/blob/main/examples/xml.peggy for ideas; it was coded by hand rather than using node-abnf, if I recall correctly.

In there, you'll see this:

Char
  = '\x09'
  / '\x0A'
  / '\x0D'
  / [\x20-\uD7FF]
  / [\uE000-\uFFFD]
  // / [\x10000-\x10FFFF] We're parsing a UTF16 code unit at a time. :(
  / $([\ud800-\udbff] [\udc00-\udfff])

which is he peggy bits you need. I thought that node-abnf would generate that for you automatically. Can you attach a pared-down input that shows the problem, please?

srl295 · 2025-01-02T20:59:27Z

First, see https://github.com/peggyjs/peggy/blob/main/examples/xml.peggy for ideas; it was coded by hand rather than using node-abnf, if I recall correctly.

In there, you'll see this:
Char
  = '\x09'
  / '\x0A'
  / '\x0D'
  / [\x20-\uD7FF]
  / [\uE000-\uFFFD]
  // / [\x10000-\x10FFFF] We're parsing a UTF16 code unit at a time. :(
  / $([\ud800-\udbff] [\udc00-\udfff])

That's basically the problem. I'm not trying to include ALL code points above FFFF.

I think the special case code is only if you are doing FFFF-10FFFF exactly.

which is he peggy bits you need. I thought that node-abnf would generate that for you automatically. Can you attach a pared-down input that shows the problem, please?

Sure,

NMCHARS = NMCHAR *(NMCHAR)
NMCHAR = %x10000-EFFFE

Interestingly, %x10000-10FFFF doesn't work (triggers the error),

however %xFFFF-10FFFF does 'work' (generates the range as you noted. However U+FFFF is a non-character.

hildjj · 2025-01-02T21:01:24Z

willfix.

srl295 · 2025-01-02T21:02:25Z

willfix.

thanks. I'm replying to the peggy discussion too.

- skip non-BMP rules for testing - see hildjj/node-abnf#25

This should correctly generate UTF-16 ranges for Peggy from any valid ABNF range. Fixes #25.

hildjj · 2025-01-03T09:37:46Z

NMCHARS = NMCHAR *(NMCHAR)
NMCHAR = %x10000-EFFFE

with #26, this will generate:

NMCHARS
  = NMCHAR (NMCHAR)*

NMCHAR
  = "\ud800" [\udc00-\udfff]
  / [\ud801-\udb7e] [\udc00-\udfff]
  / "\udb7f" [\udc00-\udffe]

Which an exhaustive test of 0-0x10ffff makes me believe is correct. Does it look right to you?

- add keyboard abnf and sample files and automated tests Temporarily skip non-BMP chars, see hildjj/node-abnf#25 which is being fixed.

srl295 · 2025-01-03T22:07:37Z

@hildjj (btw thanks for looking into all of this!)

  = "\ud800" [\udc00-\udfff]
  / [\ud801-\udb7e] [\udc00-\udfff]

minor: seems like these two could be one rule, right?

Otherwise yeah, this looks good!

hildjj · 2025-01-04T00:37:09Z

They could be combined into one rule in this case, but not without even more edge-case logic than is already in the current patch. I'm going to land what I have, and will accept a PR from anyone who wants to fiddle with it to optimize further.

hildjj · 2025-01-04T20:58:18Z

I've left a comment in ast.js:398 that will give the necessary information to anyone that wants to fix that last remaining nit.

srl295 added a commit to srl295/cldr that referenced this issue Jan 2, 2025

CLDR-18197 kbd: work around node-abnf utf-16 issue

8d77cbf

- skip non-BMP rules for testing - see hildjj/node-abnf#25

hildjj added a commit that referenced this issue Jan 3, 2025

Correctly handle unicode ranges

62b9efb

This should correctly generate UTF-16 ranges for Peggy from any valid ABNF range. Fixes #25.

hildjj mentioned this issue Jan 3, 2025

Correctly handle unicode ranges #26

Merged

hildjj added a commit that referenced this issue Jan 3, 2025

Correctly handle unicode ranges

e8d03bc

This should correctly generate UTF-16 ranges for Peggy from any valid ABNF range. Fixes #25.

srl295 added a commit to srl295/cldr that referenced this issue Jan 3, 2025

CLDR-16836 kbd: add EBNF to spec, ABNF to data for transform grammar

d38f212

- add keyboard abnf and sample files and automated tests Temporarily skip non-BMP chars, see hildjj/node-abnf#25 which is being fixed.

hildjj closed this as completed in #26 Jan 4, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"0x10000 does not fit in UTF-16" #25

"0x10000 does not fit in UTF-16" #25

srl295 commented Jan 2, 2025 •

edited

Loading

hildjj commented Jan 2, 2025

srl295 commented Jan 2, 2025

hildjj commented Jan 2, 2025

srl295 commented Jan 2, 2025

hildjj commented Jan 3, 2025

srl295 commented Jan 3, 2025

hildjj commented Jan 4, 2025

hildjj commented Jan 4, 2025

"0x10000 does not fit in UTF-16" #25

"0x10000 does not fit in UTF-16" #25

Comments

srl295 commented Jan 2, 2025 • edited Loading

hildjj commented Jan 2, 2025

srl295 commented Jan 2, 2025

hildjj commented Jan 2, 2025

srl295 commented Jan 2, 2025

hildjj commented Jan 3, 2025

srl295 commented Jan 3, 2025

hildjj commented Jan 4, 2025

hildjj commented Jan 4, 2025

srl295 commented Jan 2, 2025 •

edited

Loading