Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"0x10000 does not fit in UTF-16" #25

Closed
srl295 opened this issue Jan 2, 2025 · 8 comments · Fixed by #26
Closed

"0x10000 does not fit in UTF-16" #25

srl295 opened this issue Jan 2, 2025 · 8 comments · Fixed by #26

Comments

@srl295
Copy link
Contributor

srl295 commented Jan 2, 2025

Hi, thanks for this tool.

Trying to implement NMTOKEN - https://www.w3.org/TR/REC-xml/#d0e804 which includes a range "x10000-#xEFFFF" or in abnf %x10000-EFFFF and got the error:

Error: 0x10000 does not fit in UTF-16
    at Range.escape (file:///Users/srl295/.npm/_npx/f593a35889f0a96c/node_modules/abnf/lib/ast.js:329:15)
    at Range.toFormat (file:///Users/srl295/.npm/_npx/f593a35889f0a96c/node_modules/abnf/lib/ast.js:340:25)
    at fromArray (file:///Users/srl295/.npm/_npx/f593a35889f0a96c/node_modules/abnf/lib/ast.js:60:12)
    at file:///Users/srl295/.npm/_npx/f593a35889f0a96c/node_modules/abnf/lib/ast.js:58:23
    at Array.map (<anonymous>)
    at fromArray (file:///Users/srl295/.npm/_npx/f593a35889f0a96c/node_modules/abnf/lib/ast.js:58:14)
    at Alternation.toFormat (file:///Users/srl295/.npm/_npx/f593a35889f0a96c/node_modules/abnf/lib/ast.js:189:14)
    at fromArray (file:///Users/srl295/.npm/_npx/f593a35889f0a96c/node_modules/abnf/lib/ast.js:60:12)
    at Rule.toFormat (file:///Users/srl295/.npm/_npx/f593a35889f0a96c/node_modules/abnf/lib/ast.js:369:43)
    at Rules.toFormat (file:///Users/srl295/.npm/_npx/f593a35889f0a96c/node_modules/abnf/lib/ast.js:440:21)

Of course U+10000 does fit in UTF-16 as two code units. Anyway, looking at the code perhaps this is a Peggy limitation?

Edit: yes, of course, https://peggyjs.org/documentation.html#locations and peggyjs/peggy#15 already discuss this, Peggy is BMP only.

@hildjj
Copy link
Owner

hildjj commented Jan 2, 2025

First, see https://github.com/peggyjs/peggy/blob/main/examples/xml.peggy for ideas; it was coded by hand rather than using node-abnf, if I recall correctly.

In there, you'll see this:

Char
  = '\x09'
  / '\x0A'
  / '\x0D'
  / [\x20-\uD7FF]
  / [\uE000-\uFFFD]
  // / [\x10000-\x10FFFF] We're parsing a UTF16 code unit at a time. :(
  / $([\ud800-\udbff] [\udc00-\udfff])

which is he peggy bits you need. I thought that node-abnf would generate that for you automatically. Can you attach a pared-down input that shows the problem, please?

@srl295
Copy link
Contributor Author

srl295 commented Jan 2, 2025

First, see https://github.com/peggyjs/peggy/blob/main/examples/xml.peggy for ideas; it was coded by hand rather than using node-abnf, if I recall correctly.

In there, you'll see this:

Char
  = '\x09'
  / '\x0A'
  / '\x0D'
  / [\x20-\uD7FF]
  / [\uE000-\uFFFD]
  // / [\x10000-\x10FFFF] We're parsing a UTF16 code unit at a time. :(
  / $([\ud800-\udbff] [\udc00-\udfff])

That's basically the problem. I'm not trying to include ALL code points above FFFF.

I think the special case code is only if you are doing FFFF-10FFFF exactly.

which is he peggy bits you need. I thought that node-abnf would generate that for you automatically. Can you attach a pared-down input that shows the problem, please?

Sure,

NMCHARS = NMCHAR *(NMCHAR)
NMCHAR = %x10000-EFFFE

Interestingly, %x10000-10FFFF doesn't work (triggers the error),

however %xFFFF-10FFFF does 'work' (generates the range as you noted. However U+FFFF is a non-character.

@hildjj
Copy link
Owner

hildjj commented Jan 2, 2025

willfix.

@srl295
Copy link
Contributor Author

srl295 commented Jan 2, 2025

willfix.

thanks. I'm replying to the peggy discussion too.

srl295 added a commit to srl295/cldr that referenced this issue Jan 2, 2025
hildjj added a commit that referenced this issue Jan 3, 2025
This should correctly generate UTF-16 ranges for
Peggy from any valid ABNF range.
Fixes #25.
hildjj added a commit that referenced this issue Jan 3, 2025
This should correctly generate UTF-16 ranges for
Peggy from any valid ABNF range.
Fixes #25.
@hildjj
Copy link
Owner

hildjj commented Jan 3, 2025

NMCHARS = NMCHAR *(NMCHAR)
NMCHAR = %x10000-EFFFE

with #26, this will generate:

NMCHARS
  = NMCHAR (NMCHAR)*

NMCHAR
  = "\ud800" [\udc00-\udfff]
  / [\ud801-\udb7e] [\udc00-\udfff]
  / "\udb7f" [\udc00-\udffe]

Which an exhaustive test of 0-0x10ffff makes me believe is correct. Does it look right to you?

srl295 added a commit to srl295/cldr that referenced this issue Jan 3, 2025
- add keyboard abnf and sample files and automated tests

Temporarily skip non-BMP chars,
see hildjj/node-abnf#25 which is being fixed.
@srl295
Copy link
Contributor Author

srl295 commented Jan 3, 2025

@hildjj (btw thanks for looking into all of this!)

  = "\ud800" [\udc00-\udfff]
  / [\ud801-\udb7e] [\udc00-\udfff]

minor: seems like these two could be one rule, right?

Otherwise yeah, this looks good!

@hildjj
Copy link
Owner

hildjj commented Jan 4, 2025

They could be combined into one rule in this case, but not without even more edge-case logic than is already in the current patch. I'm going to land what I have, and will accept a PR from anyone who wants to fiddle with it to optimize further.

@hildjj hildjj closed this as completed in #26 Jan 4, 2025
@hildjj
Copy link
Owner

hildjj commented Jan 4, 2025

I've left a comment in ast.js:398 that will give the necessary information to anyone that wants to fix that last remaining nit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants