-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
"0x10000 does not fit in UTF-16" #25
Comments
First, see https://github.com/peggyjs/peggy/blob/main/examples/xml.peggy for ideas; it was coded by hand rather than using node-abnf, if I recall correctly. In there, you'll see this: Char
= '\x09'
/ '\x0A'
/ '\x0D'
/ [\x20-\uD7FF]
/ [\uE000-\uFFFD]
// / [\x10000-\x10FFFF] We're parsing a UTF16 code unit at a time. :(
/ $([\ud800-\udbff] [\udc00-\udfff]) which is he peggy bits you need. I thought that node-abnf would generate that for you automatically. Can you attach a pared-down input that shows the problem, please? |
That's basically the problem. I'm not trying to include ALL code points above FFFF. I think the special case code is only if you are doing FFFF-10FFFF exactly.
Sure, NMCHARS = NMCHAR *(NMCHAR)
NMCHAR = %x10000-EFFFE Interestingly, however |
willfix. |
thanks. I'm replying to the peggy discussion too. |
- skip non-BMP rules for testing - see hildjj/node-abnf#25
This should correctly generate UTF-16 ranges for Peggy from any valid ABNF range. Fixes #25.
This should correctly generate UTF-16 ranges for Peggy from any valid ABNF range. Fixes #25.
with #26, this will generate: NMCHARS
= NMCHAR (NMCHAR)*
NMCHAR
= "\ud800" [\udc00-\udfff]
/ [\ud801-\udb7e] [\udc00-\udfff]
/ "\udb7f" [\udc00-\udffe] Which an exhaustive test of 0-0x10ffff makes me believe is correct. Does it look right to you? |
- add keyboard abnf and sample files and automated tests Temporarily skip non-BMP chars, see hildjj/node-abnf#25 which is being fixed.
@hildjj (btw thanks for looking into all of this!) = "\ud800" [\udc00-\udfff]
/ [\ud801-\udb7e] [\udc00-\udfff] minor: seems like these two could be one rule, right? Otherwise yeah, this looks good! |
They could be combined into one rule in this case, but not without even more edge-case logic than is already in the current patch. I'm going to land what I have, and will accept a PR from anyone who wants to fiddle with it to optimize further. |
I've left a comment in ast.js:398 that will give the necessary information to anyone that wants to fix that last remaining nit. |
Hi, thanks for this tool.
Trying to implement NMTOKEN - https://www.w3.org/TR/REC-xml/#d0e804 which includes a range "x10000-#xEFFFF" or in abnf
%x10000-EFFFF
and got the error:Of course U+10000 does fit in UTF-16 as two code units. Anyway, looking at the code perhaps this is a Peggy limitation?
Edit: yes, of course, https://peggyjs.org/documentation.html#locations and peggyjs/peggy#15 already discuss this, Peggy is BMP only.
The text was updated successfully, but these errors were encountered: