Recommend avoiding invalid surrogate pairs #12
Thanks. RFC 8259 addresses this, so I agree that the JSON5 spec should address this too.
May I ask why Unicode escapes are limited to UTF-16 code units?
@ell1e Unicode escapes are limited to UTF-16 code units because that's the syntax available in ES5.
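As a concrete illustration of the ES5 limitation, here is a minimal JavaScript sketch; the `\u{...}` form shown for contrast is the ES6 syntax that comes up later in this thread.

```js
// ES5's only Unicode escape, \uHHHH, names a single UTF-16 code unit,
// so a character outside the Basic Multilingual Plane needs two escapes.
var clef = "\uD834\uDD1E"; // U+1D11E MUSICAL SYMBOL G CLEF as a surrogate pair

// ES6 later added \u{...}, which names a code point directly.
var clef6 = "\u{1D11E}";

console.log(clef === clef6); // true
console.log(clef.length);    // 2 (two UTF-16 code units, one code point)
```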
Oh, how unfortunate. I get it if you want to stick with that since JSON traditionally mapped to valid JavaScript, but I would suggest at least considering a change if an opportunity presents itself (a new revision of the spec, if that ever happens). After all, it technically shouldn't be too hard to write a JavaScript parser that reads this, even when ES5 itself wouldn't accept the corresponding escape.
Just to explain this further, I feel like there are a lot of headaches in the world because of UTF-16 surrogates. E.g. the ES6 specification leads to confusing surrogating-the-surrogates in languages like Python 3, which already use surrogates to represent broken values, as in WTF-8, which is needed to retain broken surrogates from broken UTF-16 implementations (like Windows) in a UTF-32 string. In Python 3, I believe any broken surrogates will themselves need to be treated as binary and remapped with other surrogates if the conversion is supposed to be non-lossy. (At least I'm quite sure that is what CPython does internally if you decode Unicode with the error handler set to surrogate escaping.) That sort of surrogate-hacking around broken surrogates is just mind-bending to me, and it confuses me why we would want more of it in existence.

Edit: obviously, the above isn't a technical problem, since it is handled transparently and works just fine. I just find it conceptually so weird that I'm wondering: why would you even want to encourage people to keep using broken UTF-16 surrogates? I am therefore also suggesting that it might be better to reject broken UTF-16 surrogates as invalid, because if one really wants them they can be encoded another way. But that's just my opinion, obviously!
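For a concrete sense of the lossiness being described, here is a small sketch (assuming Node.js, where `Buffer` is available): a lone surrogate survives as a JavaScript string value but cannot survive a round trip through UTF-8.

```js
// A lone surrogate is a legal JavaScript string value...
const lone = "\uDEAD";
console.log(lone.length); // 1

// ...but UTF-8 cannot encode it; Node substitutes U+FFFD (EF BF BD).
const bytes = Buffer.from(lone, "utf8");
console.log(bytes);                           // <Buffer ef bf bd>
console.log(bytes.toString("utf8") === lone); // false: the round trip is lossy
```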
@ell1e You make some good points. I'm wondering why the ECMAScript Specification decided to keep unmatched surrogates rather than reject them as invalid, and whether their reasoning should pertain to JSON5. This restriction does not exist in the JSON spec either, but that may be because so many JSON documents already exist in the wild, and adding that restriction would invalidate some JSON documents, causing interoperability issues. This technically would be a breaking change, but it would still allow JSON5 to be backward compatible with ES5 and JSON. I'll have to think about this some more.
Ouch, I never noticed these broken code-points-but-actually-UTF-16 were actually in original JSON; I assumed they weren't. Looking at the spec, all it says is that a character outside the Basic Multilingual Plane is represented as a twelve-character sequence encoding the UTF-16 surrogate pair, and that's kind of all on it. Honestly, it feels like they just mixed up Unicode characters and the UTF-16 encoding (which I think was common before UTF-32 became more adopted as the "true" representation), and as if they didn't even think about the special case of invalid surrogates. Neither did Windows, or anyone else doing UTF-16 without thinking about how to integrate surrogates properly when they were added... oh well. I guess if it's broken in JSON, that's kind of a point for just sticking with it 🤷‍♀️
Oh yeah. JSON5 is a superset of JSON, so if it's allowed in JSON it needs to be allowed in JSON5. Adding this restriction would actually break compatibility with JSON. So it's a no go.
Yeah, I agree. Ignore all I said 😂 Edit: I mean, to be fair, JSON doesn't really say it's allowed either. But in practice most implementations will probably allow it, so it's probably too late to try to curb it now.
Your points are still valid, so I think it would be good to add a recommendation to avoid invalid surrogate pairs, similar to how avoiding duplicate names and using RFC 3339 formatted dates are recommendations, but not hard requirements.
Edit: the following is basically all nonsense, nevermind 😄
I'm not sure I fully understand this discussion, but it's my understanding that the `\uHHHH` escape sequence represents a UTF-16 code unit, not a Unicode code point. As such, an escape like `"\uDEAD"` denotes a lone code unit, and a pair like `"\uD834\uDD1E"` already denotes the single code point U+1D11E.
Oh, haha, my bad. Yeah, in that case it's moot; it seems like I should just read the spec better and write less 🤦‍♀️ Edit: I'm wondering though, isn't that quite unusual? I've never seen `\u` escapes refer to UTF-16 code units rather than code points in any other language.
@GULPF, you hit the nail on the head. @ell1e, Java is the exception, but escapes in Java have always been weird.
@ell1e I think I misunderstood your question. ECMA-style languages, like Java, JavaScript, and C#, treat strings as sequences of UTF-16 code units. So `\uHHHH` represents a UTF-16 code unit, not a Unicode code point.
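A quick sketch of the code-unit semantics in plain JavaScript; `JSON.parse` is used here only for illustration.

```js
// \uHHHH denotes a UTF-16 code unit, so a surrogate pair spells one code point.
const clef = JSON.parse('"\\uD834\\uDD1E"');
console.log(clef);                             // 𝄞
console.log(clef.length);                      // 2 code units
console.log(clef.codePointAt(0).toString(16)); // "1d11e": one code point

// An unpaired surrogate escape is accepted too; the parsed value is a
// one-code-unit string that is not well-formed Unicode.
const lone = JSON.parse('"\\uDEAD"');
console.log(lone.length); // 1
```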
I guess this is more of a C/C++ thing, that `\uHHHH` names an actual code point rather than a UTF-16 code unit.
ECMA-style languages are the exception. Most other languages store strings as bytes. Rust is another exception, as it stores character strings as UTF-8 and only allows `\u{...}` escapes that name valid Unicode scalar values, so unpaired surrogates are rejected outright.
Yeah, the Rust behavior is what I expected.
It would be nice to have support for ES6-style `\u{HHHHHH}` escapes. The G clef character "𝄞" (U+1D11E) would be escaped as `\u{1D11E}` instead of `\uD834\uDD1E`.
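The pair in that example falls out of the UTF-16 encoding algorithm; here is a sketch of the arithmetic with a hypothetical helper, `toSurrogatePair`:

```js
// UTF-16 encoding of a code point above U+FFFF into a surrogate pair.
function toSurrogatePair(codePoint) {
  const v = codePoint - 0x10000;     // 20 bits remain: 0x0D11E for U+1D11E
  const high = 0xd800 + (v >> 10);   // top 10 bits    -> 0xD834
  const low = 0xdc00 + (v & 0x3ff);  // bottom 10 bits -> 0xDD1E
  return [high, low];
}

const [hi, lo] = toSurrogatePair(0x1d11e);
console.log(hi.toString(16), lo.toString(16)); // "d834" "dd1e"
console.log(String.fromCharCode(hi, lo));      // 𝄞
```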
There is absolutely a requirement that \xNN turn into the BYTE with the value 0xNN; otherwise it is impossible to store arbitrary bytes as strings in JSON5. That makes it impossible to use JSON5 to represent any data that is allowed to contain arbitrary bytes, such as Unix filenames. Windows users are not living in fantasyland either: invalid \uNNNN sequences are always supported by Windows software because its developers realize the absolute nightmare of having a subset of allowed filenames or data fields that cannot be stored in or retrieved from strings.

I very much recommend that a byte-based API be added that always returns any \xNN sequence as the exact equal byte, and turns any incoming invalid UTF-8 byte (which will always have the high bit set) into an \xNN sequence. It must also turn any invalid \uNNNN sequence (an unpaired surrogate) into the obvious three-byte UTF-8-like encoding, though I am unsure whether there is a need for the reverse conversion (to \uNNNN rather than to three \xNN sequences).

There are all kinds of variations of what to do in any "wide character" API for the library: \xNN could turn into \u00NN for backward compatibility, or into \uDCNN like in WTF-8 for a somewhat more lossless experience. I don't have any real opinion on this, as the 8-bit string API is what is really needed.
@spitzak Thanks for the suggestions. How does JSON currently handle this issue?
I believe it is somewhat random whether \xNN turns into the given byte value or into a pair of bytes (first one \xC2 or \xC3), depending on the JSON library. Typically a single byte if it uses 8-bit strings internally, two bytes if it uses 16-bit strings internally. My recommendation is that the single byte be made official behavior. |
But any JSON library that parses `\u00NN` into a single byte with the high bit set is not producing valid UTF-8. In my experience, byte sequences are stored as arrays of numbers in JSON.
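A sketch of that point, assuming Node.js and the json5 npm package are available: JSON5's `\xNN` escape is shorthand for the code point U+00NN, which UTF-8 then encodes as two bytes once NN is 0x80 or above.

```js
const JSON5 = require("json5");

// \xE9 parses as the code point U+00E9, not the raw byte 0xE9.
const s = JSON5.parse('"\\xE9"');
console.log(s);              // é
console.log(s === "\u00E9"); // true

// Encoding that code point as UTF-8 yields two bytes, not a single 0xE9.
console.log(Buffer.from(s, "utf8")); // <Buffer c3 a9>
```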
Yes, making an array of numbers is one of the possible solutions, but that is extremely user-unfriendly when 99.99999% of the strings actually are readable text. The reason to allow invalid bytes is so this does not have to be done, and so filenames, including the valid portions of ones containing invalid bytes, can remain readable.
I see your point, but I'm not sure it's wise to redefine what a string is in JSON5 to accommodate this single use case of mixed text and bytes.
If your string is really an array of bytes, then it should be stored as an array of bytes, not a string. JSON (ECMA-404) defines strings as sequences of Unicode code points, JSON (RFC 8259) and JSON5 define them as sequences of Unicode characters, and ECMAScript defines them as sequences of UTF-16 code units. In every case the implication is that strings are meant for text, not binary data.
I'm not convinced that an array of bytes is any less of a security risk than a string full of `\xNN` escapes would be.
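As a baseline for the bytes-as-an-array-of-numbers approach discussed above, a minimal round-trip sketch in Node.js:

```js
// Arbitrary bytes (not valid UTF-8) round-trip exactly as an array of numbers.
const original = Buffer.from([0x66, 0x6f, 0x6f, 0xff, 0xfe]);

const text = JSON.stringify(Array.from(original));
console.log(text); // [102,111,111,255,254]

const restored = Buffer.from(JSON.parse(text));
console.log(restored.equals(original)); // true
```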
Yes, it would be a good idea to disallow \xNN for bytes that are easily encoded another way, for security reasons. This is a problem right now even if you ignore the invalid UTF-8 problem: it would be nice if various ASCII strings of interest could always be found with a grep command. If invalid bytes require an array of numbers, then you are forced to use this unsafe pattern. But it may be plausible to use some alternative, more explicit syntax for such bytes.

I am however very concerned that this unusual syntax, and the need to support it in applications rather than in the parsing library, would make things worse, not better. I really don't see any good solution other than fixing the meaning of \xNN.
That sounds like what I was complaining about. I do not want to have to write all the filenames as Base64 just because 0.00001% of them have some invalid UTF-8 in them! There must be a human-readable way to make text that contains invalid UTF-8 and invalid UTF-16. |
@spitzak I can't tell if you're agreeing with my comments or not. If not, can you please propose an alternative recommendation for the spec?
I'm now not sure I understand what you are saying. What I want is for the sequence "\xNN" to turn into the byte with the value 0xNN when a quoted string is read from the JSON5 library using a UTF-8 API. It should also return \uNNNN as three bytes when U+NNNN is an unpaired surrogate, and \uNNNN\uMMMM as four bytes when the two form a proper surrogate pair. In addition, the reverse conversions should be done when a UTF-8 string is given to the JSON5 library.

I am confused by the discussion of "text vs binary". Unless there are two different APIs, there is no way for the JSON library to tell the difference, so it has to make these escape sequences work in all strings. I guess "recommendation" means it does work all the time and you are recommending that it not be used, but I don't consider that a really useful part of the spec. You could also "recommend" that all the words in a string be spelled correctly, but that has nothing to do with JSON checking or enforcing it.
@spitzak So, I went back and read your original comments, and I think the issues you raised, while valid concerns, are moot in the context of this issue.

Unpaired surrogates can only occur when Unicode code units within the range of U+D800 through U+DFFF are used. Note that these code units are greater than U+00FF, while the `\xNN` escape sequence can only produce code points in the range of U+0000 through U+00FF, so it can never produce an unpaired surrogate.

I think some confusion comes in when we start talking about UTF-8 and UTF-16 in the context of binary strings. These are two different things. UTF-8 and UTF-16 define how Unicode code points should be converted to and from bytes, and code points and bytes do not have a one-to-one relationship. Binary strings, however, are sequences of Unicode code points, limited to the range of U+0000 through U+00FF, that each represent a single byte, giving code points and bytes a one-to-one relationship. Note that it doesn't matter whether the binary string is encoded as UTF-8 or UTF-16, because we are not analyzing the UTF-8 or UTF-16 bytes; we are analyzing the code point values as if each one were a byte.

So, I don't understand what you mean when you say you want `\xNN` to turn into the byte with the value 0xNN, since strings hold code points, not bytes. Recommending that JSON5 authors and generators avoid invalid surrogate pairs has no bearing on the `\xNN` escape sequence.
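A two-line JavaScript check of the range argument above:

```js
// The highest code point \xNN can name is U+00FF, far below the
// surrogate range U+D800..U+DFFF.
console.log("\xFF" === "\u00FF"); // true
console.log(0x00ff < 0xd800);     // true: \xNN can never form a surrogate
```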
Sorry about the confusion about bytes and unpaired surrogates. They are not related. By "UTF-8" API, I mean a function in the JSON5 library that returns an array of bytes set based on the quoted text in the file, which converts the character 'A' into 0x41 and the character '€' into 0xE2, 0x82, 0xAC (the UTF-8 encoding of the euro sign).
There is also another API that takes a UTF-8 string and writes a quoted representation of it to a JSON5 file; this must do the exact opposite conversion (i.e. it accepts any array of bytes and figures out a way to quote it, using \xNN sequences when necessary).

In addition, there is a "UTF-16" API that takes an 'A' in a quoted string and turns it into the 16-bit word 0x0041, takes '€' and turns it into 0x20AC, and takes Unicode code points greater than U+FFFF and turns them into two 16-bit words (a surrogate pair).

My other recommendations are all far less important than the "\xAB"-to-single-byte support.
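For concreteness, here is a rough sketch of the byte-oriented decode half of this proposal, in Node.js. `decodeBytes` is a hypothetical name, it handles only `\xNN` plus ordinary characters, and it is not part of any JSON5 library; a real implementation would cover every JSON5 escape.

```js
// Hypothetical: decode the body of a JSON5-style quoted string into raw
// bytes, mapping \xNN to the single byte 0xNN as proposed above. All other
// characters are encoded as UTF-8.
function decodeBytes(body) {
  const out = [];
  for (let i = 0; i < body.length; i++) {
    if (body[i] === "\\" && body[i + 1] === "x") {
      out.push(parseInt(body.slice(i + 2, i + 4), 16)); // raw byte 0xNN
      i += 3; // skip over "xNN"
    } else {
      for (const b of Buffer.from(body[i], "utf8")) out.push(b);
    }
  }
  return Buffer.from(out);
}

console.log(decodeBytes("foo\\xFFbar")); // <Buffer 66 6f 6f ff 62 61 72>
```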
@spitzak Thank you for clarifying your points. I still think there is some confusion around the term "UTF-8 string", but I believe I have a better idea of what you're proposing, even if I still don't agree with it. This seems to be an API feature request rather than a specification request, and if so, I would recommend that you open it as an issue in the json5 reference implementation repo. If you're asking to add a section to the spec recommending that JSON5 libraries implement this API, then I recommend opening that as a separate issue in this repo. In either case, let's continue this discussion elsewhere, since we're straying a bit too far from whether the spec should recommend avoiding invalid surrogate pairs.
Unicode code points in the range U+D800 to U+DFFF are used for encoding UTF-16 surrogate pairs and cannot occur on their own in a valid Unicode sequence. However, the `\uHHHH` escape sequence can be used to construct such an invalid Unicode sequence, e.g. `"\uDEAD"`.

The interpretation of invalid Unicode sequences like `"\uDEAD"` is not specified in the JSON, JSON5, or ES5 (I think) specs. However, it is specified in the ES6 spec: each `\uHHHH` escape contributes a single UTF-16 code unit to the string, even if it is an unpaired surrogate. This also matches the behavior of the JSON5 reference implementation. I think it's best if this is also clarified in the JSON5 spec.
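If a recommendation like this lands in the spec, checking documents for violations is straightforward; recent JavaScript engines (ES2024, e.g. Node 20+) even ship a built-in for it.

```js
// String.prototype.isWellFormed reports whether a string contains
// any unpaired surrogates.
console.log("\uD834\uDD1E".isWellFormed()); // true  -- a proper pair
console.log("\uDEAD".isWellFormed());       // false -- a lone surrogate
```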