surrogate pair handling is slightly off #1804
Comments
The kernel line discipline does a couple of things to the data and expects it to be ASCII-compatible, so UTF-16 cannot be used here. (No clue about Windows.)
I guess you already do this when handling data arriving from the apps, don't you? See e.g. https://bugzilla.gnome.org/show_bug.cgi?id=154896 (one of my earliest contributions to vte) or https://phab.enlightenment.org/T655 (a still unfixed one) – quite a few terminal emulators run into this bug early in their development :)
Yes, I am aware of that, that's why we currently handle the pty data as UTF8 encoded (which gets penalized by the JS string conversion, as it converts a valid ASCII char into 2 bytes). Windows switched to UTF8 with their new conpty implementation, it seems; before that it was either some 8bit codepage (what they call "ANSI encoding", like in old Unix systems) or UTF16.

About UTF8 in the emulator - nope, we don't care about UTF8 at all atm, we currently leave the half-transmitted problem to nodejs (in node-pty) or the JS engine (when the data arrives as a JS string in the browser engine). The emulator currently parses UTF16 (since that's the closest thing to JS strings) and transcodes the chars to UTF32 for the terminal buffer. With the new typed array buffer we can now decide which encoding to go with internally, and I think we can save some back-and-forth encoding by supporting UTF8 byte sequences as data entry. (Still undecided, see #791 (comment).)

Edit: I am pretty sure we don't want to go full UTF8 for the internal representation, as we would inherit the multibyte problem all over the place, which complicates string handling a lot (I've started a Rust clone of the parser + input handler; even there, with Rust's native UTF8 strings, it performs worse than UTF16). UTF16 might be worth a shot for the internal buffer, as surrogate pairs are quite rare. But this is kinda offtopic here...
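For context, a minimal sketch of the UTF16 → UTF32 transcoding step mentioned above (illustrative only, not the actual xterm.js code): surrogate pairs get combined into a single codepoint before being written to the buffer.

```typescript
// Sketch: transcode a JS (UTF16) string into UTF32 codepoints.
// Returns the number of codepoints written to `target`.
function utf16ToUtf32(data: string, target: Uint32Array): number {
  let size = 0;
  for (let i = 0; i < data.length; ++i) {
    const code = data.charCodeAt(i);
    if (code >= 0xD800 && code <= 0xDBFF) {        // high surrogate
      const low = data.charCodeAt(i + 1);
      if (low >= 0xDC00 && low <= 0xDFFF) {        // valid pair -> combine
        target[size++] = ((code - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
        ++i;
        continue;
      }
    }
    target[size++] = code;                         // BMP char or lone surrogate
  }
  return size;
}
```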
Closing the issue, as it got resolved automatically by the UTF32 decoding.
The surrogate pair handling in `InputHandler` is slightly off - currently we preserve any last char in `InputHandler.print` that counts as a first surrogate and re-add it to the next data chunk. Since `InputHandler.print` gets called for every printable slice of chars within a single chunk of data, this possibly skips printable slices in the data chunk and wrongly moves the surrogate char to the next chunk.

Solution: We should only check for surrogates at the real chunk end and re-add the char to the next chunk.
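A minimal sketch of the proposed fix (the class and method names here are illustrative, not the actual xterm.js API): the possible trailing high surrogate is stripped once per whole chunk, before the data gets split into printable slices.

```typescript
// Illustrative sketch: defer a trailing high surrogate at the *chunk* level,
// instead of inside every print call.
class ChunkDecoder {
  private _pendingHighSurrogate = '';

  decodeChunk(chunk: string): string {
    let data = this._pendingHighSurrogate + chunk;
    this._pendingHighSurrogate = '';
    const last = data.charCodeAt(data.length - 1);
    // high surrogates occupy the charCode range 0xD800-0xDBFF
    if (last >= 0xD800 && last <= 0xDBFF) {
      // chunk ends with an unpaired high surrogate - keep it for the next chunk
      this._pendingHighSurrogate = data[data.length - 1];
      data = data.slice(0, -1);
    }
    return data;  // now safe to split into printable slices for the parser
  }
}
```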
Why do we care at all? Isn't it guaranteed that we only get valid unicode codepoints? - No. A few notes on the background:
JS strings are represented as UCS2 with an optional UTF16 surrogate pair extension. This means the browser will correctly decode to a high plane unicode codepoint if the 16bit charCodes form a valid surrogate pair (UTF16 handling), but it will also show something when a single charCode from the surrogate range appears on its own (UCS2 handling).
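To illustrate this UCS2/UTF16 duality (a sketch; the codepoints are just examples):

```typescript
const pair = '\uD83D\uDE00';       // valid surrogate pair -> U+1F600 (😀)
console.log(pair.length);          // 2: still two 16bit charCodes (UCS2 view)
console.log(pair.codePointAt(0));  // 128512 = 0x1F600 (decoded as UTF16)

const lone = '\uD83D';             // unpaired high surrogate
console.log(lone.codePointAt(0));  // 55357 = 0xD83D (kept as a raw UCS2 unit)
```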
A terminal emulator typically speaks to some program running behind a pseudoterminal. Nowadays all pseudoterminals default to UTF8 as the transport encoding, even Windows since one of the recent rolling releases (well, that's only partially true - pseudoterminals do not care much about encodings at all and rather pump raw bytes that happen to be UTF8 encoded, which can lead to weird mixes of UTF8 and raw bytes).
`node-pty` decodes the UTF8 byte sequences into a JS string that gets forwarded to the emulator. On the way to the emulator the data has to go through several buffers with several back-and-forth encodings. Most transport layers only care about byte sequences; since our emulator's data entry point is a JS string, Javascript guarantees that it forms valid UCS2 charCodes, but it does not "wait" for valid UTF16 surrogates. Thus a surrogate pair can span 2 chunks of data.

Because of the UTF8 source type of the data we can safely assume UTF16 in the emulator, but we have to do the surrogate pair matching explicitly at the chunk borders, since JS does not do it for us.
Btw, the same applies to UTF8 input: if we decide to go full UTF8 we have to handle partially transferred codepoints as well.
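As a small illustration of that UTF8 case (a sketch using the standard `TextDecoder`, which buffers incomplete byte sequences internally in streaming mode):

```typescript
const decoder = new TextDecoder('utf-8');

// '€' is the 3-byte sequence 0xE2 0x82 0xAC - split across two chunks here:
const chunk1 = new Uint8Array([0xE2, 0x82]);
const chunk2 = new Uint8Array([0xAC]);

let out = decoder.decode(chunk1, { stream: true });  // '' - sequence incomplete
out += decoder.decode(chunk2, { stream: true });     // '€' - completed now
console.log(out);  // '€'
```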