Discussion: WebAssembly, Unicode and the Web Platform #1419
Perhaps a few potential solutions I collected from offline feedback so far, for consideration:

**Separate WTF-16**

In Interface Types, define: …

Define a coercion applied during linking, with the following cases: … (a rough sketch of such a coercion follows after this list.)

The coercion ensures that a … This one also introduces an ambiguity in the Web embedding in that passing a … I am not particularly attached to the name … An optimization akin to …

**UTF-any**

In Interface Types, define: …

This potential solution can be considered where well-formedness is required. It would avoid double re-encoding overhead and indirect effects on code size, but leaves the surrogate problem unaddressed. Note that …

**WTF-any**

In Interface Types, define: …

This potential solution would also require redefining … This one does not introduce lossiness on its own, so everything else indeed becomes just a post-MVP optimization.

**Integrated W/UTF-any**

In Interface Types, define: …

By doing so we achieve: …
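To make the "Separate WTF-16" option more concrete, here is a rough sketch of the kind of coercion it describes. This is mine, not part of the original list: the function name `coerceToWellFormed` is made up, and the behavior (replacing unpaired surrogates with U+FFFD when a potentially ill-formed WTF-16 string flows into a context requiring well-formed Unicode) is one plausible choice, not a settled design.

```js
// Hypothetical sketch only: coerce a potentially ill-formed WTF-16 string
// (a plain JS string) to a well-formed (USV) string by replacing unpaired
// surrogates with U+FFFD.
function coerceToWellFormed(str) {
  let out = "";
  for (let i = 0; i < str.length; ++i) {
    const c = str.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) {           // high (lead) surrogate
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {   // properly paired
        out += str.slice(i, i + 2);
        ++i;
        continue;
      }
      out += "\uFFFD";                          // unpaired lead surrogate
    } else if (c >= 0xDC00 && c <= 0xDFFF) {
      out += "\uFFFD";                          // unpaired trail surrogate
    } else {
      out += str[i];
    }
  }
  return out;
}

coerceToWellFormed("a\uD800b"); // "a\uFFFDb"
coerceToWellFormed("😀");       // "😀" (paired surrogates pass through)
```

The other direction needs no coercion, since every well-formed string is already a valid WTF-16 string.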
|
IIUC, the root issue is that IT wants strings to be sequences of Unicode code points, but some languages consider strings to be sequences of i8 or i16 values that may or may not correspond to well-formed Unicode strings. One simple solution would be to have languages/APIs that accept or produce invalid Unicode strings use … |
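As a minimal, self-contained illustration of that mismatch (plain JS, nothing IT-specific, and not from the original comment): a JS string may contain an unpaired surrogate, which has no representation as a sequence of Unicode scalar values, so forcing it through a USV-only encoding such as UTF-8 is lossy.

```js
const s = "a\uD800b"; // a 3-code-unit JS string containing a lone high surrogate

// Round-tripping through UTF-8 (a USV-only encoding) replaces the lone
// surrogate with U+FFFD, so the original value is not recovered.
const bytes = new TextEncoder().encode(s);
const back = new TextDecoder().decode(bytes);
console.log(back === s); // false
console.log(back);       // "a\uFFFDb"
```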
I think the issue is a little more nuanced, in that IT wants to define … Other than that, I think C-like byte strings as of … |
This is currently my preferred solution too - a … Adding an additional … |
Ah, does that mean that WTF-8 is not the same as a plain …? |
It states "Like UTF-8 is artificially restricted to Unicode text in order to match UTF-16, WTF-8 is artificially restricted to exclude surrogate code point pairs in order to match potentially ill-formed UTF-16." Iiuc, it treats these similarly to how UTF-8 treats overlong or truncated byte sequences. WTF-8 can represent any
WTF-16 maps 1:1 to random |
IIUC, WTF-8 is not quite the same as arbitrary … However, WTF-16 is the same as … EDIT: should have refreshed :) |
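A small sketch (mine, not from the comments above) of the distinction being drawn: JS strings behave like WTF-16, so any sequence of 16-bit code units, including lone surrogates, round-trips losslessly through a string, whereas an arbitrary byte sequence is generally not valid WTF-8.

```js
// Arbitrary 16-bit code units, including lone surrogates:
const units = Uint16Array.of(0x0061, 0xD800, 0x0062, 0xDFFF);

// Lift into a JS (WTF-16) string and lower back to code units.
const str = String.fromCharCode(...units);
const back = new Uint16Array(str.length);
for (let i = 0; i < str.length; ++i) back[i] = str.charCodeAt(i);

console.log(units.every((u, i) => u === back[i])); // true: lossless

// By contrast, a byte like 0xFF can never occur in WTF-8 (or UTF-8), so
// "WTF-8 data" is a strict subset of "arbitrary list of u8".
```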
I posted a first question/answer focused just on the question of surrogates in interface-types/#135. I think that's the high order bit and, if we can agree on that, then a subsequent discussion on supporting one or more encoding formats will be simpler. |
Thank you, Luke. If you'd be willing to support "Separate WTF-16" as outlined above (the coercion is crucial to enable accessing WASI APIs and to interface with JavaScript without glue code), I would feel comfortable with the suggested … |
Having a separate … For Web-exclusive use cases, I think it would make sense to solve the problem in the JS or Web APIs. E.g., it's easy to imagine JS APIs for "binding" wasm imports and exports. This is already an approach being taken in other emerging JS APIs, like stack-switching, and I've been wondering whether where we're going is general "bind import"/"bind export" JS APIs that are able to handle the Web-specific cases of Promises, JS strings, and typed array views. |
Technically true, but this also misses that strings would at least always work between separately compiled modules in the same language, any compatible language, and JavaScript, even without upfront knowledge of what kind of module one is interfacing with. That's typically the majority of cases, I think. As such it seems like a reasonable compromise to me, also because it allows dedicating the desired …
The alternative of occasional breakage seems way worse to me, so if that's what it takes, I think that most people will be fine with it. Perhaps a good name for the second string type (…)?
Unfortunately, in the absence of an escape hatch for affected languages, it doesn't matter much to me how sound anyone's reasoning about a purported trend is: as long as the IT MVP is going to break something somewhere for someone, and is largely useless for the JavaScript-like language I am working on, I can only oppose it. Hence I am trying to find a reasonable solution or compromise everyone can live with, and it would make me happy if we could cooperate. |
I don't see how what you're saying addresses the problems raised in interface-types/#135 or provides counterevidence that IT wouldn't be viable in general without the inclusion of a new … (FWIW, …) If we can agree on the absence of surrogates, I think it would make sense to talk about supporting UTF-16 as an additional encoding in the canonical ABI of … |
I appreciate your second paragraph, in that it would already solve some very annoying problems. I agree that supporting UTF-16 is useful separately, and I would appreciate it being added to the explainer / MVP. Count me in! I am having a hard time following your arguments in the first paragraph, however. Perhaps if you don't believe me, here is Linus Torvalds explaining a very important rule that I think extends beyond the Linux kernel: "Don't break userspace." And here is him in the same talk, upholding the programmer wisdom of "If it's a bug that people rely on, it's not a bug, it's a feature", only to continue with: …
And not having to worry about surrogates is indeed a sort of feature, in that users can do a careless … I really do not know how much more evidence I need to prove that designing something in a way that ignores the current reality is a bad idea. In fact, this seems to be OK in Interface Types only, while we are keeping every other proposal to very high standards. And while I'm not an expert on this, I think it was the Unicode standard itself making this exact same mistake in regard to the needs of UCS-2 languages by insisting on USVs, leading to about a decade's worth of similarly desperate discussions (can recommend the entire thread, but especially the last comment before it went silent), culminating in 2014 in the description of the commonly applied practical solution that is the WTF-8 encoding.
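For what it's worth, here is a tiny example (mine, not from the comment) of the kind of careless but, within a WTF-16 language, perfectly harmless operation meant here, which produces a lone surrogate from ordinary string handling:

```js
const emoji = "😀";                   // one code point, two UTF-16 code units
const firstHalf = emoji.slice(0, 1);  // "\uD83D", a lone high surrogate

// Within JS (or Java/C#/AS) this is harmless; concatenation restores the value:
console.log(firstHalf + emoji.slice(1) === emoji); // true

// But if firstHalf crosses a boundary that requires well-formed Unicode,
// it either traps or gets replaced, i.e. the round trip is no longer lossless.
```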
|
The reason JS, Java and C# have the strings they do is that, by the time Unicode realized that 2 bytes wasn't enough and thus UCS-2 wasn't viable, a bunch of code had already been written, so these languages simply didn't have a choice. The same goes for Linux syscalls exposed to userspace. In contrast, no code exists today that uses APIs defined in IT, so we don't have the same backwards-compatibility requirements. For many reasons, wasm and Interface Types are intentionally not seeking to perfectly emulate an existing single language or syscall ABI. That may be a valid goal, but it would be a separate project/standard/layer from the component model. This is the benefit of layering and scoping: we don't need one thing that achieves all possible goals. I want to reemphasize that, of course, inside a component, strings can be represented in whatever way is appropriate to the language, so we're really only talking about the semantics of APIs. As for the already-defined Web APIs: …
Thus, I still don't think we have any evidence suggesting that IT won't be viable without carrying forward these WTF-16 string semantics, which I think is the appropriate question for an MVP. |
A couple points I disagree with:
This is a separate issue now, and in the post before I was talking about what I believe to be a reasonable compromise to solve lossiness. In particular, I would be OK with your reasoning in the separate issue, but only if there is a lossless fallback available; this is not an either/or in my opinion. If not, I would remain of the opinion that WTF-8/16 is the more inclusive, less restricted choice, and as such is preferable, also because one of Wasm's high-level goals is to integrate seamlessly with the Web platform and to maintain the backwards-compatible nature of the Web, and that also applies to Interface Types.
This is sadly not sufficient in our case, where we currently have glue code like:

```js
const STRING_SMALLSIZE = 192; // break-even point in V8
const STRING_CHUNKSIZE = 1024; // mitigate stack overflow
const utf16 = new TextDecoder("utf-16le", { fatal: true }); // != wtf16

/** Gets a string from memory. */
function getStringImpl(buffer, ptr) {
  let len = new Uint32Array(buffer)[ptr + SIZE_OFFSET >>> 2] >>> 1;
  const wtf16 = new Uint16Array(buffer, ptr, len);
  if (len <= STRING_SMALLSIZE) return String.fromCharCode(...wtf16);
  try {
    return utf16.decode(wtf16);
  } catch {
    let str = "", off = 0;
    while (len - off > STRING_CHUNKSIZE) {
      str += String.fromCharCode(...wtf16.subarray(off, off += STRING_CHUNKSIZE));
    }
    return str + String.fromCharCode(...wtf16.subarray(off));
  }
}
```

First, since we care a lot about Chrome and Node.js, we found that V8's … On the other hand, Rust for example would not need this, which is one of the reasons why I think that IT, right now, is not as neutral as it should be. In general I think that the point of IT …
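To make the failure mode concrete, here is a small standalone illustration (not part of the original glue code) of why the try/catch fallback above is needed: a fatal "utf-16le" TextDecoder rejects lone surrogates, while String.fromCharCode, like the module's own strings, preserves arbitrary code units.

```js
const units = Uint16Array.of(0x0061, 0xD800, 0x0062); // "a", lone surrogate, "b"

try {
  new TextDecoder("utf-16le", { fatal: true }).decode(units);
} catch (e) {
  console.log("fatal decoder rejects ill-formed UTF-16:", e.constructor.name); // TypeError
}

console.log(String.fromCharCode(...units).length); // 3, all code units preserved
```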
The first half is technically true, since IT does not exist yet, but IIUC our requirements do include improving existing use cases, like, for example, accounting for the clumsy chunk of glue code above. Ideally for as many languages as possible, so that post-MVP indeed becomes "just an optimization", as you said in your presentation. Right now, in contrast, IT basically starts with what is already an optimization for languages that can make use of a UTF-8 encoder/decoder, which I think is not neutral.
I read this as if I were of this opinion, which I totally am not. I am willing to give you the benefit of the doubt here, but would like to add that, in my opinion, IT is currently unnecessarily restricted and as such serves only a very specific set of languages well. In contrast, WTF-8/16 is the more inclusive encoding that I would have expected to be the logical default, also because it round-trips to JS strings. We disagree here, but only in the absence of a proper escape hatch. If a viable lossless alternative existed, so that nobody is broken or unnecessarily disadvantaged, I would be fine with your reasoning about the default string type.
We disagree here. In particular, I think my presentation and comments establish reasonable doubt: the issue may, in some cases, even if rare, be very meaningful (say where integrity is required), and I am of the opinion that "we should be very reluctant to introduce hazards hoping to improve our Unicode hygiene." That is, if we can, I believe we should design the canonical ABI in a way that is guaranteed to work in the following important cases as well: Java/C#/AS<->JS and Java/C#/AS<->Java/C#/AS. Replacement on other paths is probably unavoidable, but at least languages and users have a choice then, and the default is not already broken in rare cases.
In the presence of reasonable doubt and the absence of willingness to explore what I believe to be a reasonable compromise, I would expect the burden of proof to now be on you. Again, I am willing to leave the default string type to you and a well-formed future, but not at the expense of failing to account for what may be rare, but real, hazards. Many popular languages can be affected by this, and that may become really hard to justify in the future once they realize it. |
I agree that the JS glue code isn't ideal, but I think the right fix for that is in the JS API or in JS, not by adding the concept of wtf-16-string to the whole future component ecosystem. Beyond that, I don't see new information to respond to that hasn't already been responded to; it seems like we're mostly disagreeing on questions of goals/scope. |
I would expect the … The even more interesting anomaly, however, is that there isn't even a …:

```js
/** Allocates a new string in the module's memory and returns its pointer. */
function __newString(str) {
  if (str == null) return 0;
  const length = str.length;
  const ptr = __new(length << 1, STRING_ID);
  const U16 = new Uint16Array(memory.buffer);
  for (var i = 0, p = ptr >>> 1; i < length; ++i) U16[p + i] = str.charCodeAt(i);
  return ptr;
}
```

As you can see, this is a major pain point for something like Java, C#, AS and others, and both of these would still be necessary when a … |
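For completeness, a usage sketch (assuming the module-provided `__new`, `memory`, `STRING_ID` and `SIZE_OFFSET` referenced above are in scope): the `__newString`/`getStringImpl` pair round-trips any JS string losslessly, including ones containing lone surrogates, because both sides speak WTF-16; that is exactly the property a USV-only boundary would not guarantee.

```js
// Hypothetical round trip through the module's memory using the glue above.
const original = "a\uD800b";                    // ill-formed UTF-16, but a valid JS string
const ptr = __newString(original);              // copy the JS string into module memory
const copy = getStringImpl(memory.buffer, ptr); // read it back out
console.log(copy === original);                 // true: no replacement, no trap
```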
There's a whole space of options beyond … |
FYI: Added the "Integrated W/UTF-any" option that came up in WebAssembly/interface-types#135 (comment) to the list of suggestions above :) |
This issue is for accompanying discussion of "WebAssembly, Unicode and the Web Platform". The presentation is pre-recorded, which is what we decided to try out in WebAssembly/meetings#775, with discussion time scheduled for June 22nd's CG video meeting.
Please note that I mention some concepts that I would expect to be well known among CG members, but I decided to include them nonetheless to also make the presentation approachable for those unfamiliar with the topic. Feedback welcome!
Related issues: …, in particular the second outstanding vote (May 25th CG video meeting) on whether we should proceed with the next steps.