Bytes vs. characters and "cookie charset" #15

Closed
bsittler opened this issue Jul 12, 2016 · 11 comments

@bsittler
Contributor

bsittler commented Jul 12, 2016

Most modern browsers assume UTF-8 when exposing cookie data to scripts and <meta http-equiv=set-cookie ... >, but IE and Edge instead use the system locale's "ANSI" codepage (with silent lossy conversion on write), which breaks interoperability in practice. The cookie jar itself seems to be byte-oriented and eight-bit-clean in all modern browsers. URL-encoding or Base64 armoring is possible, but both add a lot of overhead (encodeURIComponent and escape inflate characters up to 3x, base64 roughly 1.33x), decrease readability and debuggability (the data is often user-entered, and users can look at it with browser cookie jar inspectors), and (in the case of base64) lack a built-in codec in IE. Length inflation also runs up against cookie length limits and per-domain cookie jar size caps.
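
To make the overhead concrete, here is a minimal sketch comparing encoded sizes for a short non-ASCII value. `encodedSizes` is a hypothetical helper, not part of any cookie API, and the base64 figure is computed from the usual rule of four output characters per three input bytes rather than by calling a codec.

```javascript
// Hypothetical helper: size of a cookie value under each encoding.
function encodedSizes(value) {
  // Raw UTF-8 byte count, i.e. what an eight-bit-clean cookie jar would store.
  const utf8Bytes = new TextEncoder().encode(value).length;
  return {
    utf8: utf8Bytes,
    // Percent-encoding turns every non-ASCII UTF-8 byte into "%XX" (3 chars).
    percent: encodeURIComponent(value).length,
    // Padded base64 emits 4 characters for every 3 input bytes.
    base64: 4 * Math.ceil(utf8Bytes / 3),
  };
}

const sizes = encodedSizes("日本語");
console.log(sizes); // { utf8: 9, percent: 27, base64: 12 }
```

For this value, percent-encoding triples the stored size and base64 adds a third, which matters once a value approaches the per-cookie length limit.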

As a result, sites storing non-ASCII data (often user input) in cookies either have to accept some degree of cross-browser incompatibility or have to use an ugly and inefficient workaround. On the server side, the best hope for portably encoding/decoding cookies that will be shared with scripts and/or set in HTML is guessing: User-Agent sniffing combined with approximation based on IP geolocation, Accept-Language analysis, and/or the script-provided, IE-specific navigator.systemLanguage.

Given all this, I think it would be nice to have the new async cookies API allow easy use of raw UTF-8 in all browsers but also provide a way to read and write cookies in the browser's default "cookie charset" as well as raw bytes.

@domenic

domenic commented Jul 12, 2016

> Given all this, I think it would be nice to have the new async cookies API allow easy use of raw UTF-8 in all browsers but also provide a way to read and write cookies in the browser's default "cookie charset" as well as raw bytes.

I don't think this conclusion is warranted. I would instead say:

> Given all this, I think it would be nice to have the new async cookie APIs operate purely on USVStrings and not make the author ever worry about any of this stuff.

@bsittler
Contributor Author

That would be ideal for newly-built sites, but will cause unrecoverable data corruption if an existing site is migrating to the new API, especially if they still need to use document.cookie or <meta http-equiv=set-cookie ...> elsewhere in the same domain with non-ASCII data.

@bsittler
Contributor Author

Also, I agree that USVString is the right script-level interface to the feature; by "raw UTF-8" I meant that the USVString-using interface should use UTF-8 encoding when serializing/deserializing cookies so that it is compatible on the server side with the existing behavior of most modern browsers.
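
As a rough illustration of that split (strings at the script level, UTF-8 bytes in the cookie jar), the sketch below uses the standard TextEncoder and TextDecoder. The names `serializeCookieValue` and `deserializeCookieValue` are hypothetical and are not taken from the explainer.

```javascript
const encoder = new TextEncoder();        // TextEncoder always produces UTF-8
const decoder = new TextDecoder("utf-8");

// Hypothetical helpers standing in for what the API would do internally.
function serializeCookieValue(value) {
  // encode() takes a USVString, so a lone surrogate becomes U+FFFD first.
  return encoder.encode(value);
}

function deserializeCookieValue(bytes) {
  return decoder.decode(bytes);
}

// Well-formed text survives the round trip unchanged.
const roundTripped = deserializeCookieValue(serializeCookieValue("naïve ☕"));

// A lone surrogate is not a Unicode scalar value; it comes back as U+FFFD.
const withLoneSurrogate = deserializeCookieValue(serializeCookieValue("a\uD800b"));

console.log(roundTripped === "naïve ☕");        // true
console.log(withLoneSurrogate === "a\uFFFDb");  // true
```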

@domenic

domenic commented Jul 12, 2016

It would mean that existing sites cannot use the new API if they have stored non-ASCII data with the old API, it's true. But given that it's impossible to store such data in a portable way today, that seems fine.

@bsittler
Contributor Author

Right, but it also breaks interoperation between document.cookie/meta h-e=s-c and the async API on the same site. There would be no way to round-trip data in IE between the two interfaces.
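
A small sketch of the round-trip failure in both directions. It assumes Latin-1 as a simplified stand-in for the system "ANSI" codepage (real codepages vary by locale), and `legacyEncode`/`legacyDecode` are hypothetical helpers that model what the legacy interfaces would do.

```javascript
const utf8Encoder = new TextEncoder();
const utf8Decoder = new TextDecoder("utf-8"); // non-fatal: bad bytes become U+FFFD

// Assumption: Latin-1 models the legacy single-byte codepage.
function legacyEncode(text) {
  return Uint8Array.from(text, (ch) => {
    const codePoint = ch.codePointAt(0);
    return codePoint <= 0xff ? codePoint : 0x3f; // "?" models silent lossy conversion
  });
}

function legacyDecode(bytes) {
  return String.fromCharCode(...bytes);
}

// Written through the legacy path, read back as UTF-8: the é byte (0xE9) is invalid.
const legacyThenUtf8 = utf8Decoder.decode(legacyEncode("café"));

// Written as UTF-8, read back through the legacy path: classic mojibake.
const utf8ThenLegacy = legacyDecode(utf8Encoder.encode("café"));

console.log(legacyThenUtf8); // "caf\uFFFD"
console.log(utf8ThenLegacy); // "cafÃ©"
```

Neither result can be mapped back to the original text reliably, which is the corruption being described.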

@domenic

domenic commented Jul 12, 2016

That's fair. But the correct way to fix that is to make those features interoperable, instead of adding a new set of APIs that extend that non-interoperability and continue to behave differently in all browsers.

@bsittler
Contributor Author

bsittler commented Jul 12, 2016

Agreed. I guess a possibly better resolution would be for Edge (Edit: originally said IE) to store a new "UTF-8" cookie parameter on each cookie that touches the new API, and to do a one-time conversion from the legacy encoding to UTF-8 the first time an existing cookie touches the new API and is not decodable as UTF-8 (it may have been server-set in UTF-8, after all). This, however, runs the risk of breaking cookies whose values aren't actually UTF-8 but happen to decode as UTF-8. Edit 2: Actually, such a flag is only needed on writes.
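
A sketch of that one-time conversion heuristic, including the false-positive case. `migrateToUtf8` is a hypothetical name, and Latin-1 is again assumed as a stand-in for the legacy codepage; this is not how any browser is known to implement it.

```javascript
// Hypothetical one-time migration: keep bytes that are valid UTF-8,
// otherwise reinterpret them as the legacy codepage and re-encode as UTF-8.
function migrateToUtf8(bytes) {
  try {
    // fatal: true makes decode() throw a TypeError on invalid UTF-8.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return bytes; // valid UTF-8, or legacy bytes that merely look like it
  } catch (err) {
    const legacyText = String.fromCharCode(...bytes); // assumption: Latin-1
    return new TextEncoder().encode(legacyText);
  }
}

const show = (bytes) => new TextDecoder("utf-8").decode(bytes);

// Legacy "café" (0xE9 for é) is not valid UTF-8, so it gets converted.
const migrated = migrateToUtf8(Uint8Array.of(0x63, 0x61, 0x66, 0xe9));
console.log(show(migrated)); // "café"

// Risk case: a legacy writer that meant "Ã©" stored 0xC3 0xA9, which is also
// valid UTF-8 for "é", so the heuristic leaves it alone and the meaning changes.
const ambiguous = migrateToUtf8(Uint8Array.of(0xc3, 0xa9));
console.log(show(ambiguous)); // "é"
```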

@annevk
Collaborator

annevk commented Jul 18, 2016

I agree with @domenic. Fix the core problem, don't paper over it with additional APIs.

@bsittler
Contributor Author

@adrianba @aliams Any idea how best to reach cross-browser interoperability on cookie charset? It looks like most modern browsers other than IE/Edge use UTF-8 for this; see https://inikulin.github.io/cookie-compat/#CHARSET0001 (and nearby cases) for data and whatwg/html#804 for context.

@bsittler
Contributor Author

Also, Safari seems to truncate at the first non-ASCII byte!

The explainer currently mandates UTF-8 interpretation of bytes for predictable interoperation and affordable internationalization (in terms of both byte count in the cookie jar and complexity). I'd love to hear your thoughts on it, and would be happy to address any outstanding issues (perhaps in pending pull request #17?).

@bsittler
Contributor Author

Closing this issue for now, but I'm happy to reopen this discussion if browser implementations are not ready for consistent UTF-8 handling for cookies or would like to have a more detailed discussion of how to change this behavior without breaking apps.

pwnall added a commit that referenced this issue Feb 5, 2020
1) Fixes all instances of await being used in non-async functions.
2) Fixes unfinished sentences noticed along the way.

Fixes #14, Fixes #15