Node.js / HTTP-Parser not handling UTF-8 encoded HTTP header values #17390
Comments
I don't think Node should support something that goes against the spec. In this case Node assumes that any encoding is done by the user and values received conform to the spec. The problem with changing this is that there's going to be a significant performance penalty. More to the point, if you've seen examples in the wild then file a bug against those implementations and not Node. The HTTP spec doesn't (and shouldn't) evolve based on faulty implementations.
/cc @nodejs/http @nodejs/http-parser
@apapirovski On the other hand, IRIs, whose value is reflected in HTTP, are starting to get adopted and gain momentum. Maybe it's possible to have a white-list of headers like Setting it as an option for the request parser with a default to
Even IRIs specifically mention this as being problematic in relation to http and cover percent encoding extensively.
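For the sending side, here is a minimal sketch of what percent-encoding looks like in practice. It uses the standard `encodeURI` / `decodeURI` globals; the variable names and the sample path are only illustrative and are not taken from any implementation discussed in this thread.

```javascript
// Illustrative path containing non-ASCII characters (U+00D6 and U+00C3).
// Unicode escapes are used so the snippet does not depend on file encoding.
const iriPath = '/f\u00D6\u00D6b\u00C3\u00C3r';

// encodeURI percent-encodes the UTF-8 bytes of each non-ASCII character,
// so the result is safe to place in a Location header as plain ASCII.
const encodedLocation = encodeURI(iriPath);

// Check that every character is printable US-ASCII.
const isAsciiOnly = /^[\x20-\x7E]*$/.test(encodedLocation);

console.log(encodedLocation);
console.log(isAsciiOnly);
// The receiver can recover the original path with decodeURI.
console.log(decodeURI(encodedLocation) === iriPath);
```

A server that sends the encoded form never puts bytes above 0x7F on the wire, so the question of how the client decodes the header does not arise.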
@apapirovski The spec here is actually wrong or at least not specific enough. Under RFC2616, IRIs are disallowed only in the context of the URL and path-related tokens, but they are allowed as part of header values. (By path-related tokens I mean: "URI-reference", "absoluteURI", "relativeURI", "port", "host", "abs_path", "rel_path" and "authority" defined in RFC1808 and RFC2396 and used in RFC2616.) Historically, standards haven't always been very precise or reflected the current state of affairs, which is how the concept of a "de facto standard" appeared. This is on top of the fact that before RFC7230, the bytes of a UTF-8 representation were actually allowed by spec to be transmitted. Whether Node.js should follow the standard or the de facto standard is of course debatable, but I would argue that it wouldn't be the first time it followed the latter, as suggested by the relatively recent URL parser implementation in Node.js 7.x. e.g:
Which shows that Node.js's URL parser will accept the explicitly forbidden rare IP address formats (https://tools.ietf.org/html/rfc3986#section-7.4) so as to correctly convert them to the dotted decimal format.
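As an illustration of that behaviour (a sketch, not the snippet originally posted with this comment), the WHATWG `URL` class that Node.js exposes as a global normalises hexadecimal and single-number IPv4 hosts to dotted decimal. The two sample hosts are chosen here only for demonstration.

```javascript
// Hexadecimal first part with a shortened two-part form.
const hexHost = new URL('http://0x7f.1/').hostname;

// A single 32-bit number written in decimal (2130706433 === 0x7F000001).
const numericHost = new URL('http://2130706433/').hostname;

// Both are normalised by the WHATWG URL parser to dotted decimal.
console.log(hexHost);
console.log(numericHost);
```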
RFC 7230 is quite specific on the matter,
And here:
The spec cannot be "wrong", but it is possible that some implementations of the spec are not consistent or compliant with the spec. Specifically, the rule in the spec that says "A recipient SHOULD treat octets in field content (obs-text) as opaque data" specifically means that implementations are not supposed to interpret UTF-8 bytes. Before RFC 7230, RFC 2616 was ambiguous about header values. The spec did not say anything about how header value bytes were to be interpreted so any claim that UTF-8 was allowed by the spec is incorrect. Pre RFC 7230, header values were strictly opaque sequences of octets. RFC 7230 specifically eliminates that ambiguity. Extended characters, including those contained within IRIs, are to be handled using RFC2047 encoding and anything beyond that is considered opaque, uninterpreted data. The new WHATWG URL parser is a fundamentally different thing, with a different set of rules. It was written to conform to the WHATWG URL Standard and not to RFC 3986.
I would say this has been answered in a satisfactory manner by @jasnell. Please feel free to reopen if you believe I've closed this in error or there's any new information.
Why not support UTF-8 strings in headers?
for who use the
Hi,
I've encountered an issue regarding the way HTTP header values are decoded.
The HTTP Parser project might be a better place to post this issue to but I thought I'd post here first.
Currently it would seem that Node.js is decoding HTTP header values as US-ASCII / ASCII-7.
This becomes an issue now that browsers and servers have started supporting UTF-8 values as well.
A simple example would be a website that has a URL that redirects to a non-percent-encoded UTF-8 URL. e.g:
The first log will produce:
Location: /fÃÃbÃÃr
(there are invisible characters next to the Ã's).
The second log will produce:
Location: /fÖÖbÃÃr
which is the correct and expected result.
In this example, to follow the redirect, you'd need to first instantiate a buffer in 'binary' encoding and then stringify it to its 'utf8' representation.
The original RFC2616 that defined HTTP seemed to allow for any byte value with a few restrictions on control characters:
cf: https://tools.ietf.org/html/rfc7230#section-3.2.6
cf: https://tools.ietf.org/html/rfc2616#section-2.2
The follow-up update to HTTP, RFC7230, seems to change that and restrict them to US-ASCII / ASCII-7:
cf: https://tools.ietf.org/html/rfc7230#appendix-A.2
I would expect most Node.js HTTP clients to thus fail on the example I provided above.
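The workaround mentioned above (reading the header value back as 'binary' bytes and decoding those bytes as UTF-8) could be sketched as follows. `redecodeHeaderAsUtf8` is a hypothetical helper name, not a Node.js API. The sketch assumes the received string holds one character per wire byte, which is what the 'binary' workaround implies. No network request is made here; the received value is simulated with `Buffer`.

```javascript
// Hypothetical helper (not part of Node.js): turn a header value that was
// decoded one byte per character back into the UTF-8 string it encodes.
// 'latin1' is the current name for the 'binary' encoding in Node.js.
function redecodeHeaderAsUtf8(value) {
  return Buffer.from(value, 'latin1').toString('utf8');
}

// The Location value the server intended to send (U+00D6 and U+00C3,
// written with escapes so the snippet does not depend on file encoding).
const sentLocation = '/f\u00D6\u00D6b\u00C3\u00C3r';

// Simulate what user code would see under the assumption above:
// the UTF-8 bytes of the value, each mapped to a single character.
const receivedLocation = Buffer.from(sentLocation, 'utf8').toString('latin1');

// The raw value does not match what was sent; the re-decoded value does.
const fixedLocation = redecodeHeaderAsUtf8(receivedLocation);
console.log(receivedLocation === sentLocation);
console.log(fixedLocation === sentLocation);
```

This only makes sense when the sender really did emit UTF-8; applied to a header that was genuinely Latin-1, the same conversion would corrupt it.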
Since browsers seem to support it and servers have started sending it (I've seen examples in the wild), I think we can say that it has become a de facto standard and that it would be nice if either Node.js core or HTTP Parser would support reading HTTP header values as UTF-8 by default.