gh-74668: Fix encoded unicode in url byte string #93757
Closed
When `urllib.parse.parse_qs` is called with a byte string, an encoding can be provided which is used to decode the byte string. However, when parsing, `parse.py` uses 'ascii' encoding to re-encode the parsed data. This breaks utf-8 encoded URLs received as byte strings.

This change uses the encoding passed to `parse_qs` to re-encode the parsed data.

Caveat
This is probably not the correct solution, but it gets us closer to a working implementation whose behavior, in the worst case, can be dictated by the caller of `parse_qs`.

My understanding of the problem is as follows.
`parse_qs` detects that `qs` is a byte string and decodes it according to the `encoding` parameter. After parsing the decoded input, it re-encodes the result (because it detected that the input was a byte string), but instead of using the value of the `encoding` parameter, it uses 'ascii'. The decoding and encoding steps thus use different codecs. This PR fixes that by using the `encoding` parameter value to re-encode the parsed data.

However, this is not a complete solution.
The problem is that there are, in essence, two encodings involved: one used to encode/decode the byte string, and another used to encode/decode the URL (which is utf-8 encoded).
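As a rough illustration of the two layers, here is a minimal sketch that separates them by hand. It uses only `urllib.parse.unquote` on a `str`, so it does not depend on how `parse_qs` treats bytes; the sample input `b"name=caf%C3%A9"` and the variable names are made up for this illustration.

```python
from urllib.parse import unquote

# Hypothetical sample input, chosen only for illustration.
raw = b"name=caf%C3%A9"

# Layer 1: the byte string itself contains only ASCII characters,
# so an ASCII decoder is enough to turn it into text.
text = raw.decode("ascii")
key, _, escaped = text.partition("=")

# Layer 2: the percent-escapes inside the URL represent UTF-8 bytes.
value = unquote(escaped, encoding="utf-8")

# The decoded value no longer fits in ASCII, so re-encoding it with the
# layer-1 codec fails, while the layer-2 codec works.
try:
    value.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

utf8_bytes = value.encode("utf-8")
```

The point of the sketch is only that the codec which successfully decodes the incoming bytes is not necessarily able to encode the parsed result.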
Take `b"a=a%E2%80%99b"` as an example. This is a valid `ascii` encoded byte string and can be decoded with an `ascii` decoder. However, after decoding and parsing, it will not produce valid ascii: parsing `a%E2%80%99b` produces `a’b`, which cannot be `ascii` encoded.

A possible solution would be to pass a `reencoding` parameter, but since it's unlikely that callers will have different `encoding` and `reencoding` parameters, this PR opts for reusing the `encoding` parameter.
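To make the trade-off concrete, here is a minimal sketch of what a separate `reencoding` parameter could look like. The helper `parse_qs_bytes` is hypothetical and is not part of `urllib.parse` or of this PR's diff. It assumes one possible reading of the idea: `encoding` decodes the incoming bytes, while `reencoding` decodes the percent-escapes and encodes the result, defaulting to `encoding` when omitted (which mirrors the choice made in this PR).

```python
from urllib.parse import parse_qs


def parse_qs_bytes(qs, encoding="utf-8", reencoding=None):
    """Hypothetical helper, NOT part of urllib.parse.

    Decodes a bytes query string, parses it as str, and re-encodes
    the keys and values to bytes.
    """
    if reencoding is None:
        # Reuse `encoding` for both directions, as this PR does.
        reencoding = encoding

    # Decode the byte string itself.
    text = qs.decode(encoding)

    # Parse as str; percent-escapes are decoded with `reencoding`
    # (an assumption of this sketch).
    parsed = parse_qs(text, encoding=reencoding)

    # Re-encode with the same codec used for the percent-escapes.
    return {
        key.encode(reencoding): [v.encode(reencoding) for v in values]
        for key, values in parsed.items()
    }


# Same codec in both directions.
same = parse_qs_bytes(b"a=a%E2%80%99b", encoding="utf-8")

# Different codecs: ASCII for the bytes, UTF-8 for the content.
split = parse_qs_bytes(b"a=a%E2%80%99b", encoding="ascii", reencoding="utf-8")

print(same)   # {b'a': [b'a\xe2\x80\x99b']}
print(split)  # {b'a': [b'a\xe2\x80\x99b']}
```

In this sketch both calls yield the same result, which is consistent with the reasoning above that callers rarely need the two parameters to differ.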