gh-74668: Fix encoded unicode in url byte string #93757

Closed
roaanv wants to merge 4 commits

@roaanv commented Jun 13, 2022

When urllib.parse.parse_qs is called with a byte string, the caller can provide an encoding that is used to decode it. However, parse.py then re-encodes the parsed data with 'ascii', which breaks UTF-8 encoded URLs received as byte strings.

This change uses the encoding passed to parse_qs to re-encode the parsed data.

Caveat
This is probably not the correct solution, but it gets us closer to a working implementation in which, in the worst case, the caller of parse_qs can dictate the encoding.

My understanding of the problem is as follows.
parse_qs detects that qs is a byte string and decodes it according to the encoding parameter.
After parsing the decoded input, it re-encodes the result (because the input was a byte string), but it uses 'ascii' instead of the value of the encoding parameter. Decoding and encoding therefore use different codecs. This PR fixes that by using the encoding parameter value to re-encode the parsed data.
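A minimal sketch of the round trip described above. The outcome depends on the Python version: on interpreters that still re-encode with 'ascii' the call is expected to raise, so the snippet catches the error rather than assuming a single result.

```python
from urllib.parse import parse_qs

# Byte-string query containing a percent-encoded UTF-8 sequence.
raw = b"a=a%E2%80%99b"

try:
    result = parse_qs(raw, encoding="utf-8")
    outcome = "parsed"
except UnicodeError as exc:
    # Reached when the parsed text cannot be re-encoded to bytes.
    result = None
    outcome = f"failed: {exc}"

print(outcome, result)
```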

However, this is not a complete solution.
The problem is that there are, in essence, two encodings involved:
one used to encode/decode the byte string, and another used to encode/decode the URL (which is UTF-8 encoded).
Take b"a=a%E2%80%99b" as an example.
This is a valid ASCII-encoded byte string and can be decoded with an ASCII decoder.
However, after decoding and parsing, it no longer produces valid ASCII: parsing a%E2%80%99b yields a’b, which cannot be ASCII-encoded.
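The two layers can be separated by hand to make this concrete. The sketch below decodes the bytes itself and parses the resulting str, so it does not depend on how parse_qs treats bytes input; the variable names are only illustrative.

```python
from urllib.parse import parse_qs

raw = b"a=a%E2%80%99b"

# Layer 1: the byte string itself is pure ASCII, so this succeeds.
text = raw.decode("ascii")

# Layer 2: the percent-encoded payload is UTF-8 and decodes to U+2019.
value = parse_qs(text, encoding="utf-8")["a"][0]

# The parsed value is no longer representable in ASCII...
try:
    value.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# ...but it round-trips when re-encoded with the URL's encoding.
utf8_bytes = value.encode("utf-8")
```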

A possible solution would be to pass a reencoding parameter, but since it's unlikely callers will have different encoding and reencoding parameters, this PR opts for reusing the encoding parameter.
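For illustration only, here is a sketch of what a separate reencoding parameter could look like as a wrapper around the str code path. The helper name parse_qs_bytes and its reencoding argument are hypothetical; they are not part of urllib.parse and are not what this PR implements.

```python
from urllib.parse import parse_qs


def parse_qs_bytes(qs, encoding="utf-8", errors="replace", reencoding=None):
    """Hypothetical helper: parse a bytes query string and return bytes.

    `reencoding` defaults to `encoding`, which mirrors the choice
    made in this PR of reusing one parameter for both directions.
    """
    if reencoding is None:
        reencoding = encoding
    # Decode the bytes, then parse on the ordinary str code path.
    parsed = parse_qs(qs.decode(encoding), encoding=encoding, errors=errors)
    # Re-encode keys and values with the requested output encoding.
    return {
        key.encode(reencoding): [v.encode(reencoding) for v in values]
        for key, values in parsed.items()
    }
```

With the default arguments, parse_qs_bytes(b"a=a%E2%80%99b") returns the value as UTF-8 bytes; passing a different reencoding only matters in the rare case where the caller wants the output in another encoding than the input.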

cpython-cla-bot commented Jun 13, 2022

All commit authors signed the Contributor License Agreement.

@bedevere-bot

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

serhiy-storchaka self-requested a review February 21, 2024 14:09

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this pull request Feb 21, 2024:
urllib.parse functions parse_qs() and parse_qsl() now support bytes arguments containing raw and percent-encoded non-ASCII data.

@serhiy-storchaka (Member)

See also #115771, which also supports raw and percent-encoded byte sequences that are not decodable with the specified encoding.

@serhiy-storchaka (Member)

Thank you for your contribution, @roaanv, but the more universal #115771 has been merged instead.


3 participants