gh-74668: Fix encoded unicode in url byte string #93757

roaanv · 2022-06-13T05:30:13Z

When urllib.parse.parse_qs is called with a byte string, an encoding can be provided which is used to decode the byte string. However, when parsing, parse.py uses 'ascii' encoding to re-encode the parsed data. This breaks utf-8 encoded URLs received as byte strings.

This change uses the encoding passed to parse_qs to re-encode the parsed data.

Caveat
This is probably not the correct solution, but gets us closer to a working implementation that, at worst case, can be dictated by the caller of parse_qs.

My understanding of the problem is as follows.
parse_qs detects the qs is a byte string and decodes it according to the encoding parameter.
After parsing the decoded input, it then re-encodes it (because it detected the input was a byte string), but instead of using the value of the encoding parameter, it uses 'ascii'. The decoding and encoding thus uses different encoders. This PR fixes that, in that it uses the encoding parameter value to re-encode the parsed data.

However, this is not a complete solution.
The problem is that there are in essence 2 encodings involved.
One encoder used to encode/decode the byte string and another to encode/decode the URL (which is utf-8 encoded).
Take b"a=a%E2%80%99b" as an example.
This is a valid ascii encoded byte string and can be decoded with an ascii decoder.
However, after decoding and parsing, it will not produce valid ascii. I.e. parsing of a%E2%80%99b will produce a’b which can not be ascii encoded.

A possible solution would be to pass a reencoding parameter, but since it's unlikely callers will have different encoding and reencoding parameters, this PR opts for reusing the encoding parameter.

Issue: urllib.parse.parse_qsl does not handle unicode data properly #74668

cpython-cla-bot · 2022-06-13T05:30:15Z

All commit authors signed the Contributor License Agreement.

bedevere-bot · 2022-06-13T05:30:16Z

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

urllib.parse functions parse_qs() and parse_qsl() now support bytes arguments containing raw and percent-encoded non-ASCII data.

serhiy-storchaka · 2024-02-21T16:18:22Z

See also #115771 which supports also raw and percent-encoded bytes sequences not decodable with the specified encoding.

serhiy-storchaka · 2024-03-05T15:59:37Z

Thank you for your contribution @roaanv, but more universal #115771 has been merged instead.

pythongh-74668: Fix encoded unicode in url byte string

6b942a6

bedevere-bot added the awaiting review label Jun 13, 2022

blurb-it bot and others added 3 commits June 13, 2022 05:32

📜🤖 Added by blurb_it.

6078fc5

Updated test scenarios to correctly illustrate parsing encoded urls

15d1a2a

Merge branch 'main' into fix-issue-74668

3bbf583

bedevere-app bot mentioned this pull request Feb 21, 2024

urllib.parse.parse_qsl does not handle unicode data properly #74668

Closed

serhiy-storchaka self-requested a review February 21, 2024 14:09

serhiy-storchaka closed this Mar 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gh-74668: Fix encoded unicode in url byte string #93757

gh-74668: Fix encoded unicode in url byte string #93757

roaanv commented Jun 13, 2022 •

edited by bedevere-app bot

Loading

cpython-cla-bot bot commented Jun 13, 2022 •

edited

Loading

bedevere-bot commented Jun 13, 2022

serhiy-storchaka commented Feb 21, 2024

serhiy-storchaka commented Mar 5, 2024

gh-74668: Fix encoded unicode in url byte string #93757

gh-74668: Fix encoded unicode in url byte string #93757

Conversation

roaanv commented Jun 13, 2022 • edited by bedevere-app bot Loading

cpython-cla-bot bot commented Jun 13, 2022 • edited Loading

bedevere-bot commented Jun 13, 2022

serhiy-storchaka commented Feb 21, 2024

serhiy-storchaka commented Mar 5, 2024

roaanv commented Jun 13, 2022 •

edited by bedevere-app bot

Loading

cpython-cla-bot bot commented Jun 13, 2022 •

edited

Loading