Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make chardet/charset_normalizer optional dependencies #5875

Closed
wants to merge 4 commits into from

Conversation

akx
Copy link

@akx akx commented Jul 15, 2021

An implementation of #5871 for consideration.

Unlike my suggestion in #5871, this implementation prefers chardet / charset_normalizer if it's available, with the idea that those libraries should hopefully do a better job than trying to decode the payload in the two most likely encodings.

Decoding ASCII and UTF-8 ("UTF-8 is used by 97.0% of all the websites whose character encoding we know.") will continue to work without those libraries, and a helpful error is raised in other cases.

This is still missing documentation, changelogs, etc. – help appreciated.

Considering the switchover to charset_normalizer has broken some users' workflows anyway, I think it wouldn't be too much of an additional breaking change to do this without a major version bump.

@akx akx marked this pull request as ready for review July 26, 2021 14:36
@nateprewitt nateprewitt changed the base branch from master to main January 3, 2022 15:04
@nateprewitt
Copy link
Member

I think we're going to close this out given the outcome of tracking issue. The long term goal is to remove character detection from Requests. I'm not sure this solution is robust enough and is likely to introduce further unnecessary breakages at this point. We'll track progress on the original issue. Thanks!

@akx
Copy link
Author

akx commented May 11, 2022

@nateprewitt What is the tracking issue, please? #5871 has been "locked as off-topic", #5797 has been "locked as resolved".

I'm curious what the perceived non-robustness here is, because I do believe we could just drop the default chardet/charset_normalizer dependency and most of our users would be none the wiser.

Encodings on the web

I wrote a program to get the top-level pages of the domains in the TRANCO list to figure out the lay of the land encoding-wise. I filtered to the top 10000 or so domains because of a lack of patience, and the results are as follows:

Out of 5938 pages, there are 10 pages (0.17%) where an encoding wasn't (correctly) declared.
With binary responses filtered out, there are 7 such pages (0.12%):

URL Content-Type Remarks
http://micpn.com/ (no content-type) ASCII error page
https://grubhub.com/ (no content-type) ASCII error page telling me Grubhub doesn't do Finland
https://mob.com/ (no content-type) utf-8 HTML
https://sberbank.ru/ text/html; charset=utf-8, text/html malformed content-type header, content utf-8
https://sbrf.ru/ text/html; charset=utf-8, text/html malformed content-type header, content utf-8
https://unicode.org/ (no content-type) (ironically enough) serves an ASCII redirection page to home.unicode.org
https://www.fuse-cloud.com/ (no content-type) returns invalid request

For the rest, the following encodings were declared (via fallback meaning gleaned with requests' internal fallback for text:

Encoding Count %
utf-8 4918 82.82%
iso-8859-1 (via fallback) 938 15.80%
iso-8859-1 27 0.45%
windows-1251 18 0.30%
gbk 9 0.15%
euc-jp 6 0.10%
gb2312 4 0.07%
iso-8859-2 4 0.07%
iso-8859-15 3 0.05%
shift_jis 3 0.05%
euc-kr 2 0.03%
us-ascii 1 0.02%
windows-1250 1 0.02%
windows-1256 1 0.02%
utf-8 (via fallback) 1 0.02%
windows-31j 1 0.02%
big-5 1 0.02%

The top 3 encodings account for 99.07% of the pages analyzed.

The key takeaway that is In this corpus, all text pages will be decodable by my proposed .text implementation, either by way of the explicit .encoding read from headers, or implied UTF-8. And again, should decoding fail, the error message guides the user to install either detector package and to try again.

UTF-8 misinterpretations

I also believe it's quite unlikely for the UTF-8 codec to successfully misinterpret a string.

To that end, I wrote a small test bench that encodes random test sequences with random codecs and attempts to decode them as UTF-8, recording how many misinterpretations there are.

After 2 million attempts, there were about 22663 misinterpretations (and 41038 correct interpretations!), and by far the top "misbehaving" encoding is UTF-7 (20669 out of 22663 misinterpretations), which is esoteric in its own right, but also very esoteric on the web: 0.012%. The other top misbehavers were UTF-16 and UTF-32, which are in the same esotericity league on the web. (None of these codecs appear in the TRANCO corpus.)

@akx
Copy link
Author

akx commented May 30, 2022

Hi @nateprewitt – sorry for the extra ping, but could you take a look at the above comment?

@nateprewitt
Copy link
Member

Hi @akx,

The tracking issue is still #5871 as noted in the last comment and our state of the change is in the first bullet point.

Yes, it [character detection] should be optional. We want to remove character detection entirely. It doesn't belong in Requests. This won't happen until we major version bump.

You'll find the research you've done is very similar to what was performed in the original issue by Apache [1][2]. There are however issues with this approach.

The fact you noticed a difference between chardet and charset-normalizer to begin with demonstrates the sampling approach is insufficient. Trying to extrapolate behavior across billions of websites to the 10000 most used (and best behaved) is extremely lossy. It's also failing to account for usage within private networks where Requests is routinely used for internal tooling.

Our key takeaway from the charset-normalizer migration is we cannot accurately gauge usage on this and there is no "correct" answer on how we handle it. That's where we arrived at the functionality just doesn't belong in Requests as a default. The best path forward is deferring to the end user for how they'd like to handle decoding. text may eventually choose to fallback to utf-8/ascii, but that's not a change we're looking to make at this time.

And again, should decoding fail, the error message guides the user to install either detector package and to try again.

This is a hard no from us. We spent months doing analysis on impact for the first change and very reluctantly made it due to the impact to the broader Apache ecosystem. Despite best efforts, it was still more visible that we'd have liked and further ingrained these changes are not safe for a project of Requests size. We won't intentionally make a more significant breaking change like this without a major version bump.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 31, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants