Make chardet/charset_normalizer optional dependencies #5875
Conversation
I think we're going to close this out given the outcome of the tracking issue. The long-term goal is to remove character detection from Requests. I'm not sure this solution is robust enough, and it's likely to introduce further unnecessary breakages at this point. We'll track progress on the original issue. Thanks!
@nateprewitt What is the tracking issue, please? #5871 has been "locked as off-topic", #5797 has been "locked as resolved". I'm curious what the perceived non-robustness here is, because I do believe we could just drop the default chardet/charset_normalizer dependency and most of our users would be none the wiser.

Encodings on the web

I wrote a program to get the top-level pages of the domains in the TRANCO list to figure out the lay of the land encoding-wise. I filtered to the top 10000 or so domains because of a lack of patience, and the results are as follows: out of 5938 pages, there are 10 pages (0.17%) where an encoding wasn't (correctly) declared.
For the rest, the following encodings were declared:
The top 3 encodings account for 99.07% of the pages analyzed. The key takeaway is that in this corpus, all text pages would be decodable by my proposed change.

UTF-8 misinterpretations

I also believe it's quite unlikely for the UTF-8 codec to successfully misinterpret a string. To that end, I wrote a small test bench that encodes random test sequences with random codecs and attempts to decode them as UTF-8, recording how many misinterpretations there are. After 2 million attempts, there were about 22663 misinterpretations (and 41038 correct interpretations!), and by far the top "misbehaving" encoding is UTF-7 (20669 out of 22663 misinterpretations), which is esoteric in its own right, but also very esoteric on the web: 0.012%. The other top misbehavers were UTF-16 and UTF-32, which are in the same esotericity league on the web. (None of these codecs appear in the TRANCO corpus.)
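For reference, a minimal sketch of the kind of test bench described above; the codec pool, the random-sample generator, and the handling of samples a codec cannot represent are my assumptions, not the original script:

```python
import random

# Assumed codec pool; the original bench's codec list is not shown.
CODECS = ["utf-8", "utf-7", "utf-16", "utf-32", "latin-1", "cp1251", "shift_jis", "gb2312"]

def random_text(rng, length=32):
    # Draw code points from the Basic Multilingual Plane, skipping surrogates.
    chars = []
    while len(chars) < length:
        cp = rng.randrange(0x20, 0x10000)
        if 0xD800 <= cp <= 0xDFFF:
            continue
        chars.append(chr(cp))
    return "".join(chars)

def run(attempts=2_000_000, seed=0):
    rng = random.Random(seed)
    correct = misinterpreted = failed = 0
    for _ in range(attempts):
        text = random_text(rng)
        codec = rng.choice(CODECS)
        try:
            payload = text.encode(codec)
        except UnicodeEncodeError:
            continue  # Not every random string is representable in every codec.
        try:
            decoded = payload.decode("utf-8")
        except UnicodeDecodeError:
            failed += 1
        else:
            if decoded == text:
                correct += 1
            else:
                misinterpreted += 1  # Decoded cleanly as UTF-8, but to the wrong text.
    return correct, misinterpreted, failed

if __name__ == "__main__":
    # Smaller run than the 2 million attempts mentioned above, for illustration.
    print(run(attempts=100_000))
```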
Hi @nateprewitt – sorry for the extra ping, but could you take a look at the above comment?
Hi @akx, the tracking issue is still #5871, as noted in the last comment, and the state of the change is in the first bullet point.
You'll find the research you've done is very similar to what was performed in the original issue by Apache [1][2]. There are, however, issues with this approach. The fact that you noticed a difference between chardet and charset-normalizer to begin with demonstrates that the sampling approach is insufficient. Trying to extrapolate behavior from the 10000 most used (and best behaved) websites to billions of others is extremely lossy. It also fails to account for usage within private networks, where Requests is routinely used for internal tooling. Our key takeaway from the charset-normalizer migration is that we cannot accurately gauge usage of this, and there is no "correct" answer for how we handle it. That's how we arrived at the position that the functionality just doesn't belong in Requests as a default. The best path forward is deferring to the end user for how they'd like to handle decoding.
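As an illustration, deferring to the end user could look roughly like this on the caller's side (a sketch only; the UTF-8-then-Latin-1 fallback is just one possible policy, not something prescribed by Requests):

```python
import requests

resp = requests.get("https://example.com/")

if resp.encoding:
    # Trust the encoding Requests derived from the Content-Type header.
    text = resp.content.decode(resp.encoding, errors="replace")
else:
    # No declared encoding: apply the caller's own policy instead of
    # relying on automatic character detection.
    try:
        text = resp.content.decode("utf-8")
    except UnicodeDecodeError:
        text = resp.content.decode("latin-1")  # Never raises, but may be wrong.
```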
This is a hard no from us. We spent months doing analysis on impact for the first change and very reluctantly made it due to the impact on the broader Apache ecosystem. Despite best efforts, it was still more visible than we'd have liked, which further ingrained that changes like this are not safe for a project of Requests' size. We won't intentionally make a more significant breaking change like this without a major version bump.
An implementation of #5871 for consideration.
Unlike my suggestion in #5871, this implementation prefers chardet/charset_normalizer if it's available, with the idea that those libraries should hopefully do a better job than trying to decode the payload in the two most likely encodings. Decoding ASCII and UTF-8 ("UTF-8 is used by 97.0% of all the websites whose character encoding we know.") will continue to work without those libraries, and a helpful error is raised in other cases.
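Roughly, the intended behavior is the following (an illustrative sketch, not the actual diff; the helper name and the exception type are placeholders):

```python
def guess_encoding(content: bytes) -> str:
    """Placeholder helper illustrating the decision order described above."""
    detector = None
    try:
        import charset_normalizer as detector
    except ImportError:
        try:
            import chardet as detector
        except ImportError:
            pass

    if detector is not None:
        # Prefer the detection library when one is installed; both expose a
        # chardet-style detect() returning a dict with an "encoding" key.
        return detector.detect(content)["encoding"] or "utf-8"

    # Without a detector, ASCII/UTF-8 payloads still decode fine.
    try:
        content.decode("utf-8")
    except UnicodeDecodeError:
        raise ValueError(
            "Cannot determine the response encoding; install charset_normalizer "
            "or chardet, or decode response.content manually."
        )
    return "utf-8"
```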
This is still missing documentation, changelogs, etc. – help appreciated.
Considering the switchover to charset_normalizer has broken some users' workflows anyway, I think it wouldn't be too much of an additional breaking change to do this without a major version bump.