Make chardet/charset_normalizer optional? #5871
Hi @akx. Hopefully, this change will benefit the people who actually depend on charset detection.
```
>>> import charset_normalizer
>>> charset_normalizer.detect(b'{}')
/home/tseaver/projects/agendaless/Google/src/python-cloud-core/.nox/unit-3-6/lib/python3.6/site-packages/charset_normalizer/api.py:95: UserWarning: Trying to detect encoding from a tiny portion of (2) byte(s).
  warn('Trying to detect encoding from a tiny portion of ({}) byte(s).'.format(length))
{'encoding': 'utf_16_be', 'language': '', 'confidence': 1.0}
>>> b'{}'.decode('utf_16_be')
'筽'
```

Note that:

```
>>> print(charset_normalizer.detect.__doc__)

    chardet legacy method
    Detect the encoding of the given byte string. It should be mostly backward-compatible.
    Encoding name will match Chardet own writing whenever possible. (Not on encoding name unsupported by it)
    This function is deprecated and should be used to migrate your project easily, consult the documentation for
    further information. Not planned for removal.

    :param byte_str: The byte sequence to examine.
```
Thanks @tseaver, I believe this is something we called out in initial testing. The same issue happened for large strings containing only numbers, which were seemingly randomly categorized as UTF-16. I was under the impression this had been resolved, though. @Ousret, can you take a look at this when you have a moment?
Hi @tseaver, thanks for the report. I have seen that you opened an issue in the `charset_normalizer` repository.
As I said, most people end up far more frustrated by this "feature" than helped by it.
The change causes issues on my side as well. For the moment I am using the workaround described in your release note. Will open an issue in the `charset_normalizer` repository.
I took the liberty of implementing a version of what I drafted out in the original post in PR #5875. Decoding ASCII and UTF-8 ("UTF-8 is used by 97.0% of all the websites whose character encoding we know.") will continue to work without those libraries, and a helpful error is raised in other cases.
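A sketch of the fallback behaviour described there (hypothetical helper name; the actual diff in PR #5875 may differ):

```python
def decode_body(content: bytes) -> str:
    # UTF-8 is a strict superset of ASCII, so one decode attempt covers
    # both cases that keep working without a detector library installed.
    try:
        return content.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise RuntimeError(
            "Cannot guess the response encoding: install chardet or "
            "charset_normalizer to enable charset detection."
        ) from exc
```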
Although it's solved, just wanted to mention that this is indeed a crucial mechanic, since results were consistent on Windows but not on Debian.
@a-maliarov can you explain what you mean by "worked consistently on Windows but not Debian"? I need a concrete case.
@Ousret hi, I think it would be better to post my specific issue within the `charset_normalizer` repository.
I am still not sure how it impacted me, but I had an encoding issue when upgrading requests to 2.26 with my PDF generation. As far as I can tell, the lib I am using (xhtml2pdf) does not use requests or chardet/charset_normalizer directly. The code is simply:

```
import io
from tempfile import NamedTemporaryFile

import xhtml2pdf.pisa

html = 'html unicode string'
input_file = io.BytesIO(html.encode('utf-8'))
temp_file = NamedTemporaryFile('wb', delete=False)
xhtml2pdf.pisa.CreatePDF(input_file, temp_file, encoding='utf8')
```

I'm not even sure how to find where or how the change to charset_normalizer impacted this. My solution for now has been to pin requests to 2.25.1.
You can simply install `chardet` and see whether that fixes it. If not, then it means that it was a different change in 2.26.0 that impacted you.
Thanks, it works with `chardet` installed.
Using charset-normalizer was done in a backwards-compatible way, to help the people who implicitly depend on the results of `chardet`. Simply put: if chardet is installed, it will be used instead of charset-normalizer.

It's a known "property" of charset-normalizer that it sometimes produces different results than chardet when the encoding is guessed from content. Both chardet and charset-normalizer use some kind of heuristics to determine the encoding, and they each use different optimisations and shortcuts to make this guess "fast". So when @ashb implemented the change, he thought about users like you who might somehow depend on the way chardet detects encodings: if chardet is installed, it will be used.

The gist of the change was that chardet should not be a "required" dependency because of the license it uses, so it is not mandatory for requests. But if you install it as an optional extra or manually, that's fine (and to keep backwards compatibility it will be used as an optional component). Thanks to that, the LGPL license (being optional) does not limit the users of requests in redistributing their code, nor their users in redistributing it further.
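A minimal sketch of the preference logic described above (an illustration, not necessarily requests' exact source):

```python
# Prefer chardet when it is importable; otherwise fall back to
# charset_normalizer, which exposes a chardet-compatible detect().
try:
    import chardet
except ImportError:
    import charset_normalizer as chardet


def guess_encoding(content: bytes) -> str:
    # Both libraries return a dict with an 'encoding' key.
    return chardet.detect(content)["encoding"]
```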
Yes, I understand all that; what I don't understand is why it impacts `xhtml2pdf`. Does it use `chardet` somewhere?

Thanks for taking the time to explain all of that.
@Gagaro
Ah. So requests is not the only one with an optional chardet dependency :)
Indeed, nice catch! So we actually depend on `chardet` indirectly.
I just upgraded from 2.25.1 to 2.26.0 and my logs now fill up with charset_normalizer lines, for example from a response carrying a Keycloak (Red Hat's OIDC/OAuth2 server) JWT token.
Just install `chardet`. And yeah, I think the requests maintainers want to remove both in the future.
```
> hass-cli template <(echo '{{ states.lock.front_door }}')
warning: Trying to detect encoding from a tiny portion of (4) byte(s).
ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
ascii should target any language(s) of ['Latin Based']
ascii is most likely the one. Stopping the process.
None
```
I also see performance degradation after moving from 2.25.1 to higher; this just bloats requests. So the first thing I do in my logging module is silence those lines:
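A minimal sketch of one way to do that, assuming charset_normalizer routes these messages through a logger named after the package:

```python
import logging

# charset_normalizer emits its probing chatter via the standard logging
# module; raising its logger's level hides those lines.
logging.getLogger("charset_normalizer").setLevel(logging.CRITICAL)
```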
@fenchu this isn't really relevant to this issue, and alternatives to not use charset_normalizer have already been provided. You've explicitly chosen the one API that provides this feature; reposting your grievance repeatedly isn't furthering the conversation here.

To recap for future readers: there are multiple ways to disable the use of character detection entirely, for example as sketched below.
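A minimal sketch of those options (assuming standard `requests` behaviour: `Response.text` only consults `apparent_encoding` when no encoding is set):

```python
import requests

resp = requests.get("https://example.com/api")  # hypothetical URL

# Option 1: declare the encoding yourself; .text will then never call
# apparent_encoding, which is what invokes chardet/charset_normalizer.
resp.encoding = "utf-8"
body = resp.text

# Option 2: bypass .text entirely and decode the raw bytes yourself.
body = resp.content.decode("utf-8")
```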
With a routine version bump of requirements, I noticed `chardet` had been switched out for `charset_normalizer` (which I had never heard of before) in #5797, apparently due to LGPL license concerns.

I agree with @sigmavirus24's comment #5797 (comment) that it's strange for something as central in the Python ecosystem as `requests` (45k stars, 8k forks, many contributors at the time of writing) to take on a hard dependency on such a relatively unknown and unproven library (132 stars, 5 forks, 2 contributors).

The release notes say you could use `pip install "requests[use_chardet_on_py3]"` to use `chardet` instead of `charset_normalizer`, but with that extra set, both libraries get installed.

I would imagine many users don't really need the charset detection features in Requests; could we open a discussion on making both `chardet`/`charset_normalizer` optional, à la `requests[chardet]` or `requests[charset_normalizer]`?

AFAICS, the only place where `chardet` is actually used in `requests` is `Response.apparent_encoding`, which is used by `Response.text` when there is no determined encoding.

Maybe `apparent_encoding` could check whether `chardet` or `charset_normalizer` is installed, and if neither is, warn the user ("No encoding detection library is installed. Falling back to XXXX. Please see YYYY for instructions" or somesuch) and return e.g. `ascii`.
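A sketch of that proposal (a hypothetical function, assuming both libraries' chardet-style `detect()` API; not an actual patch):

```python
import warnings


def apparent_encoding(content: bytes) -> str:
    """Best-effort encoding guess with a graceful no-detector fallback."""
    try:
        import chardet as detector
    except ImportError:
        try:
            import charset_normalizer as detector
        except ImportError:
            warnings.warn(
                "No encoding detection library is installed. "
                "Falling back to ascii. Install chardet or "
                "charset_normalizer to enable detection."
            )
            return "ascii"
    # detect() may report None when it cannot decide; fall back then too.
    return detector.detect(content)["encoding"] or "ascii"
```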