Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make chardet/charset_normalizer optional? #5871

Open
akx opened this issue Jul 14, 2021 · 23 comments
Open

Make chardet/charset_normalizer optional? #5871

akx opened this issue Jul 14, 2021 · 23 comments

Comments

@akx
Copy link

akx commented Jul 14, 2021

With a routine version bump of requirements, I noticed chardet had been switched out for charset_normalizer (which I had never heard of before) in #5797, apparently due to LGPL license concerns.

I agree with @sigmavirus24's comment #5797 (comment) that it's strange for something as central in the Python ecosystem as requests is (45k stars, 8k forks, many contributors at the time of writing) to switch to such a relatively unknown and unproven library (132 stars, 5 forks, 2 contributors) for a hard dependency in something as central in the Python ecosystem as requests is.

The release notes say you could use pip install "requests[use_chardet_on_py3]" to use chardet instead of charset_normalizer, but with that extra set both libraries get installed.

I would imagine many users don't really necessarily need the charset detection features in Requests; could we open a discussion on making both chardet/charset_normalizer optional, á la requests[chardet] or requests[charset_normalizer]?

AFAICS, the only place where chardet is actually used in requests is Response.apparent_encoding, which is used by Response.text when there is no determined encoding.

Maybe apparent_encoding could try to

  1. as a built-in first attempt, try decoding the content as UTF-8 (which would likely be successful for many cases)
  2. if neither chardet or charset_normalizer is installed, warn the user ("No encoding detection library is installed. Falling back to XXXX. Please see YYYY for instructions" or somesuch) and return e.g. ascii
  3. use either chardet library as per usual
@sigmavirus24
Copy link
Contributor

apparent_encoding genuinely just needs to go away. That can't be done until a major release. Once that happens, we don't need dependencies on either library

@Ousret
Copy link
Contributor

Ousret commented Jul 14, 2021

Hi @akx

Hopefully, this change will benefit the people who actually depend on charset detection.

@tseaver
Copy link

tseaver commented Jul 14, 2021

charset_normalizer cannot correctly handle some very normal entries, e.g., an empty JSON response:

>>> import charset_normalizer
>>> charset_normalizer.detect(b'{}')
/home/tseaver/projects/agendaless/Google/src/python-cloud-core/.nox/unit-3-6/lib/python3.6/site-packages/charset_normalizer/api.py:95: UserWarning: Trying to detect encoding from a tiny portion of (2) byte(s).
  warn('Trying to detect encoding from a tiny portion of ({}) byte(s).'.format(length))
{'encoding': 'utf_16_be', 'language': '', 'confidence': 1.0}
>>> b'{}'.decode('utf_16_be')
'筽'

Note that charset_normalizer.detect is actually deprecated:

>>> print(charset_normalizer.detect.__doc__)

    chardet legacy method
    Detect the encoding of the given byte string. It should be mostly backward-compatible.
    Encoding name will match Chardet own writing whenever possible. (Not on encoding name unsupported by it)
    This function is deprecated and should be used to migrate your project easily, consult the documentation for
    further information. Not planned for removal.

    :param byte_str:     The byte sequence to examine.
    

@nateprewitt
Copy link
Member

Thanks @tseaver, I believe this is something we'd called out in initial testing. The same issue happened for large strings containing only numbers but was seemingly randomly categorized as utf-16. I was under the impression this had been resolved though.

@Ousret can you take a look at this when you have a moment?

@Ousret
Copy link
Contributor

Ousret commented Jul 14, 2021

Hi @tseaver

Thanks for the report, I have seen that you opened an issue in charset_normalizer. I will pursue this in the issue.
@nateprewitt of course.

@sigmavirus24
Copy link
Contributor

Hopefully, this change will benefit the people who actually depend on charset detection.

As I said, most people end up far more frustrated by this "feature" than helped by it.

@grindsa
Copy link

grindsa commented Jul 15, 2021

The change causes issues on my side as well. Seems charset_normalizer does not detect below ascii string correctly (even when using the newest 2.0.2 version)

>>> rawdata = b'g4UsPJdfzNkGW2jwmKDGDilKGKYtpF2X.mx3MaTWL1tL7CNn5U7DeCcodKX7S3lwwJPKNjBT8etY'

>>> import charset_normalizer
>>> detected_cn = charset_normalizer.detect(rawdata)
>>> detected_cn
{'encoding': 'utf_16_le', 'language': '', 'confidence': 1.0}

>>> import chardet
>>> detected_cd = chardet.detect(rawdata)
>>> print(detected_cd)
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
>>>

For the moment I am using the workaround described in your release note. Will open an issue in charset_normalizer.

@akx
Copy link
Author

akx commented Jul 15, 2021

I took the liberty of implementing a version of what I drafted out in the original post in PR #5875.

Decoding ASCII and UTF-8 ("UTF-8 is used by 97.0% of all the websites whose character encoding we know.") will continue to work without those libraries, and a helpful error is raised in other cases.

@a-maliarov
Copy link

Although it's solved, just wanted to mention that this is indeed a crucial mechanic, since charset_normalizer is still "young". For example, it doesn't work properly on Debian in some cases, while is quite consistent on Windows.

@Ousret
Copy link
Contributor

Ousret commented Sep 6, 2021

@a-maliarov can you explain yourself with the ‘work consistent on Windows but not Debian’ ? Need concrete case.

@a-maliarov
Copy link

@Ousret hi, I think it would be better to post my specific issue within charset_normalizer's repository. Will do later.

@Gagaro
Copy link

Gagaro commented Sep 22, 2021

I am still not sure how it impacted me, but I had an encoding issue when upgrading requests to 2.26 with my pdf generation. As far as I can tell, the lib I am using (xhtml2pdf) does not use requests or chardet/charset_normalizer directly.

The code is simply:

html = 'html unicode string'
input_file = io.BytesIO(html.encode('utf-8'))
temp_file = NamedTemporaryFile('wb', delete=False)
xhtml2pdf.pisa.CreatePDF(input_file, temp_file, encoding='utf8')

I'm not even sure how to find where or how the change to charset_normalizer impacted this. My solution for now has been to pin requests to 2.25.1.

@potiuk
Copy link
Contributor

potiuk commented Sep 22, 2021

I'm not even sure how to find where or how the change to charset_normalizer impacted this. My solution for now has been to pin requests to 2.25.1.

You can simply install chardet @Gagaro and it will be fully backwards-compatible (or use the right extra when installing requests)
In your requirements add requests as requests[use_chardet_on_py3]==2.26.0 or install it with pip install "requests[use_chardet_on_py3]==2.26.0" - this should solve the problem entirely.

If not, then it means that it was a different chnage in 2.26.0 that impacted you.

@Gagaro
Copy link

Gagaro commented Sep 22, 2021

Thanks, it works with [use_chardet_on_py3]. I still don't see how this sub-dependency can have this impact.

@potiuk
Copy link
Contributor

potiuk commented Sep 22, 2021

Thanks, it works with [use_chardet_on_py3]. I still don't see how this sub-dependency can have this impact.

Using the charset-normalizer was done in backwards-compatible way to give people who implicitly depend on results of chardet an easy way to switch back to use chardet.

Simply - if chardet is installed, it will be used instead of charset-normalizer.

It's known "property" of charset-normalizer that it sometimes might produce different results than chardet when encoding is guessed from content. Both Chardet and Charset-normalizer use some kind of heuristics to determine that and they both use different optimisations and shortcuts to make this guess "fast".

So when @ashb implemented the change he thought about users like you who might somehow depend on the way how chardet detects it and if chardet is installed, it will be used. The Gist of the change was that chardet should not be a "required" dependency because of the licencing is used. So it's not mandatory for requests. But if you install it as an optional extra or manually, that's fine (and to keep backwards compatibiliy it will be used as optional component). Thanks to that, the LGPL license (as being optional) does not limit the users of requests in redistributing their code and their users to redistribute it further.

@Gagaro
Copy link

Gagaro commented Sep 22, 2021

Yes I understand all that, what I don't is why does it impact xhtml2pdf which does not depend on requests / chardet / charset-normalizer :

xhtml2pdf==0.2.4
  - html5lib [required: >=1.0, installed: 1.1]
    - six [required: >=1.9, installed: 1.15.0]
    - webencodings [required: Any, installed: 0.5.1]
  - Pillow [required: Any, installed: 7.2.0]
  - pyPdf2 [required: Any, installed: 1.26.0]
  - reportlab [required: >=3.0, installed: 3.5.46]
    - pillow [required: >=4.0.0, installed: 7.2.0]
  - six [required: Any, installed: 1.15.0]

Do chardet or charset-normalizer modify some things globally which could explain the side effects?

Thanks for taking the time to explain all of that.

@akx
Copy link
Author

akx commented Sep 22, 2021

@Gagaro html5lib optionally requires chardet and likely behaves differently if it's not installed.

https://github.com/html5lib/html5lib-python/blob/f7cab6f019ce94a1ec0192b6ff29aaebaf10b50d/requirements-optional.txt#L7-L9

@potiuk
Copy link
Contributor

potiuk commented Sep 22, 2021

@Gagaro html5lib optionally requires chardet and likely behaves differently if it's not installed.

Ah. So requests is not the only one with optional chardet dependency :)

@Gagaro
Copy link

Gagaro commented Sep 22, 2021

Indeed, nice catch! So we actually depend on chardet without even knowing it 😅 .

@fenchu
Copy link

fenchu commented Nov 26, 2021

I just upgraded from 2.25.1 to 2.26.0 and my logs now fill up with charset_normalizer lines,
Tons of them, I can fix it if I add explicit headers like 'Content-Type': 'application/scim+json; charset=utf-8' to rest-apis and requests, but what about all the rest-apis out there without charset in characters, what is the workaround here.
I really do not see any reason for chardet or charset_normalizer at all in requests, just crash if we fail to decode so we can fix it.
it is just bloating the library, My initial tests also show that it is slower than previous version when using it with jwt tokens and base64 encoding.

Below is an excerp from a response from a keycloak(redhats OIDC OAUTH2 server) jwt token:

20211126154014.190|WARNING|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:104|override steps (5) and chunk_size (512) as content does not fit (192 byte(s) given) parameters.
20211126154014.203|WARNING|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:104|override steps (5) and chunk_size (512) as content does not fit (192 byte(s) given) parameters.
20211126154014.245|WARNING|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:104|override steps (5) and chunk_size (512) as content does not fit (316 byte(s) given) parameters.
20211126154015.964|WARNING|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:104|override steps (5) and chunk_size (512) as content does not fit (3 byte(s) 
given) parameters.

@potiuk
Copy link
Contributor

potiuk commented Nov 26, 2021

Just install chardet.

And yeah. I think requests maintainer want to remove both in the future.

joshuaspence added a commit to joshuaspence/dotfiles that referenced this issue Dec 14, 2021
```
> hass-cli template <(echo '{{ states.lock.front_door }}')
warning: Trying to detect encoding from a tiny portion of (4) byte(s).
ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
ascii should target any language(s) of ['Latin Based']
ascii is most likely the one. Stopping the process.
None
```
@fenchu
Copy link

fenchu commented Jan 13, 2022

I also see performance degradation after moving from 2.25.1 to higher, this just bloats requests.

with requests==2.27.1 I is now even worse, info in this module is like DEBUG.

This call resulting in this have the following header:

    headers = {
        'Accept': 'application/scim+json',
        'Content-Type': 'application/scim+json; charset=utf-8',
        'Authorization': f"Bearer {access_token}"
    }
20220113145635.412|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:376|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20220113145635.415|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:430|ascii is most likely the one. Stopping the process.
20220113145635.416|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:376|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20220113145635.417|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:430|ascii is most likely the one. Stopping the process.
20220113145635.538|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:376|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20220113145635.541|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:430|ascii is most likely the one. Stopping the process.
20220113145636.303|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:376|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20220113145636.304|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:430|ascii is most likely the one. Stopping the process.
20220113145636.307|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:376|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20220113145636.308|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:430|ascii is most likely the one. Stopping the process.
20220113145636.373|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:376|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20220113145636.373|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:430|ascii is most likely the one. Stopping the process.

So the first thing I do in my logging module is:

logging.getLogger('charset_normalizer').disabled = True

@nateprewitt
Copy link
Member

nateprewitt commented Jan 13, 2022

@fenchu this isn't really relevant to this issue and alternatives to not use charset_normalizer have already been provided. There are also multiple ways to disable the use of character detection entirely. You've explicitly chosen the one API that provides this feature. Reposting your grievance repeatedly isn't furthering the conversation here.

So to recap the thread for future readers:

  1. Yes, it should be optional. We want to remove character detection entirely. It doesn't belong in Requests. This won't happen until we major version bump.
  2. Installing chardet manually or using requests[use_chardet_on_py3] will have the exact same functionality in 2.27.1 as it did in 2.25.1.
  3. content and iter_content both expose the data returned by the response without any attempted character encoding. If you don't think the character encoding is correct and don't want to use an auto-encoder, you can either manually encode the content bytes or set the encoding attribute on the Response before calling .text.

@psf psf locked as off-topic and limited conversation to collaborators Jan 13, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

10 participants