Make chardet/charset_normalizer optional? #5871
Hi @akx. Hopefully, this change will benefit the people who actually depend on charset detection.
```
>>> import charset_normalizer
>>> charset_normalizer.detect(b'{}')
/home/tseaver/projects/agendaless/Google/src/python-cloud-core/.nox/unit-3-6/lib/python3.6/site-packages/charset_normalizer/api.py:95: UserWarning: Trying to detect encoding from a tiny portion of (2) byte(s).
  warn('Trying to detect encoding from a tiny portion of ({}) byte(s).'.format(length))
{'encoding': 'utf_16_be', 'language': '', 'confidence': 1.0}
>>> b'{}'.decode('utf_16_be')
'筽'
```

Note that:

```
>>> print(charset_normalizer.detect.__doc__)

    chardet legacy method
    Detect the encoding of the given byte string. It should be mostly backward-compatible.
    Encoding name will match Chardet own writing whenever possible. (Not on encoding name unsupported by it)
    This function is deprecated and should be used to migrate your project easily, consult the documentation for
    further information. Not planned for removal.

    :param byte_str: The byte sequence to examine.
```
Thanks @tseaver, I believe this is something we called out in initial testing. The same issue happened for large strings containing only numbers, which were seemingly randomly categorized as UTF-16. I was under the impression this had been resolved, though. @Ousret, can you take a look at this when you have a moment?
Hi @tseaver, thanks for the report. I have seen that you opened an issue in the `charset_normalizer` repository.
As I said, most people end up far more frustrated by this "feature" than helped by it.
The change causes issues on my side as well. For the moment I am using the workaround described in your release note. Will open an issue in the `charset_normalizer` repository.
I took the liberty of implementing a version of what I drafted out in the original post in PR #5875. Decoding ASCII and UTF-8 ("UTF-8 is used by 97.0% of all the websites whose character encoding we know.") will continue to work without those libraries, and a helpful error is raised in other cases.
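A sketch of the fallback behaviour described there (hypothetical helper name; the actual diff in PR #5875 may differ):

```python
def decode_body(content: bytes) -> str:
    # UTF-8 is a strict superset of ASCII, so one decode attempt covers
    # both cases that keep working without a detector library installed.
    try:
        return content.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise RuntimeError(
            "Cannot guess the response encoding: install chardet or "
            "charset_normalizer to enable charset detection."
        ) from exc
```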
Although it's solved, just wanted to mention that this is indeed a crucial mechanic, since results were consistent on Windows but not on Debian.
@a-maliarov can you explain what you mean by "worked consistently on Windows but not Debian"? I need a concrete case.
@Ousret hi, I think it would be better to post my specific issue within the `charset_normalizer` repository.
I am still not sure how it impacted me, but I had an encoding issue when upgrading requests to 2.26 with my PDF generation. As far as I can tell, the lib I am using (xhtml2pdf) does not use requests or chardet/charset_normalizer directly. The code is simply:

```
import io
from tempfile import NamedTemporaryFile

import xhtml2pdf.pisa

html = 'html unicode string'
input_file = io.BytesIO(html.encode('utf-8'))
temp_file = NamedTemporaryFile('wb', delete=False)
xhtml2pdf.pisa.CreatePDF(input_file, temp_file, encoding='utf8')
```

I'm not even sure how to find where or how the change to charset_normalizer impacted this. My solution for now has been to pin requests to 2.25.1.
You can simply install `chardet` and see whether that fixes it. If not, then it means that it was a different change in 2.26.0 that impacted you.
Thanks, it works with `chardet` installed.
Using charset-normalizer was done in a backwards-compatible way, to help the people who implicitly depend on the results of `chardet`. Simply put: if chardet is installed, it will be used instead of charset-normalizer.

It's a known "property" of charset-normalizer that it sometimes produces different results than chardet when the encoding is guessed from content. Both chardet and charset-normalizer use some kind of heuristics to determine the encoding, and they each use different optimisations and shortcuts to make this guess "fast". So when @ashb implemented the change, he thought about users like you who might somehow depend on the way chardet detects encodings: if chardet is installed, it will be used.

The gist of the change was that chardet should not be a "required" dependency because of the license it uses, so it is not mandatory for requests. But if you install it as an optional extra or manually, that's fine (and to keep backwards compatibility it will be used as an optional component). Thanks to that, the LGPL license (being optional) does not limit the users of requests in redistributing their code, nor their users in redistributing it further.
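A minimal sketch of the preference logic described above (an illustration, not necessarily requests' exact source):

```python
# Prefer chardet when it is importable; otherwise fall back to
# charset_normalizer, which exposes a chardet-compatible detect().
try:
    import chardet
except ImportError:
    import charset_normalizer as chardet


def guess_encoding(content: bytes) -> str:
    # Both libraries return a dict with an 'encoding' key.
    return chardet.detect(content)["encoding"]
```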
Yes, I understand all that; what I don't understand is why it impacts `xhtml2pdf`. Does it use `chardet` somewhere?

Thanks for taking the time to explain all of that.
@Gagaro
Ah. So requests is not the only one with an optional chardet dependency :)
Indeed, nice catch! So we actually depend on `chardet` indirectly.
I just upgraded from 2.25.1 to 2.26.0 and my logs now fill up with charset_normalizer lines, for example from a response carrying a Keycloak (Red Hat's OIDC/OAuth2 server) JWT token.
Just install `chardet`. And yeah, I think the requests maintainers want to remove both in the future.
```
> hass-cli template <(echo '{{ states.lock.front_door }}')
warning: Trying to detect encoding from a tiny portion of (4) byte(s).
ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
ascii should target any language(s) of ['Latin Based']
ascii is most likely the one. Stopping the process.
None
```
I also see performance degradation after moving from 2.25.1 to higher; this just bloats requests. So the first thing I do in my logging module is silence those lines:
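A minimal sketch of one way to do that, assuming charset_normalizer routes these messages through a logger named after the package:

```python
import logging

# charset_normalizer emits its probing chatter via the standard logging
# module; raising its logger's level hides those lines.
logging.getLogger("charset_normalizer").setLevel(logging.CRITICAL)
```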
@fenchu this isn't really relevant to this issue, and alternatives to not use charset_normalizer have already been provided. You've explicitly chosen the one API that provides this feature; reposting your grievance repeatedly isn't furthering the conversation here.

To recap for future readers: there are multiple ways to disable the use of character detection entirely, for example as sketched below.
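A minimal sketch of those options (assuming standard `requests` behaviour: `Response.text` only consults `apparent_encoding` when no encoding is set):

```python
import requests

resp = requests.get("https://example.com/api")  # hypothetical URL

# Option 1: declare the encoding yourself; .text will then never call
# apparent_encoding, which is what invokes chardet/charset_normalizer.
resp.encoding = "utf-8"
body = resp.text

# Option 2: bypass .text entirely and decode the raw bytes yourself.
body = resp.content.decode("utf-8")
```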
With a routine version bump of requirements, I noticed `chardet` had been switched out for `charset_normalizer` (which I had never heard of before) in #5797, apparently due to LGPL license concerns.

I agree with @sigmavirus24's comment #5797 (comment) that it's strange for something as central in the Python ecosystem as `requests` (45k stars, 8k forks, many contributors at the time of writing) to take on a hard dependency on such a relatively unknown and unproven library (132 stars, 5 forks, 2 contributors).

The release notes say you could use `pip install "requests[use_chardet_on_py3]"` to use `chardet` instead of `charset_normalizer`, but with that extra set, both libraries get installed.

I would imagine many users don't really need the charset detection features in Requests; could we open a discussion on making both `chardet`/`charset_normalizer` optional, à la `requests[chardet]` or `requests[charset_normalizer]`?

AFAICS, the only place where `chardet` is actually used in `requests` is `Response.apparent_encoding`, which is used by `Response.text` when there is no determined encoding.

Maybe `apparent_encoding` could check whether `chardet` or `charset_normalizer` is installed, and if neither is, warn the user ("No encoding detection library is installed. Falling back to XXXX. Please see YYYY for instructions" or somesuch) and return e.g. `ascii`.
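A sketch of that proposal (a hypothetical function, assuming both libraries' chardet-style `detect()` API; not an actual patch):

```python
import warnings


def apparent_encoding(content: bytes) -> str:
    """Best-effort encoding guess with a graceful no-detector fallback."""
    try:
        import chardet as detector
    except ImportError:
        try:
            import charset_normalizer as detector
        except ImportError:
            warnings.warn(
                "No encoding detection library is installed. "
                "Falling back to ascii. Install chardet or "
                "charset_normalizer to enable detection."
            )
            return "ascii"
    # detect() may report None when it cannot decide; fall back then too.
    return detector.detect(content)["encoding"] or "ascii"
```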