`from_bytes` vs CLI results #180

oleksandr-kuzmenko · 2022-04-27T20:03:00Z

Describe the bug
The encoding detected by the from_bytes method differs from the one reported by the CLI for the same file.

To Reproduce
The repository with full example: https://github.com/oleksandr-kuzmenko/charset-normalizer-test

Desktop:

OS: macOS 12.3
Python version: 3.9.12
Package version: 2.0.12

Ousret · 2022-04-28T04:42:11Z

Thanks for the detailed issue report.
I can confirm there is something wrong.

Debugging

from charset_normalizer import from_path

if __name__ == "__main__":
    results = from_path("./file.xml", cp_isolation=['cp1251', 'mac_greek'], explain=True)

2022-04-28 06:30:31,635 | Level 5 | cp_isolation is set. use this flag for debugging purpose. limited list of encoding allowed : cp1251, mac_greek.
2022-04-28 06:30:31,635 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1251.
2022-04-28 06:30:31,644 | Level 5 | cp1251 passed initial chaos probing. Mean measured chaos is 13.780000 %
2022-04-28 06:30:31,646 | Level 5 | cp1251 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2022-04-28 06:30:31,650 | Level 5 | We detected language [('Russian', 0.9138), ('Ukrainian', 0.8923), ('Bulgarian', 0.8372), ('Serbian', 0.748)] using cp1251
2022-04-28 06:30:31,660 | Level 5 | mac_greek passed initial chaos probing. Mean measured chaos is 10.140000 %
2022-04-28 06:30:31,660 | Level 5 | mac_greek should target any language(s) of ['Greek']
2022-04-28 06:30:31,663 | Level 5 | We detected language [('Greek', 0.7196)] using mac_greek
2022-04-28 06:30:31,664 | DEBUG | Encoding detection: Found mac_greek as plausible (best-candidate) for content. With 1 alternatives.

Default diverging between CLI and code

If you look at,
https://github.com/Ousret/charset_normalizer/blob/master/charset_normalizer/cli/normalizer.py#L114

The default assigned to threshold there is 0.1, while it is 0.2 in all charset_normalizer.api functions.

Manually setting it to 0.2 fixes the difference. I can accept a PR that aligns the CLI default threshold with the current 0.2.
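
Until the defaults are aligned, one way to avoid depending on either default is to pass the threshold explicitly. The following is a minimal sketch, not part of the library: the helper name detect_encoding and the sample text are assumptions made for illustration, while threshold is the from_bytes keyword discussed above.

```python
from typing import Optional

from charset_normalizer import from_bytes


def detect_encoding(payload: bytes, threshold: float = 0.2) -> Optional[str]:
    """Return the best-guess encoding name, or None if nothing plausible was found.

    `detect_encoding` is a hypothetical helper written for this example.
    """
    # Pass the threshold explicitly instead of relying on a default value.
    best = from_bytes(payload, threshold=threshold).best()
    return best.encoding if best is not None else None


if __name__ == "__main__":
    # Illustrative sample only; this is not the file from the report.
    sample = "Привет, мир! Это простой текст на русском языке.".encode("utf-8")
    print(detect_encoding(sample, threshold=0.2))
```

On the CLI side, the same value can be set with the --threshold option, so both entry points can be run with identical settings when comparing results.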

Why does it say it's mac_greek

Simply put, the mess detector does not like ФИОРуководителяОрганизации even though it is certainly valid.
I guess the only way to make the MD accept such long words is to teach it to recognize camelCased words.
So a patch for https://github.com/Ousret/charset_normalizer/blob/master/charset_normalizer/md.py#L252 will be worked on.
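
To illustrate what recognizing camelCased words could look like, here is a minimal sketch that splits such a word at its case boundaries. It is not the patch to md.py and does not reflect how the mess detector is implemented; split_camel_case is a hypothetical helper, and only the standard library is used.

```python
from typing import List


def split_camel_case(word: str) -> List[str]:
    """Split a camelCased word into its parts (hypothetical helper).

    Relies on str.isupper()/str.islower(), which work for Cyrillic
    letters as well as Latin ones.
    """
    parts: List[str] = []
    start = 0
    for i in range(1, len(word)):
        prev, cur = word[i - 1], word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if prev.islower() and cur.isupper():
            # Boundary between a lowercase run and a new capitalized word.
            parts.append(word[start:i])
            start = i
        elif prev.isupper() and cur.isupper() and nxt.islower():
            # End of an all-caps acronym followed by a capitalized word.
            parts.append(word[start:i])
            start = i
    parts.append(word[start:])
    return parts


if __name__ == "__main__":
    print(split_camel_case("ФИОРуководителяОрганизации"))
```

A detector could then evaluate each part on its own instead of treating the whole identifier as one suspiciously long word.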

oleksandr-kuzmenko · 2022-04-28T12:10:15Z

Thanks for your reply.
PR has been sent.
I'll try to use threshold=0.1 to get the correct encoding.

oleksandr-kuzmenko added the bug and help wanted labels Apr 27, 2022

oleksandr-kuzmenko mentioned this issue Apr 28, 2022

CLI default threshold aligned with the API threshold #181

Merged

oleksandr-kuzmenko closed this as completed May 12, 2022
