from_bytes vs CLI results #180

Closed
oleksandr-kuzmenko opened this issue Apr 27, 2022 · 2 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments

Contributor

oleksandr-kuzmenko commented Apr 27, 2022

Describe the bug
The encoding detected by the from_bytes method differs from the encoding reported by the CLI for the same file.

To Reproduce
A repository with the full example: https://github.com/oleksandr-kuzmenko/charset-normalizer-test

Desktop:

  • OS: macOS 12.3
  • Python version: 3.9.12
  • Package version: 2.0.12

oleksandr-kuzmenko added the bug and help wanted labels Apr 27, 2022

Member

Ousret commented Apr 28, 2022

Thanks for the detailed issue report.
I can confirm there is something wrong.

Debugging

from charset_normalizer import from_path

if __name__ == "__main__":
    results = from_path("./file.xml", cp_isolation=['cp1251', 'mac_greek'], explain=True)
2022-04-28 06:30:31,635 | Level 5 | cp_isolation is set. use this flag for debugging purpose. limited list of encoding allowed : cp1251, mac_greek.
2022-04-28 06:30:31,635 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1251.
2022-04-28 06:30:31,644 | Level 5 | cp1251 passed initial chaos probing. Mean measured chaos is 13.780000 %
2022-04-28 06:30:31,646 | Level 5 | cp1251 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2022-04-28 06:30:31,650 | Level 5 | We detected language [('Russian', 0.9138), ('Ukrainian', 0.8923), ('Bulgarian', 0.8372), ('Serbian', 0.748)] using cp1251
2022-04-28 06:30:31,660 | Level 5 | mac_greek passed initial chaos probing. Mean measured chaos is 10.140000 %
2022-04-28 06:30:31,660 | Level 5 | mac_greek should target any language(s) of ['Greek']
2022-04-28 06:30:31,663 | Level 5 | We detected language [('Greek', 0.7196)] using mac_greek
2022-04-28 06:30:31,664 | DEBUG | Encoding detection: Found mac_greek as plausible (best-candidate) for content. With 1 alternatives.
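
The log above shows both cp1251 and mac_greek passing the initial probing. One reason this can happen is that both are single-byte code pages, so the same bytes decode without error under either one and only heuristics can tell them apart. A minimal standard-library sketch of that ambiguity, using the word quoted later in this thread as sample input (it does not call charset_normalizer and does not reproduce its scoring):

```python
# Sample word taken from this thread; any Cyrillic text would do.
word = "ФИОРуководителяОрганизации"

# Encode once as cp1251, the encoding the file really uses.
raw = word.encode("cp1251")

# The very same bytes decode cleanly under both code pages.
as_cyrillic = raw.decode("cp1251")
as_greek = raw.decode("mac_greek")

print(as_cyrillic)  # the original Cyrillic word
print(as_greek)     # a same-length string of unrelated characters
```

Since neither decode raises, a detector cannot rule out mac_greek on byte validity alone and has to rely on measures such as the chaos and language scores seen in the log.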

Defaults diverge between the CLI and the code

If you look at:
https://github.com/Ousret/charset_normalizer/blob/master/charset_normalizer/cli/normalizer.py#L114

The default assigned to threshold there is 0.1, while it is 0.2 in all charset_normalizer.api functions.

Manually setting it to 0.2 in the CLI removes the difference. I can accept a PR that aligns the CLI default threshold with the current 0.2.
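
To compare both thresholds from code rather than from the CLI, the threshold keyword can be passed explicitly to from_bytes. The sketch below is only an illustration: the detect helper and the sample payload are hypothetical, and the result for any given input depends on the installed charset_normalizer version, so no output is claimed here.

```python
import inspect

from charset_normalizer import from_bytes

# Read the API-side default instead of hard-coding it.
api_default = inspect.signature(from_bytes).parameters["threshold"].default
print("API default threshold:", api_default)


def detect(payload: bytes, threshold: float):
    """Hypothetical helper: name of the best guess, or None if nothing fits."""
    best = from_bytes(payload, threshold=threshold).best()
    return best.encoding if best is not None else None


# Illustrative payload, not the file from the issue.
payload = "Руководитель организации подписал документ.".encode("cp1251")

for threshold in (0.1, 0.2):
    print(threshold, detect(payload, threshold))
```

Passing the threshold explicitly on both sides keeps CLI and API runs comparable regardless of which default each one ships with.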

Why does it say it's mac_greek?

Simply put, the mess detector does not like ФИОРуководителяОрганизации, even though it is certainly valid.
I guess the only way to make the MD accept such long words is to teach it to recognize camelCased words.
So a patch for https://github.com/Ousret/charset_normalizer/blob/master/charset_normalizer/md.py#L252 will be worked on.
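
To illustrate what recognizing camelCased words could mean, here is a rough sketch that splits a word at case boundaries so that one very long token becomes several ordinary-length ones. It is not the patch to md.py; split_camel_case is a hypothetical helper, and it assumes that the case information from str.isupper and str.islower is enough for the scripts involved.

```python
def split_camel_case(word: str) -> list:
    """Split a word at lower-to-upper boundaries and at the end of an acronym."""
    pieces = []
    start = 0
    for i in range(1, len(word)):
        prev, cur = word[i - 1], word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        # Boundary such as "яО": a lowercase letter followed by an uppercase one.
        lower_to_upper = prev.islower() and cur.isupper()
        # Boundary such as "ОРу": the last capital of a run starts a new word.
        acronym_end = prev.isupper() and cur.isupper() and nxt.islower()
        if lower_to_upper or acronym_end:
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return pieces


print(split_camel_case("ФИОРуководителяОрганизации"))
# → ['ФИО', 'Руководителя', 'Организации']
```

Each resulting piece is short enough to look like a normal word, which is the property a camelCase-aware check would presumably rely on.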

Contributor Author

oleksandr-kuzmenko commented Apr 28, 2022

Thanks for your reply.
A PR has been sent.
I'll try to use threshold=0.1 to get the correct encoding.
