
[DETECTION] Incorrect natural language detection #200

Closed
cdelledonne opened this issue Jul 19, 2022 · 4 comments
Labels
enhancement New feature or request

Comments

@cdelledonne

Notice

I hereby announce that my raw input is not:

  • Too small content (<=32 characters) as I do know that ANY charset detector heavily depends on content
  • Encoded in a deprecated/abandoned encoding that is not even supported by my interpreter

Provide the file

2001 A Space Odyssey (1968).it.srt.txt

Verbose output

2022-07-19 11:02:12,521 | Level 5 | Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0xc3 in position 851: ordinal not in range(128)
2022-07-19 11:02:12,521 | Level 5 | Code page utf_8 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-07-19 11:02:12,527 | Level 5 | utf_8 passed initial chaos probing. Mean measured chaos is 4.660000 %
2022-07-19 11:02:12,529 | Level 5 | We detected language [('English', 1.0), ('Dutch', 1.0), ('Italian', 0.9891), ('Spanish', 0.9762), ('Portuguese', 0.9565), ('French', 0.9545), ('German', 0.9295)] using utf_8
2022-07-19 11:02:12,529 | DEBUG | Encoding detection: utf_8 is most likely the one.
{
    "path": "/home/pianetto/storage/media/movies/2001 A Space Odyssey (1968)/2001 A Space Odyssey (1968).it.srt",
    "encoding": "utf_8",
    "encoding_aliases": [
        "u8",
        "utf",
        "utf8",
        "utf8_ucs2",
        "utf8_ucs4",
        "cp65001"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 4.66,
    "coherence": 100.0,
    "unicode_path": null,
    "is_preferred": true
}

Expected encoding

I use charset_normalizer mostly to detect the language of subtitle files. The results are not as expected for a bunch of files; I have posted just one here, but let me know if you would like more samples. In this particular case, I expect to get Italian as the language, but I get English instead.

Desktop

  • OS: Fedora Linux 36
  • Python version: Python 3.10.5 (but tried with 3.8.13 as well)
  • Package version: 2.1.0 (but tried with 2.0.12 as well)
@cdelledonne cdelledonne added the detection (Related to the charset detection mechanism, chaos/mess/coherence) and help wanted (Extra attention is needed) labels on Jul 19, 2022
@Ousret
Member

Ousret commented Jul 19, 2022

Investigation around the natural language detection

First, yes, I could reproduce your case.
This library, by default, extracts five chunks; here is the character-frequency analysis debug output for each of them.

What the library sees

Chunk 1

DEBUG COMMON [('o', 28), ('i', 22), ('a', 20), ('e', 19), ('r', 17), ('n', 16), ('l', 13), ('t', 9), ('v', 8), ('s', 6), ('g', 6), ('d', 5), ('p', 5), ('b', 5), ('u', 5), ('c', 5), ('m', 4), ('z', 2), ('h', 2), ('f', 1)]
FOUND [('English', 1.0), ('Dutch', 1.0), ('German', 0.95)]

Chunk 2

DEBUG COMMON [('i', 24), ('e', 23), ('o', 17), ('r', 15), ('n', 14), ('g', 11), ('a', 11), ('s', 11), ('t', 10), ('l', 5), ('c', 5), ('z', 4), ('d', 4), ('u', 4), ('m', 4), ('h', 4), ('v', 3), ('p', 3), ('b', 2), ('à', 1), ('f', 1), ('è', 1)]
FOUND [('Italian', 1.0), ('French', 0.9545), ('German', 0.9091)]

Chunk 3

DEBUG COMMON [('i', 45), ('e', 28), ('o', 27), ('n', 21), ('a', 20), ('r', 20), ('s', 17), ('l', 12), ('t', 12), ('m', 8), ('c', 7), ('d', 7), ('z', 6), ('p', 5), ('v', 5), ('b', 4), ('g', 4), ('u', 3), ('q', 2), ('h', 2), ('f', 1), ('é', 1), ('è', 1)]
FOUND [('French', 0.9565), ('Italian', 0.9565), ('Portuguese', 0.9565)]

Chunk 4

DEBUG COMMON [('e', 33), ('i', 33), ('o', 23), ('s', 22), ('a', 18), ('t', 17), ('n', 16), ('u', 11), ('r', 11), ('c', 10), ('m', 9), ('l', 8), ('d', 7), ('p', 6), ('f', 4), ('v', 3), ('q', 3), ('b', 3), ('h', 2), ('è', 1), ('z', 1)]
FOUND [('Italian', 1.0), ('French', 0.9524), ('Spanish', 0.9524)]

Chunk 5

DEBUG COMMON [('o', 31), ('e', 24), ('a', 21), ('n', 18), ('l', 17), ('r', 15), ('s', 13), ('t', 13), ('c', 10), ('i', 9), ('m', 8), ('d', 8), ('u', 6), ('b', 5), ('v', 4), ('p', 2), ('z', 2), ('f', 2), ('q', 1), ('g', 1)]
FOUND [('English', 1.0), ('Italian', 1.0), ('Spanish', 1.0)]
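
For reference, a minimal sketch (not the library's internal code) of how such a per-chunk letter tally can be reproduced with the standard library; the chunk size and chunk count below are illustrative assumptions.

from collections import Counter
from pathlib import Path

def chunk_letter_frequencies(path: str, chunks: int = 5, chunk_size: int = 512) -> None:
    """Print the most common letters of a few evenly spaced chunks of a UTF-8 file."""
    text = Path(path).read_text(encoding="utf-8")
    step = max(len(text) // chunks, 1)
    for index in range(chunks):
        chunk = text[index * step : index * step + chunk_size]
        letters = Counter(character.lower() for character in chunk if character.isalpha())
        print(f"Chunk {index + 1}:", letters.most_common(10))

# chunk_letter_frequencies("2001 A Space Odyssey (1968).it.srt")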

Why does it say that?

Based on the character frequencies below

"English": [
        "e",
        "a",
        "t",
        "i",
        "o",
        "n",
        "s",
        "r",
        "h",
        "l",
        "d",
        "c",
        "u",
        "m",
        "f",
        "p",
        "g",
        "w",
        "y",
        "b",
        "v",
        "k",
        "x",
        "j",
        "z",
        "q",
    ],
...

"Italian": [
        "e",
        "i",
        "a",
        "o",
        "n",
        "l",
        "t",
        "r",
        "s",
        "c",
        "d",
        "u",
        "p",
        "m",
        "g",
        "v",
        "f",
        "b",
        "z",
        "h",
        "q",
        "è",
        "à",
        "k",
        "y",
        "ò",
    ],

AND

from typing import List


def characters_popularity_compare(
    language: str, ordered_characters: List[str]
) -> float:
    """
    Determine if an ordered characters list (by occurrence from most appearances to rarest) matches a particular language.
    The result is a ratio between 0. (absolutely no correspondence) and 1. (near perfect fit).
    Beware that its function is not strict on the match in order to ease the detection. (Meaning close match is 1.)
    """
    ...  # body omitted in this quote; see charset_normalizer/cd.py

We compare our extraction against these reference orderings, but NOT in a strict way. That is why the Latin-based languages may sometimes get entangled.
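
A rough sketch of that loose comparison idea (not the library's exact algorithm): a character counts as matching when its rank in the observed ordering sits within a tolerance window of its rank in the reference ordering. The tolerance value here is an illustrative assumption.

from typing import List

def loose_rank_match(reference: List[str], observed: List[str], tolerance: int = 4) -> float:
    """Ratio of observed characters whose rank is close to their reference rank."""
    matches = 0
    considered = 0
    for position, character in enumerate(observed):
        if character not in reference:
            continue
        considered += 1
        if abs(reference.index(character) - position) <= tolerance:
            matches += 1
    return matches / considered if considered else 0.0

# loose_rank_match(italian_order, observed_order)  # italian_order: the "Italian" list quoted above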

The main goal of charset-normalizer is still to offer you the best-suiting character encoding.
Natural language detection is a secondary aspect, but we may still need to find some non-breaking way to improve it.

Being stricter on natural language detection runs counter to our main goal (in most cases).
You may argue that our natural language detection is more inclined toward detecting intelligent design first. (I would agree.)

What can we do?

My first idea going forward is to patch the function characters_popularity_compare to be a bit less lax.
Or switch to using n-grams, though I am less confident about the performance outcome.
Or improve the function merge_coherence_ratios so that it better reflects the most probable language first.
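
To give an idea of the n-gram route, here is a minimal sketch of the classic rank-based ("out-of-place") trigram comparison; the profile size and the per-language profiles are illustrative assumptions, not something shipped with charset-normalizer.

from collections import Counter
from typing import List

def trigram_profile(text: str, size: int = 300) -> List[str]:
    """Most common character trigrams of a text, most frequent first."""
    padded = f"  {text.lower()}  "
    trigrams = Counter(padded[i : i + 3] for i in range(len(padded) - 2))
    return [gram for gram, _ in trigrams.most_common(size)]

def out_of_place_distance(document: List[str], language: List[str]) -> int:
    """Sum of rank differences between two profiles; lower means a closer match."""
    max_penalty = len(language)
    return sum(
        abs(rank - language.index(gram)) if gram in language else max_penalty
        for rank, gram in enumerate(document)
    )

# language_profiles: dict mapping a language name to a profile built from a reference corpus (assumption)
# best = min(language_profiles, key=lambda name: out_of_place_distance(trigram_profile(text), language_profiles[name]))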

What can you do immediately?

Use a dedicated natural-language (n-gram) detector, even if it slows down your process; a sketch follows below. I infer that you use the detected language to rename the file to the proper LG.srt, where LG is the two-letter ISO language code.
Sharing the complete dataset would also help a lot.
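
Something along these lines would do, using a dedicated n-gram detector such as langdetect (a third-party package, not part of charset-normalizer); the helper below is only a sketch.

from pathlib import Path

from langdetect import detect  # pip install langdetect

def rename_with_language_code(srt_path: str) -> Path:
    """Detect the subtitle language and rename 'movie.srt' to 'movie.<lg>.srt'."""
    path = Path(srt_path)
    text = path.read_text(encoding="utf-8", errors="replace")
    code = detect(text)  # two-letter ISO 639-1 code, e.g. 'it'
    return path.rename(path.with_name(f"{path.stem}.{code}{path.suffix}"))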

When?

I am not confident tweaking this right now; I must do some thorough thinking and planning first. Though contributions are welcome.

Hope that explains things.

@Ousret Ousret changed the title [DETECTION] Incorrect language detection [DETECTION] Incorrect natural language detection Jul 19, 2022
@Ousret Ousret added the enhancement (New feature or request) label and removed the detection (Related to the charset detection mechanism, chaos/mess/coherence) label on Jul 19, 2022
@Ousret
Member

Ousret commented Jul 19, 2022

Could you try the branch patch-lg-detect-hotfix against your dataset of srt files?

@Ousret Ousret linked a pull request Jul 19, 2022 that will close this issue
@cdelledonne
Author

Thank you so much for the very thorough explanation, I appreciate that.

I tried your new branch and it does produce a correct result for the subtitle file that I shared, but the results for my database of subtitles are unfortunately all over the place. Here's my complete database of subtitles, hope it can be of some help:
subsdb.zip

Most subtitles have a two-letter language code in their name, so it's easy to verify the correctness of this library against those files. Other files don't have a language code, and as you guessed I was hoping to use this library to rename them.
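
For illustration, a minimal sketch of how the detected language can be checked against the two-letter code in the filename; the code-to-name mapping below is an abbreviated assumption, not taken from the library.

from pathlib import Path

from charset_normalizer import from_path

# Abbreviated mapping from filename codes to the language names reported by the library.
CODE_TO_NAME = {"it": "Italian", "en": "English", "nl": "Dutch", "fr": "French"}

def check_subtitle(path: Path) -> bool:
    """Compare the detected language with the code embedded in 'title.<lg>.srt'."""
    suffixes = path.suffixes
    expected = CODE_TO_NAME.get(suffixes[-2].lstrip(".")) if len(suffixes) >= 2 else None
    best = from_path(path).best()
    detected = best.language if best is not None else None
    print(f"{path.name}: expected={expected}, detected={detected}")
    return expected is not None and expected == detected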

Ousret added a commit that referenced this issue Oct 18, 2022
Improve the condition on issue #200
@Ousret
Member

Ousret commented Oct 18, 2022

I have improved the language detector in v3.0 (rc1);
still, it is not as good as a dedicated language detector (n-grams) and will (likely) never be.
I consider this issue to be addressed.
