
[DETECTION] Incorrect natural language detection #200

Closed
cdelledonne opened this issue Jul 19, 2022 · 4 comments
Labels
enhancement New feature or request

Comments

@cdelledonne

Notice

I hereby announce that my raw input is not:

  • Too small content (<=32 characters) as I do know that ANY charset detector heavily depends on content
  • Encoded in a deprecated/abandoned encoding that is not even supported by my interpreter

Provide the file

2001 A Space Odyssey (1968).it.srt.txt

Verbose output

2022-07-19 11:02:12,521 | Level 5 | Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0xc3 in position 851: ordinal not in range(128)
2022-07-19 11:02:12,521 | Level 5 | Code page utf_8 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-07-19 11:02:12,527 | Level 5 | utf_8 passed initial chaos probing. Mean measured chaos is 4.660000 %
2022-07-19 11:02:12,529 | Level 5 | We detected language [('English', 1.0), ('Dutch', 1.0), ('Italian', 0.9891), ('Spanish', 0.9762), ('Portuguese', 0.9565), ('French', 0.9545), ('German', 0.9295)] using utf_8
2022-07-19 11:02:12,529 | DEBUG | Encoding detection: utf_8 is most likely the one.
{
    "path": "/home/pianetto/storage/media/movies/2001 A Space Odyssey (1968)/2001 A Space Odyssey (1968).it.srt",
    "encoding": "utf_8",
    "encoding_aliases": [
        "u8",
        "utf",
        "utf8",
        "utf8_ucs2",
        "utf8_ucs4",
        "cp65001"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 4.66,
    "coherence": 100.0,
    "unicode_path": null,
    "is_preferred": true
}

Expected encoding

I use charset_normalizer mostly to detect the language of subtitle files. The results are not as expected for a bunch of files; I have posted just one here, but let me know if you would like more samples. In this particular case, I expect to get Italian as the language, but I get English instead.

Desktop

  • OS: Fedora Linux 36
  • Python version: Python 3.10.5 (but tried with 3.8.13 as well)
  • Package version: 2.1.0 (but tried with 2.0.12 as well)
@cdelledonne cdelledonne added the detection (Related to the charset detection mechanism, chaos/mess/coherence) and help wanted (Extra attention is needed) labels on Jul 19, 2022
@Ousret
Member

Ousret commented Jul 19, 2022

Investigation around the natural language detection

First, yes, I could reproduce your case.
This library, by default, extracts five chunks; here is the character-frequency analysis debug output for each of them.

What the library sees

Chunk 1

DEBUG COMMON [('o', 28), ('i', 22), ('a', 20), ('e', 19), ('r', 17), ('n', 16), ('l', 13), ('t', 9), ('v', 8), ('s', 6), ('g', 6), ('d', 5), ('p', 5), ('b', 5), ('u', 5), ('c', 5), ('m', 4), ('z', 2), ('h', 2), ('f', 1)]
FOUND [('English', 1.0), ('Dutch', 1.0), ('German', 0.95)]

Chunk 2

DEBUG COMMON [('i', 24), ('e', 23), ('o', 17), ('r', 15), ('n', 14), ('g', 11), ('a', 11), ('s', 11), ('t', 10), ('l', 5), ('c', 5), ('z', 4), ('d', 4), ('u', 4), ('m', 4), ('h', 4), ('v', 3), ('p', 3), ('b', 2), ('à', 1), ('f', 1), ('è', 1)]
FOUND [('Italian', 1.0), ('French', 0.9545), ('German', 0.9091)]

Chunk 3

DEBUG COMMON [('i', 45), ('e', 28), ('o', 27), ('n', 21), ('a', 20), ('r', 20), ('s', 17), ('l', 12), ('t', 12), ('m', 8), ('c', 7), ('d', 7), ('z', 6), ('p', 5), ('v', 5), ('b', 4), ('g', 4), ('u', 3), ('q', 2), ('h', 2), ('f', 1), ('é', 1), ('è', 1)]
FOUND [('French', 0.9565), ('Italian', 0.9565), ('Portuguese', 0.9565)]

Chunk 4

DEBUG COMMON [('e', 33), ('i', 33), ('o', 23), ('s', 22), ('a', 18), ('t', 17), ('n', 16), ('u', 11), ('r', 11), ('c', 10), ('m', 9), ('l', 8), ('d', 7), ('p', 6), ('f', 4), ('v', 3), ('q', 3), ('b', 3), ('h', 2), ('è', 1), ('z', 1)]
FOUND [('Italian', 1.0), ('French', 0.9524), ('Spanish', 0.9524)]

Chunk 5

DEBUG COMMON [('o', 31), ('e', 24), ('a', 21), ('n', 18), ('l', 17), ('r', 15), ('s', 13), ('t', 13), ('c', 10), ('i', 9), ('m', 8), ('d', 8), ('u', 6), ('b', 5), ('v', 4), ('p', 2), ('z', 2), ('f', 2), ('q', 1), ('g', 1)]
FOUND [('English', 1.0), ('Italian', 1.0), ('Spanish', 1.0)]
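
For reference, a minimal sketch (not the library's internal code) of how such a per-chunk letter tally can be reproduced with the standard library; the chunk size and chunk count below are illustrative assumptions.

from collections import Counter
from pathlib import Path

def chunk_letter_frequencies(path: str, chunks: int = 5, chunk_size: int = 512) -> None:
    """Print the most common letters of a few evenly spaced chunks of a UTF-8 file."""
    text = Path(path).read_text(encoding="utf-8")
    step = max(len(text) // chunks, 1)
    for index in range(chunks):
        chunk = text[index * step : index * step + chunk_size]
        letters = Counter(character.lower() for character in chunk if character.isalpha())
        print(f"Chunk {index + 1}:", letters.most_common(10))

# chunk_letter_frequencies("2001 A Space Odyssey (1968).it.srt")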

Why does it say that?

Based on the character frequencies below

"English": [
        "e",
        "a",
        "t",
        "i",
        "o",
        "n",
        "s",
        "r",
        "h",
        "l",
        "d",
        "c",
        "u",
        "m",
        "f",
        "p",
        "g",
        "w",
        "y",
        "b",
        "v",
        "k",
        "x",
        "j",
        "z",
        "q",
    ],
...

"Italian": [
        "e",
        "i",
        "a",
        "o",
        "n",
        "l",
        "t",
        "r",
        "s",
        "c",
        "d",
        "u",
        "p",
        "m",
        "g",
        "v",
        "f",
        "b",
        "z",
        "h",
        "q",
        "è",
        "à",
        "k",
        "y",
        "ò",
    ],

AND

from typing import List


def characters_popularity_compare(
    language: str, ordered_characters: List[str]
) -> float:
    """
    Determine if an ordered characters list (by occurrence from most appearances to rarest) matches a particular language.
    The result is a ratio between 0. (absolutely no correspondence) and 1. (near perfect fit).
    Beware that its function is not strict on the match in order to ease the detection. (Meaning close match is 1.)
    """
    ...  # body omitted in this quote; see charset_normalizer/cd.py

We compare our extraction against these reference orderings, but NOT in a strict way. That is why the Latin-based languages may sometimes get entangled.
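
A rough sketch of that loose comparison idea (not the library's exact algorithm): a character counts as matching when its rank in the observed ordering sits within a tolerance window of its rank in the reference ordering. The tolerance value here is an illustrative assumption.

from typing import List

def loose_rank_match(reference: List[str], observed: List[str], tolerance: int = 4) -> float:
    """Ratio of observed characters whose rank is close to their reference rank."""
    matches = 0
    considered = 0
    for position, character in enumerate(observed):
        if character not in reference:
            continue
        considered += 1
        if abs(reference.index(character) - position) <= tolerance:
            matches += 1
    return matches / considered if considered else 0.0

# loose_rank_match(italian_order, observed_order)  # italian_order: the "Italian" list quoted above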

The main goal of charset-normalizer is still to offer you the best-suiting character encoding.
Natural language detection is a secondary aspect, but we may still need to find some non-breaking way to improve it.

Being stricter on natural language detection runs counter to our main goal (in most cases).
You may argue that our natural language detection is more inclined toward detecting intelligent design first. (I would agree.)

What can we do?

My first idea going forward is to patch the function characters_popularity_compare to be a bit less lax.
Or switch to using n-grams, though I am less confident about the performance outcome.
Or improve the function merge_coherence_ratios so that it better reflects the most probable language first.
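
To give an idea of the n-gram route, here is a minimal sketch of the classic rank-based ("out-of-place") trigram comparison; the profile size and the per-language profiles are illustrative assumptions, not something shipped with charset-normalizer.

from collections import Counter
from typing import List

def trigram_profile(text: str, size: int = 300) -> List[str]:
    """Most common character trigrams of a text, most frequent first."""
    padded = f"  {text.lower()}  "
    trigrams = Counter(padded[i : i + 3] for i in range(len(padded) - 2))
    return [gram for gram, _ in trigrams.most_common(size)]

def out_of_place_distance(document: List[str], language: List[str]) -> int:
    """Sum of rank differences between two profiles; lower means a closer match."""
    max_penalty = len(language)
    return sum(
        abs(rank - language.index(gram)) if gram in language else max_penalty
        for rank, gram in enumerate(document)
    )

# language_profiles: dict mapping a language name to a profile built from a reference corpus (assumption)
# best = min(language_profiles, key=lambda name: out_of_place_distance(trigram_profile(text), language_profiles[name]))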

What can you do immediately?

Use a dedicated natural-language (n-gram) detector, even if it slows down your process; a sketch follows below. I infer that you use the detected language to rename the file to the proper LG.srt, where LG is the two-letter ISO language code.
Sharing the complete dataset would also help a lot.
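
Something along these lines would do, using a dedicated n-gram detector such as langdetect (a third-party package, not part of charset-normalizer); the helper below is only a sketch.

from pathlib import Path

from langdetect import detect  # pip install langdetect

def rename_with_language_code(srt_path: str) -> Path:
    """Detect the subtitle language and rename 'movie.srt' to 'movie.<lg>.srt'."""
    path = Path(srt_path)
    text = path.read_text(encoding="utf-8", errors="replace")
    code = detect(text)  # two-letter ISO 639-1 code, e.g. 'it'
    return path.rename(path.with_name(f"{path.stem}.{code}{path.suffix}"))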

When?

I am not confident tweaking this right now; I must do some thorough thinking and planning first. Though contributions are welcome.

Hope that explains things.

@Ousret Ousret changed the title [DETECTION] Incorrect language detection [DETECTION] Incorrect natural language detection Jul 19, 2022
@Ousret Ousret added the enhancement (New feature or request) label and removed the detection (Related to the charset detection mechanism, chaos/mess/coherence) label on Jul 19, 2022
@Ousret
Member

Ousret commented Jul 19, 2022

Could you try the branch patch-lg-detect-hotfix against your dataset of srt files?

@Ousret Ousret linked a pull request Jul 19, 2022 that will close this issue
@cdelledonne
Author

Thank you so much for the very thorough explanation, I appreciate that.

I tried your new branch and it does produce a correct result for the subtitle file that I shared, but the results for my database of subtitles are unfortunately all over the place. Here's my complete database of subtitles, hope it can be of some help:
subsdb.zip

Most subtitles have a two-letter language code in their name, so it's easy to verify the correctness of this library against those files. Other files don't have a language code, and as you guessed I was hoping to use this library to rename them.
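
For illustration, a minimal sketch of how the detected language can be checked against the two-letter code in the filename; the code-to-name mapping below is an abbreviated assumption, not taken from the library.

from pathlib import Path

from charset_normalizer import from_path

# Abbreviated mapping from filename codes to the language names reported by the library.
CODE_TO_NAME = {"it": "Italian", "en": "English", "nl": "Dutch", "fr": "French"}

def check_subtitle(path: Path) -> bool:
    """Compare the detected language with the code embedded in 'title.<lg>.srt'."""
    suffixes = path.suffixes
    expected = CODE_TO_NAME.get(suffixes[-2].lstrip(".")) if len(suffixes) >= 2 else None
    best = from_path(path).best()
    detected = best.language if best is not None else None
    print(f"{path.name}: expected={expected}, detected={detected}")
    return expected is not None and expected == detected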

Ousret added a commit that referenced this issue Oct 18, 2022
Improve the condition on issue #200
@Ousret
Member

Ousret commented Oct 18, 2022

I have improved the language detector in v3.0 (rc1);
still, it is not as good as a dedicated language detector (n-grams) and will (likely) never be.
I consider this issue to be addressed.
