[DETECTION] Incorrect natural language detection #200
### Investigation around the natural language detection

This library, by default, extracts five chunks. Here is the character-analysis debug output for each of them.

#### What the library sees

```
Chunk 1  DEBUG COMMON [('o', 28), ('i', 22), ('a', 20), ('e', 19), ('r', 17), ('n', 16), ('l', 13), ('t', 9), ('v', 8), ('s', 6), ('g', 6), ('d', 5), ('p', 5), ('b', 5), ('u', 5), ('c', 5), ('m', 4), ('z', 2), ('h', 2), ('f', 1)]
Chunk 2  DEBUG COMMON [('i', 24), ('e', 23), ('o', 17), ('r', 15), ('n', 14), ('g', 11), ('a', 11), ('s', 11), ('t', 10), ('l', 5), ('c', 5), ('z', 4), ('d', 4), ('u', 4), ('m', 4), ('h', 4), ('v', 3), ('p', 3), ('b', 2), ('à', 1), ('f', 1), ('è', 1)]
Chunk 3  DEBUG COMMON [('i', 45), ('e', 28), ('o', 27), ('n', 21), ('a', 20), ('r', 20), ('s', 17), ('l', 12), ('t', 12), ('m', 8), ('c', 7), ('d', 7), ('z', 6), ('p', 5), ('v', 5), ('b', 4), ('g', 4), ('u', 3), ('q', 2), ('h', 2), ('f', 1), ('é', 1), ('è', 1)]
Chunk 4  DEBUG COMMON [('e', 33), ('i', 33), ('o', 23), ('s', 22), ('a', 18), ('t', 17), ('n', 16), ('u', 11), ('r', 11), ('c', 10), ('m', 9), ('l', 8), ('d', 7), ('p', 6), ('f', 4), ('v', 3), ('q', 3), ('b', 3), ('h', 2), ('è', 1), ('z', 1)]
Chunk 5  DEBUG COMMON [('o', 31), ('e', 24), ('a', 21), ('n', 18), ('l', 17), ('r', 15), ('s', 13), ('t', 13), ('c', 10), ('i', 9), ('m', 8), ('d', 8), ('u', 6), ('b', 5), ('v', 4), ('p', 2), ('z', 2), ('f', 2), ('q', 1), ('g', 1)]
```

#### Why does it say that

The verdict is based on the character frequencies below:

```python
"English": [
    "e", "a", "t", "i", "o", "n", "s", "r", "h", "l", "d", "c", "u",
    "m", "f", "p", "g", "w", "y", "b", "v", "k", "x", "j", "z", "q",
],
...
"Italian": [
    "e", "i", "a", "o", "n", "l", "t", "r", "s", "c", "d", "u", "p",
    "m", "g", "v", "f", "b", "z", "h", "q", "è", "à", "k", "y", "ò",
],
```

combined with this function:

```python
def characters_popularity_compare(
    language: str, ordered_characters: List[str]
) -> float:
    """
    Determine if an ordered characters list (by occurrence from most appearances to rarest) matches a particular language.
    The result is a ratio between 0. (absolutely no correspondence) and 1. (near perfect fit).
    Beware that this function is not strict on the match, in order to ease the detection. (Meaning a close match scores 1.)
    """
```

We compare our extraction against those reference orderings, but NOT in a strict way. That is why Latin-based languages may sometimes get entangled. The main goal of charset-normalizer is still to offer you the best-suiting character encoding. Being stricter on natural language detection would work against that main goal in most cases.

#### What can we do?

My first idea going forward is to patch this function.

#### What can you do immediately?

Use a dedicated natural-language (n-gram) detector, even if it slows down your process. I infer that you use the detected language to rename the file to the proper LG.srt, where LG is the two-character ISO language code.

#### When?

I am not confident tweaking this as of right now; I must do some thorough thinking and planning first. Though contributions are welcome.

Hope that explains things.
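To make the "not strict" comparison concrete, here is a minimal, hypothetical sketch of a rank-based loose match between an observed character ordering and a language's reference ordering. This is not charset-normalizer's actual implementation; the tolerance window of 4 ranks and the scoring rule are illustrative assumptions only. The chunk ordering is taken from the Chunk 1 debug output above.

```python
from typing import List

def loose_popularity_compare(reference: List[str], observed: List[str],
                             window: int = 4) -> float:
    """Ratio of observed characters whose rank lies within `window`
    positions of their rank in the reference ordering (hypothetical)."""
    matches = 0
    for idx, ch in enumerate(observed):
        # Characters absent from the reference alphabet simply do not match.
        if ch in reference and abs(reference.index(ch) - idx) <= window:
            matches += 1
    return matches / len(observed) if observed else 0.0

english = ["e", "a", "t", "i", "o", "n", "s", "r", "h", "l", "d", "c", "u",
           "m", "f", "p", "g", "w", "y", "b", "v", "k", "x", "j", "z", "q"]
italian = ["e", "i", "a", "o", "n", "l", "t", "r", "s", "c", "d", "u", "p",
           "m", "g", "v", "f", "b", "z", "h", "q", "è", "à", "k", "y", "ò"]

# Character ordering of Chunk 1 from the debug output:
chunk_1 = ["o", "i", "a", "e", "r", "n", "l", "t", "v", "s",
           "g", "d", "p", "b", "u", "c", "m", "z", "h", "f"]

print(loose_popularity_compare(english, chunk_1))  # 0.65
print(loose_popularity_compare(italian, chunk_1))  # 0.9
```

Under these made-up parameters Italian scores higher for this chunk; the point is only that with a wide tolerance window, closely related Latin-script languages produce scores that are near one another, which is how they get entangled.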
Could you try the branch against your dataset of srt files?
Thank you so much for the very thorough explanation, I appreciate that. I tried your new branch and it does produce a correct result for the subtitle file that I shared, but the results for my database of subtitles are all over the place unfortunately. Here's my complete database of subtitles, hope it can be of some help: Most subtitles have a two-letter language code in their name, so it's easy to verify the correctness of this library against those files. Other files don't have a language code, and as you guessed I was hoping to use this library to rename them.
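Verifying detections against that dataset amounts to pulling the two-letter code out of each filename and comparing it with the detector's answer. A small stdlib sketch of that extraction, assuming the `name.XX.srt` convention described above (the helper name is hypothetical, not part of charset-normalizer):

```python
from pathlib import Path
from typing import Optional

def language_code_from_name(filename: str) -> Optional[str]:
    """Return the two-letter language code before the .srt extension,
    if the filename follows the `name.XX.srt` convention."""
    suffixes = Path(filename).suffixes  # e.g. ['.it', '.srt']
    if len(suffixes) >= 2 and suffixes[-1] == ".srt":
        code = suffixes[-2].lstrip(".")
        if len(code) == 2 and code.isalpha():
            return code.lower()
    return None  # no language code embedded in the name

print(language_code_from_name("2001 A Space Odyssey (1968).it.srt"))  # it
print(language_code_from_name("Unknown Movie.srt"))                   # None
```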
I have improved the language detector in v3.0 (rc1).
### Notice

I hereby announce that my raw input is not:

### Provide the file

2001 A Space Odyssey (1968).it.srt.txt

### Verbose output

### Expected encoding

I use charset_normalizer mostly to detect the language of subtitle files. The results are not those expected on a bunch of files; I just posted one here, but let me know if you'd like to have more samples. In this particular case, I'm expecting to get Italian as the language, but I get English instead.
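The per-chunk `DEBUG COMMON` listings shown earlier in the thread are simply characters ordered by frequency. A minimal stdlib sketch of producing that shape of output from a decoded chunk (illustrative only, not the library's actual code):

```python
from collections import Counter
from typing import List, Tuple

def ordered_characters(chunk: str) -> List[Tuple[str, int]]:
    """Return the alphabetic characters of a chunk, most frequent first,
    mirroring the shape of the DEBUG COMMON output."""
    counts = Counter(ch for ch in chunk.lower() if ch.isalpha())
    return counts.most_common()

# An Italian sample sentence, chosen only for illustration.
sample = "Apri la porta della baia di carico, HAL."
print(ordered_characters(sample))
```

Feeding each such ordering into the popularity comparison against every language's reference list is what produces the per-language scores the detector chooses from.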