ValueError: invalid literal for int() with base 10 #28

bzh4bzh · 2020-11-23T00:39:58Z

ValueError: 'invalid literal for int() with base 10: '"₪ץ'' (several different words get caught here)

function get_subtitles in api.py at line 11
v.run_ocr(lang, time_start, time_end, conf_threshold, use_fullframe)
function run_ocr in video.py at line 52
for i, data in enumerate(it_ocr)
function in video.py at line 52
for i, data in enumerate(it_ocr)
function __init__ in models.py at line 32
block_num, conf = int(block_num), int(conf)

MadHundred · 2021-01-25T10:02:37Z

same problem
python3.8
tesseract-ocr-w64-v5.0.0-alpha.20201127

HarryRudolph · 2021-01-25T10:12:44Z

I also had this same problem; it seemed to occur only when reading Hebrew. Could it be a right-to-left thing?

MadHundred · 2021-01-25T10:23:17Z

@HarryRudolph Yes, I think it's a problem with right-to-left languages.
Error log:
ValueError: invalid literal for int() with base 10: 'ارره'

ارره is a Persian word, so there seems to be a problem with RTL languages.

Code:
print(get_subtitles('video.mp4', lang='fas', sim_threshold=70, conf_threshold=65))

MadHundred · 2021-01-25T15:42:49Z

Debug output from models.py, with a print of word_data added:

            word_data = l.split()
            print(word_data)  # <-- this line added
            if len(word_data) < 12:

These are the last lines printed before the error:

['4', '1', '1', '1', '2', '0', '607', '76', '111', '74', '-1']

['5', '1', '1', '1', '2', '1', '607', '76', '111', '97', '20', '4']

['4', '1', '1', '1', '3', '0', '217', '169', '486', '71', '-1']

['5', '1', '1', '1', '3', '1', '306', '162', '212', '78', '1', 'لارنج']

['5', '1', '1', '1', '3', '2', '191', '189', '100', '51', '0', 'ارره', '\u200f']

The program crashes when word_data has 13 columns instead of 12.
So I added a skip for rows with more than 12 columns:

if len(word_data) > 12:
    continue

The program then runs to the end, but the final result is just half a line.
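Skipping every 13-column row discards the recognized word along with the stray token, which may explain the truncated result. A minimal sketch of an alternative, assuming the extra column is always a standalone directional mark like the '\u200f' in the row above (the helper name drop_bidi_tokens is hypothetical, not part of the library):

```python
# Unicode directional marks that may show up as separate tokens:
# U+200E LEFT-TO-RIGHT MARK and U+200F RIGHT-TO-LEFT MARK.
BIDI_MARKS = "\u200e\u200f"


def drop_bidi_tokens(word_data):
    """Return word_data without tokens that consist only of directional marks.

    Hypothetical helper: a token is kept if anything remains after
    stripping the marks from both ends.
    """
    return [tok for tok in word_data if tok.strip(BIDI_MARKS)]


# The 13-column row from the debug output above.
row = ['5', '1', '1', '1', '3', '2', '191', '189', '100', '51', '0',
       'ارره', '\u200f']
cleaned = drop_bidi_tokens(row)
print(len(cleaned))  # 12 columns again, with the word in the last column
```

With the mark removed, the row has the usual 12 columns and the word is kept instead of skipped. Rows that gain extra columns for other reasons would still need separate handling.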

PlaylistsTrance · 2021-05-23T22:35:56Z

In models.py, replace line 32: block_num, conf = int(block_num), int(conf) with block_num, conf = int(block_num), int(float(conf)).
The issue is that conf is a string of a float value, which int() is not able to convert. By doing float(conf), the float value string is correctly converted into a float, which is able to be converted to an int with int().
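The difference between the two conversions can be seen in isolation. A minimal sketch (the helper name conf_to_int and the sample value '96.5' are illustrative assumptions, not taken from the library or from this thread's logs):

```python
def conf_to_int(conf):
    """Convert a confidence string such as '96' or '96.5' to an int.

    float() accepts both integer and decimal strings; int() then
    truncates toward zero.
    """
    return int(float(conf))


try:
    int("96.5")  # int() rejects a decimal string
except ValueError as exc:
    print("int() failed:", exc)

print(conf_to_int("96.5"))
print(conf_to_int("-1"))
```

This only helps when conf really holds a number. If conf holds OCR text, float() raises ValueError as well.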

HarryRudolph · 2021-05-25T09:08:40Z

@PlaylistsTrance Your solution leads to this error:

block_num, conf = int(block_num), int(float(conf))
ValueError: could not convert string to float: 'שם'

It seems that for some reason the OCRed text is being stored in conf? I am assuming this is incorrect and that conf should be storing an integer/float representing percentage confidence.

The assignment in line 31 of models.py is maybe getting confused with the right to left text?
_, _, block_num, *_, conf, text = word_data
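This can be checked directly against the rows printed earlier in the thread. A minimal sketch (the helper name unpack is hypothetical; the unpacking statement and the two rows are copied from the comments above):

```python
def unpack(word_data):
    """Apply the same star-unpacking as line 31 of models.py."""
    _, _, block_num, *_, conf, text = word_data
    return block_num, conf, text


# 12 columns: conf and text land where expected.
row_12 = ['5', '1', '1', '1', '3', '1', '306', '162', '212', '78', '1',
          'لارنج']
# 13 columns: the trailing '\u200f' pushes everything one slot to the left.
row_13 = ['5', '1', '1', '1', '3', '2', '191', '189', '100', '51', '0',
          'ارره', '\u200f']

print(unpack(row_12))  # conf is '1', text is the word
print(unpack(row_13))  # conf is the word, text is '\u200f'
```

Because conf and text are taken from the end of the list, any extra trailing column shifts the OCR text into conf, which matches both error messages reported here.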

MadHundred · 2021-05-25T15:13:14Z

@HarryRudolph
I've checked the parameters returned by Tesseract, and there seem to be just two problems:

Problem 1:
For RTL languages we get one extra parameter that indicates the text is RTL, so some word_data rows have 13 parameters instead of 12.
Adding these lines after the check if len(word_data) < 12: continue (the "no word is predicted" case) solves this.

            if len(word_data) == 13:
                _, _, block_num, *_, conf, text, _ = word_data
            else:
                _, _, block_num, *_, conf, text = word_data

Problem 2:
Some lines have a confidence value that is a float, or a string holding a float, which raises invalid literal for int() with base 10.
To solve this I added a helper (is_float), placed after __init__, that checks whether conf can be parsed as a float:

        def is_float(value):
            # Return True if value can be parsed as a float.
            try:
                float(value)
                return True
            except (TypeError, ValueError):
                return False

Then replace block_num, conf = int(block_num), int(conf) with the code below:

            if is_float(conf):
                block_num, conf = int(block_num), int(float(conf))
            else:
                block_num, conf = int(block_num), int(conf)

Result:
The program runs without any errors, but I've only tested it with Arabic and Persian. Tesseract doesn't seem to OCR them well, so the result is not what I want.
Please test it on other languages such as Hebrew and give feedback.
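Both problems can also be handled in one place by reading columns by position instead of unpacking from the end. A minimal sketch, assuming the usual 12-column Tesseract TSV layout (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text); the function name parse_word_row is hypothetical and this has not been tested against the library itself:

```python
def parse_word_row(word_data):
    """Return (block_num, conf, text), or None when the row has no word.

    Hypothetical helper. Columns are read by index, so extra trailing
    tokens cannot shift conf or text.
    """
    if len(word_data) < 12:
        return None  # no word is predicted on this row
    block_num = int(word_data[2])
    # conf may be an integer string or a float string.
    conf = int(float(word_data[10]))
    # Everything from column 11 onward is treated as text; directional
    # marks (U+200E, U+200F) and spaces are stripped from both ends.
    text = " ".join(word_data[11:]).strip(" \u200e\u200f")
    return block_num, conf, text


row_13 = ['5', '1', '1', '1', '3', '2', '191', '189', '100', '51', '0',
          'ارره', '\u200f']
print(parse_word_row(row_13))
```

Indexing from the front keeps conf in column 10 regardless of how many tokens follow, so no special case for 13 columns is needed.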

This was referenced Nov 15, 2021

Runtime error (运行错误) #17

Open

No results after running (執行後沒有結果) #14

Open
