ValueError: invalid literal for int() with base 10 #28

Open
bzh4bzh opened this issue Nov 23, 2020 · 7 comments

@bzh4bzh

bzh4bzh commented Nov 23, 2020

ValueError: invalid literal for int() with base 10: '"₪ץ' (several different words trigger this)

function get_subtitles in api.py at line 11
v.run_ocr(lang, time_start, time_end, conf_threshold, use_fullframe)
function run_ocr in video.py at line 52
for i, data in enumerate(it_ocr)
function in video.py at line 52
for i, data in enumerate(it_ocr)
function __init__ in models.py at line 32
block_num, conf = int(block_num), int(conf)

@MadHundred

MadHundred commented Jan 25, 2021

Same problem here.
Python 3.8
tesseract-ocr-w64-v5.0.0-alpha.20201127

@HarryRudolph

I had the same problem, and it seemed to happen only when reading Hebrew. Could it be a right-to-left issue?

@MadHundred

MadHundred commented Jan 25, 2021

@HarryRudolph Yes, I think it's a problem with right-to-left languages.
Error log:
ValueError: invalid literal for int() with base 10: 'ارره'

ارره is a Persian word, so the problem does seem to be with RTL languages.

Code:
print(get_subtitles('video.mp4', lang='fas', sim_threshold=70, conf_threshold=65))

@MadHundred

MadHundred commented Jan 25, 2021

Debug output from models.py, with a print(word_data) call added:

            word_data = l.split()
            print(word_data)  # <-- this line added
            if len(word_data) < 12:

These are the last lines printed before the error:

['4', '1', '1', '1', '2', '0', '607', '76', '111', '74', '-1']

['5', '1', '1', '1', '2', '1', '607', '76', '111', '97', '20', '4']

['4', '1', '1', '1', '3', '0', '217', '169', '486', '71', '-1']

['5', '1', '1', '1', '3', '1', '306', '162', '212', '78', '1', 'لارنج']

['5', '1', '1', '1', '3', '2', '191', '189', '100', '51', '0', 'ارره', '\u200f']

The program crashes when word_data has 13 columns instead of 12.
So I added a skip for rows with more than 12 columns:

if len(word_data) > 12:
    continue

The program now runs to the end, but the final result is only half a line.
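
Skipping every 13-column row drops the affected words, which may be why the result is truncated. A minimal sketch of an alternative, assuming each row has the 12 fields seen in the debug output above and that the recognized text is always the last one: cap the number of splits so that anything after the 11th field stays together as the text. The function name parse_row is hypothetical and not part of the project.

```python
def parse_row(line):
    """Split one whitespace-separated OCR row into (block_num, conf, text).

    Returns None when the row has no text field (fewer than 12 fields).
    """
    # maxsplit=11 yields at most 12 fields; the 12th keeps any inner spaces.
    fields = line.strip().split(maxsplit=11)
    if len(fields) < 12:
        return None
    block_num, conf, text = fields[2], fields[10], fields[11]
    return int(block_num), int(conf), text


# The row that produced 13 columns in the debug output above.
rtl_row = "5 1 1 1 3 2 191 189 100 51 0 ارره \u200f"
print(parse_row(rtl_row))
```

With this approach the trailing right-to-left mark ends up inside the text field instead of shifting conf, so it can be stripped afterwards if it is not wanted.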

@PlaylistsTrance

PlaylistsTrance commented May 23, 2021

In models.py, replace line 32: block_num, conf = int(block_num), int(conf) with block_num, conf = int(block_num), int(float(conf)).
The issue is that conf is a string holding a float value, which int() cannot convert directly. float(conf) parses the string into a float, which int() can then truncate to an integer.
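
A small illustration of the conversion described above. The value "96.5" is a made-up example of a float-formatted confidence string, and to_int_conf is a hypothetical helper name.

```python
def to_int_conf(conf):
    """Convert a confidence string such as '96.5' or '-1' to an int."""
    # float() accepts both integer and decimal strings; int() then truncates.
    return int(float(conf))


try:
    int("96.5")  # int() rejects strings that contain a decimal point
except ValueError as exc:
    print("int() alone fails:", exc)

print(to_int_conf("96.5"))
print(to_int_conf("-1"))
```

Note that int() truncates toward zero rather than rounding, which is usually acceptable for a confidence threshold check.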

@HarryRudolph

@PlaylistsTrance Your solution leads to this error:

block_num, conf = int(block_num), int(float(conf))
ValueError: could not convert string to float: 'שם'

It seems that for some reason the OCRed text is being stored in conf? I am assuming this is incorrect and that conf should be storing an integer/float representing percentage confidence.

The assignment on line 31 of models.py may be getting confused by the right-to-left text:
_, _, block_num, *_, conf, text = word_data
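
To see how that assignment behaves, here is a short sketch that applies the same star-unpacking pattern to a 12-element row and a 13-element row taken from the debug output earlier in this thread. The wrapper function unpack is hypothetical and exists only for the demonstration.

```python
row_12 = ['5', '1', '1', '1', '3', '1', '306', '162', '212', '78', '1', 'لارنج']
row_13 = ['5', '1', '1', '1', '3', '2', '191', '189', '100', '51', '0', 'ارره', '\u200f']


def unpack(word_data):
    # conf and text are always bound to the last two elements,
    # whatever the total length of the list is.
    _, _, block_num, *_, conf, text = word_data
    return block_num, conf, text


print(unpack(row_12))  # conf is the numeric string
print(unpack(row_13))  # conf is the OCRed word, text is the trailing mark
```

Because conf and text are taken from the end of the list, one extra trailing element shifts the OCRed word into conf, which is consistent with the ValueError shown above.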

@MadHundred

@HarryRudolph
I've checked the values returned by Tesseract, and there seem to be just two problems:

Problem 1:
For RTL languages we get one extra field that marks the text as RTL, so some word_data rows have 13 fields instead of 12.
Adding the following after the if len(word_data) < 12: continue check (the "no word is predicted" case) solves this:

            if len(word_data) == 13:
                _, _, block_num, *_, conf, text, _ = word_data
            else:
                _, _, block_num, *_, conf, text = word_data

Problem 2:
Some rows carry the confidence value as a float string, which triggers the invalid literal for int() with base 10 error.
To handle this I added a helper (is_float) after __init__ that checks whether conf can be parsed as a float:

        def is_float(value):
            # True if value can be parsed as a float, False otherwise.
            try:
                float(value)
                return True
            except (TypeError, ValueError):
                return False

Then replace block_num, conf = int(block_num), int(conf) with the code below:

            if is_float(conf):
                block_num, conf = int(block_num), int(float(conf))
            else:
                block_num, conf = int(block_num), int(conf)
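
The check and the two branches above can also be folded into a single helper. Anything float() rejects is unlikely to be accepted by int() either, so the else branch would probably still raise for a non-numeric conf. A minimal sketch, assuming a non-numeric conf should be reported as None so that the caller can skip the row; parse_conf is a hypothetical name, not part of the project.

```python
def parse_conf(value):
    """Return the confidence as an int, or None if value is not numeric."""
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        # ValueError: non-numeric text or 'nan'; OverflowError: 'inf'.
        return None


for sample in ["96.5", "-1", "שם"]:
    print(repr(sample), "->", parse_conf(sample))
```

A caller could then write conf = parse_conf(conf) and continue to the next row when the result is None.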

Result:
The program runs without errors. I've only tested this with Arabic and Persian, though, and Tesseract's OCR quality on them is poor, so the output is not what I want.
Please test it on other languages such as Hebrew and report back.

This was referenced Nov 15, 2021

4 participants