Tesseract adds similar characters in Japanese text (ambiguity management?) #1063
Comments
ShreeDevi suggested trying the new traineddata files.
I've added --oem 0 as suggested in #1060, and it no longer doubles the characters. But with this setting the system wrongly outputs the character ボ instead of ポ.
Image text: エンジンコンポーネント
OCR with tesseract using --oem=0:
level page_num block_num par_num line_num word_num left top width height conf text
OCR with tesseract using --oem=1:
level page_num block_num par_num line_num word_num left top width height conf text
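The column header above matches Tesseract's TSV output. To compare the two engine modes row by row, something like the following could pull out the recognized text together with its confidence. This is only a minimal sketch: it assumes the TSV has already been captured as a string, and the function name `word_confidences` is hypothetical.

```python
import csv
import io


def word_confidences(tsv_text):
    """Return (text, conf) pairs for every TSV row that carries recognized text."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters in OCR text literally
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if text:  # skip page/block/line rows, which have no text
            words.append((text, float(row["conf"])))
    return words
```

Running it on the output of both --oem settings and comparing the pairs side by side should show where the two engines disagree.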
--oem 0 is the conventional engine. As I understood #1060, he was only asking whether --oem 1 is used, not suggesting --oem 0. He suggested downloading and using the new traineddata from the best directory, uploaded some days ago: https://github.com/tesseract-ocr/tessdata/tree/master/best
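For anyone who wants to try that suggestion from a script, here is a rough sketch of calling the tesseract command line with --oem 1 and a custom tessdata directory. The helper names and the paths in the usage note are hypothetical. It assumes the `tesseract` binary is on the PATH and that a downloaded jpn.traineddata sits in the directory passed as `tessdata_dir`.

```python
import subprocess


def build_tesseract_cmd(image_path, tessdata_dir, oem=1, lang="jpn"):
    """Build the argument list: TSV to stdout, with the chosen engine mode."""
    return [
        "tesseract", image_path, "stdout",
        "--tessdata-dir", tessdata_dir,  # directory containing <lang>.traineddata
        "-l", lang,
        "--oem", str(oem),
        "tsv",  # config file that switches the output to TSV
    ]


def run_tesseract(image_path, tessdata_dir, oem=1, lang="jpn"):
    """Run tesseract and return its TSV output as a string."""
    cmd = build_tesseract_cmd(image_path, tessdata_dir, oem=oem, lang=lang)
    result = subprocess.run(
        cmd, capture_output=True, text=True, encoding="utf-8", check=True
    )
    return result.stdout
```

For example, `run_tesseract("sample.png", "/path/to/best", oem=1)` would use the downloaded file instead of the default installation.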
Hi, we have noticed that in Japanese text, tesseract doubles or even triples some characters, producing characters that are not in the text. Perhaps tesseract handles very similar characters by putting all of them in the output instead of choosing one. Is there a way to avoid this problem in the output?
Example:
text in image= エンジンコンポーネント
text read by tesseract= エンジンコンポボーネント
As you can see, the character ポ is transcribed as two characters, ポボ, maybe because both have a similarly high confidence score and tesseract does not decide which one to use but puts both in the text. Is there a way to avoid this error?
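Until the cause is found, the doubled characters can at least be detected after the fact. The sketch below is a post-processing heuristic, not something tesseract itself provides. It flags adjacent kana that share the same base character but differ only in the voicing mark (dakuten vs. handakuten), as in ポボ. The function names are hypothetical, and it only reports positions rather than deleting anything, since such a pair could in principle be legitimate.

```python
import unicodedata

DAKUTEN = "\u3099"     # combining voiced sound mark (as in ボ)
HANDAKUTEN = "\u309a"  # combining semi-voiced sound mark (as in ポ)


def split_kana(ch):
    """Split a kana into (base, mark); mark is '' if there is no voicing mark."""
    decomposed = unicodedata.normalize("NFD", ch)
    if len(decomposed) == 2 and decomposed[1] in (DAKUTEN, HANDAKUTEN):
        return decomposed[0], decomposed[1]
    return ch, ""


def find_voicing_doubles(text):
    """Return (index, pair) for neighbours with the same base but different marks."""
    hits = []
    for i in range(len(text) - 1):
        base_a, mark_a = split_kana(text[i])
        base_b, mark_b = split_kana(text[i + 1])
        if base_a == base_b and mark_a and mark_b and mark_a != mark_b:
            hits.append((i, text[i:i + 2]))
    return hits
```

Applied to the example above, it flags the ポボ pair in the tesseract output and nothing in the original image text.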