Update Tesseract to the more recent version #3

lukaszgo1 · 2020-04-19T21:02:48Z

The version of Tesseract bundled in the add-on is quite old. The main advantages of upgrading are better recognition results and support for more languages.
It would be good to upgrade to Tesseract 5 (even though it is in the alpha stage, it is even better than v4).
Implementation details:

A working binary version of Tesseract needs to be found.
As the add-on would be too big if all languages were included, we need to create a downloader for them. It needs to present a checkable list of all available languages, minus those that are currently installed. This list would be invoked from the OCR settings panel. The OK button needs to be disabled until something in the list is selected, and disabled again if nothing in the list is selected. We also need to be able to request a language download when the user tries to recognize something and the configured language is not found, as can happen after an upgrade from a previous installation.
Only English should be bundled by default.
As languages have to be stored inside the add-on directory (changing the location of the language files requires either recompiling Tesseract or setting an environment variable at each Tesseract run, according to https://stackoverflow.com/questions/6950977/tesseract-change-language-file-location), we need to move language files during add-on upgrade as part of installTasks. However, we need to make sure that only language data for the version of Tesseract in use is moved, and not data from the older version.
deps.py needs to be updated to download the chosen version of Tesseract automatically during the build.
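
The downloader's list logic described above could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and it assumes that each installed language is a `<code>.traineddata` file in a single tessdata directory.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the language codes that have a .traineddata file in tessdata_dir.

    Assumption: one file per language, named <code>.traineddata.
    """
    return {path.stem for path in Path(tessdata_dir).glob("*.traineddata")}


def downloadable_languages(available, installed):
    """Languages to offer in the checkable list: all available minus installed."""
    return sorted(set(available) - set(installed))


def ok_button_enabled(checked):
    """The OK button is enabled only while at least one language is checked."""
    return len(checked) > 0
```

In a settings dialog, `ok_button_enabled` would be re-evaluated every time an item is checked or unchecked, so the button is disabled again when the last item is unchecked.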

If the user decides not to download the language set in the config, or if the download fails, English should be used as the recognition language.
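
A minimal sketch of this fallback rule, assuming a hypothetical `download` callable that returns True on success (passing `None` stands for the user declining the download):

```python
def resolve_language(configured, installed, download=None, fallback="eng"):
    """Pick the language to recognize with.

    configured: language code stored in the add-on configuration.
    installed:  collection of language codes currently available.
    download:   hypothetical callable(code) -> bool, or None if the user
                declined the download.
    """
    if configured in installed:
        return configured
    if download is not None:
        try:
            if download(configured):
                return configured
        except OSError:
            # A failed download is treated the same as a declined one.
            pass
    return fallback
```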

Related to:
nvaccess/nvda#5236
nvaccess/nvda#4706
nvaccess/nvda#4035

ABuffEr · 2020-04-20T09:01:40Z

Hi @lukaszgo1 ,
For working binaries, see here; they now offer a 5.x version, but I remember that I also found 4.x binaries there, which I used for an experimental (and unofficial) update of the add-on. Note that the page was linked from the old wiki, and you can find traces of it here, so it can be considered trustworthy, I think.
The page now offers two .exe installers, for 32-bit and 64-bit; for deps.py, however, it's useful to know that you can simply extract them with 7-Zip.
In addition, I don't know whether this still holds, but with the tesseract.exe 4.0 alpha version I was able to reduce the files included in the add-on to:

libgcc_s_sjlj-1.dll
libgif-4.dll
libgomp-1.dll
libjbig-2.dll
libjpeg-8.dll
liblept-5.dll
liblzma-5.dll
libopenjp2.dll
libpng16-16.dll
libstdc++-6.dll
libtesseract-4.dll
libtiff-5.dll
libwebp-5.dll
libwinpthread-1.dll
tesseract.exe
zlib1.dll

As you can see, they are still present in the current installer (I checked the 64-bit one only), with small changes to the names, so the chances are good, I hope.
About tessdata, I'm confused by the discussion: tesseract.exe --help-extra shows a --tessdata-dir parameter, which should be quite simple to manage.
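
As a rough sketch of how the --tessdata-dir parameter might be passed from the add-on, the argument list could be built like this. The function name and the paths are hypothetical, and the argument order follows the usual `tesseract imagename outputbase [options...]` usage; it should be checked against the bundled binary.

```python
def build_tesseract_command(exe, image, output_base, tessdata_dir, lang="eng"):
    """Build an argument list suitable for subprocess, without running anything.

    --tessdata-dir points Tesseract at the language files; -l selects the language.
    """
    return [
        exe,
        image,
        output_base,
        "--tessdata-dir", tessdata_dir,
        "-l", lang,
    ]
```

Passing the result to `subprocess.run` as a list avoids quoting problems with paths that contain spaces.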
Finally, I agree with the rest of your proposal, so, good work! :)

ABuffEr · 2020-04-20T13:15:41Z

Hi again,
thinking about it... I remember that, when updating from version 3.x to 4.0, I changed the output file in __init__.py to the .hocr extension (instead of .html), and then "ocr_word" to "ocrx_word".
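
To illustrate the class-name change, here is a small sketch that collects the text of word elements from hOCR output using only the standard library. It is not the add-on's actual parsing code; it assumes words are `<span>` elements carrying the given class, and the class name is a parameter so that either "ocr_word" or "ocrx_word" can be used.

```python
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect the text content of <span> elements that have a given class."""

    def __init__(self, word_class="ocrx_word"):
        super().__init__()
        self.word_class = word_class
        self.words = []
        self._depth = 0  # nesting depth inside the current word span
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Nested markup inside a word, e.g. <strong>.
            self._depth += 1
        elif tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            if self.word_class in classes:
                self._depth = 1
                self._chunks = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.words.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_words(hocr_text, word_class="ocrx_word"):
    """Return the recognized words found in an hOCR document."""
    parser = HocrWordParser(word_class)
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```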
In addition, the "-l" parameter supports multiple languages, with the syntax "-l lang1+lang2+..."; so it would be really great to have a list of all languages with checkboxes, where a checked language must have its tessdata file present (or downloaded) and is used during OCR, and an unchecked language is not used in OCR (regardless of whether its tessdata file is present). Yes, I know, quite annoying to do.
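
The value for "-l" could be derived from the checked languages along these lines. This is a sketch with hypothetical names; falling back to English when nothing usable is checked is an assumption borrowed from the original proposal.

```python
def build_language_argument(checked, installed, fallback="eng"):
    """Join the checked languages that are actually installed with '+'.

    Unchecked languages are ignored even if their tessdata file is present.
    """
    usable = [code for code in checked if code in installed]
    return "+".join(usable) if usable else fallback
```

For example, with "eng" and "pol" checked and both installed, the result is "eng+pol", which is then passed after "-l".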

lukaszgo1 added this to the 3.0 milestone Apr 19, 2020

lukaszgo1 mentioned this issue Apr 19, 2020

Automate build process #6

Open

lukaszgo1 removed this from the 3.0 milestone Jul 17, 2020
