Update Tesseract to the more recent version #3

Open
lukaszgo1 opened this issue Apr 19, 2020 · 2 comments

@lukaszgo1
Owner

The version of Tesseract bundled in the add-on is quite old. The main advantages of upgrading are better recognition results and support for more languages.
It would be good to upgrade to Tesseract 5 (even though it is still in the alpha stage, it is even better than v4).
Implementation details:

  • A working binary build of Tesseract needs to be found.
  • As the add-on would be too big if all languages were included, we need to create a downloader for them. It needs to present a checkable list of all available languages, minus those that are currently installed. This list would be invoked from the OCR settings panel. The OK button needs to be disabled until something on the list is checked, and disabled again if everything is unchecked. We also need to be able to offer the language download when the user tries to recognize something and the configured language is not found, as can happen after an upgrade from a previous installation.
  • Only English should be bundled by default.
  • As languages have to be stored inside the add-on directory (changing the location of language files requires either recompiling Tesseract or setting an environment variable at each Tesseract run, according to https://stackoverflow.com/questions/6950977/tesseract-change-language-file-location), we need to move the language files during add-on upgrade as part of installTasks. However, we need to make sure that only language data for the version of Tesseract in use is moved, and not data from the older version.
  • deps.py needs to be updated to download the chosen version of Tesseract automatically during the build.
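
The list and button logic for the downloader could be sketched roughly like this. It is only an illustration: the function names are hypothetical, and detecting installed languages by looking for `*.traineddata` files in a tessdata folder is an assumption, not something taken from the add-on's code.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    return {p.stem for p in Path(tessdata_dir).glob("*.traineddata")}


def downloadable_languages(available, installed):
    """Languages to offer in the list: everything available minus what is installed."""
    return sorted(set(available) - set(installed))


def ok_button_enabled(checked):
    """The OK button is enabled only while at least one language is checked."""
    return len(checked) > 0
```

In a real settings panel, `ok_button_enabled` would be re-evaluated every time a checkbox on the list changes state.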

If the user decides not to download the language set in the config, or if the download fails, English should be used as the recognition language.
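
A minimal sketch of that fallback, under stated assumptions: `recognition_language` and the `download` callable are hypothetical names, and a failed download is assumed to surface either as a `False` return value or as an `OSError`. Only the language code `eng` is Tesseract's own.

```python
DEFAULT_LANGUAGE = "eng"  # English, the only language bundled by default


def recognition_language(configured, installed, download=None):
    """Pick the language to pass to Tesseract.

    download is an optional, hypothetical callable that tries to fetch a
    language and returns True on success. None means the user declined.
    """
    if configured in installed:
        return configured
    if download is not None:
        try:
            if download(configured):
                return configured
        except OSError:
            pass  # treat a failed download like a declined one
    return DEFAULT_LANGUAGE
```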

Related to:
nvaccess/nvda#5236
nvaccess/nvda#4706
nvaccess/nvda#4035

@lukaszgo1 lukaszgo1 added this to the 3.0 milestone Apr 19, 2020
@ABuffEr

ABuffEr commented Apr 20, 2020

Hi @lukaszgo1 ,
For working binaries, see here; they now offer 5.x versions, but I remember also finding 4.x binaries there, which I used for an experimental (and unofficial) update of the add-on. Note that the page was linked from the old wiki, and you can still find traces of it here, so I think it can be considered reliable.
The page now lists two .exe installers, for 32 and 64-bit; for deps.py, however, it's useful to know that you can simply extract them with 7zip.
In addition, I don't know if it's still the same now, but with the tesseract.exe 4.0 alpha version I was able to reduce the files included in the add-on to:

  • libgcc_s_sjlj-1.dll
  • libgif-4.dll
  • libgomp-1.dll
  • libjbig-2.dll
  • libjpeg-8.dll
  • liblept-5.dll
  • liblzma-5.dll
  • libopenjp2.dll
  • libpng16-16.dll
  • libstdc++-6.dll
  • libtesseract-4.dll
  • libtiff-5.dll
  • libwebp-5.dll
  • libwinpthread-1.dll
  • tesseract.exe
  • zlib1.dll

As you can see, they are still present in the current installer (I checked w64 only), with small changes in the names, so I hope there is a good chance this still works.
About tessdata, I'm confused by the discussion: tesseract.exe --help-extra shows a --tessdata-dir parameter, which should be quite simple to manage.
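
For illustration, a command line using that parameter could be assembled as below. This is a sketch, not the add-on's code: `build_tesseract_command` and the example paths are hypothetical, and it assumes the usual `tesseract imagename outputbase [options] [configfile]` argument order, with the `hocr` config file requesting hOCR output.

```python
def build_tesseract_command(exe, image, output_base, tessdata_dir, lang="eng"):
    """Build an argument list that points Tesseract at a custom tessdata folder."""
    return [
        exe,
        image,
        output_base,
        "--tessdata-dir", tessdata_dir,  # where the .traineddata files live
        "-l", lang,                      # recognition language(s)
        "hocr",                          # config file requesting hOCR output
    ]
```

The resulting list can be passed to `subprocess.run` without shell quoting, for example `subprocess.run(build_tesseract_command("tesseract.exe", "page.png", "out", r"C:\data\tessdata"), check=True)`.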
Finally, I agree with the rest of your proposal, so, good work! :)

@ABuffEr

ABuffEr commented Apr 20, 2020

Hi again,
thinking about it... I remember that, when updating from version 3.x to 4.0, in init.py I changed the output file to the .hocr extension (instead of .html), and then "ocr_word" to "ocrx_word".
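
As a rough sketch of what reading words from such a .hocr file could look like (not the code from init.py; `HocrWordParser` and `hocr_words` are hypothetical names, and only the standard library's `html.parser` is used):

```python
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect the text of every element whose class includes ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0   # nesting depth inside the current ocrx_word element
        self._parts = []  # text fragments of the word being read

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1  # nested tag inside a word, e.g. <strong>
        elif "ocrx_word" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._parts = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.words.append("".join(self._parts).strip())

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def hocr_words(hocr_text):
    """Return the recognized words found in an hOCR document, in order."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```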
In addition, the "-l" parameter supports multiple languages, with the syntax "-l lang1+lang2+..."; so it would be really great to have a list of all languages with checkboxes, where a checked language must have its tessdata file present (or downloaded) and is used during OCR, while an unchecked language is not used in OCR (regardless of whether its tessdata file is present). Yes, I know, quite annoying to do.
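
The checked-list idea could map onto the "-l" value roughly like this. It is a sketch with a hypothetical function name; it simply skips checked languages whose tessdata is missing, whereas a real implementation would presumably offer to download them first. The fallback to `eng` follows the original proposal.

```python
def language_argument(checked, installed):
    """Join the checked languages that have tessdata present into a "-l" value.

    Unchecked languages are never used, even if their tessdata is installed.
    """
    usable = [lang for lang in checked if lang in installed]
    return "+".join(usable) if usable else "eng"
```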

@lukaszgo1 lukaszgo1 removed this from the 3.0 milestone Jul 17, 2020