
jitesoft/docker-tesseract-ocr


Tesseract OCR.


Tesseract OCR - Ubuntu and Alpine Linux images.

Tesseract and Leptonica are both built from source for each platform and distro. Supported platforms are amd64 (x86_64) and arm64 (aarch64).

Tags

The version part of a tag indicates the OS version (or the name, in the case of Alpine). Images with the 4- prefix use Tesseract version 4, while images without the prefix use version 5.

All versions use the same training data.
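
As a rough illustration of the naming rule above, the following shell sketch composes a tag from a Tesseract major version and an OS part. The function name `tesseract_tag` is hypothetical, and the OS part (for example `alpine`) is an assumption taken from the description; check the registry for the tags that are actually published.

```shell
# tesseract_tag MAJOR OS_PART
# Hypothetical helper: prints the tag implied by the naming rule.
# Tesseract 4 images carry a "4-" prefix, version 5 images have no prefix.
tesseract_tag() {
  major=$1
  os_part=$2
  case "$major" in
    4) printf '4-%s\n' "$os_part" ;;
    5) printf '%s\n' "$os_part" ;;
    *) echo "unsupported tesseract major version: $major" >&2
       return 1 ;;
  esac
}
```

With these assumptions, `tesseract_tag 4 alpine` prints `4-alpine` and `tesseract_tag 5 alpine` prints `alpine`.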

Images can be found at:

  • Docker hub: jitesoft/tesseract-ocr
  • GitLab: registry.gitlab.com/jitesoft/dockerfiles/tesseract
  • GitHub: ghcr.io/jitesoft/tesseract
  • Quay: quay.io/jitesoft/tesseract
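
The registries listed above can be wrapped in a small helper when a script needs to switch between them. This is only a sketch: the function names and the short registry keys (`dockerhub`, `gitlab`, `github`, `quay`) are made up for illustration, while the image paths are the ones from the list. The `DOCKER` variable is an assumed override that makes it easy to dry-run the command.

```shell
# tesseract_image [REGISTRY]
# Prints the image path for one of the registries listed above.
tesseract_image() {
  case "${1:-dockerhub}" in
    dockerhub) echo "jitesoft/tesseract-ocr" ;;
    gitlab)    echo "registry.gitlab.com/jitesoft/dockerfiles/tesseract" ;;
    github)    echo "ghcr.io/jitesoft/tesseract" ;;
    quay)      echo "quay.io/jitesoft/tesseract" ;;
    *) echo "unknown registry: $1" >&2
       return 1 ;;
  esac
}

# pull_tesseract [REGISTRY]
# Pulls the image from the chosen registry (Docker Hub by default).
pull_tesseract() {
  image=$(tesseract_image "${1:-}") || return 1
  "${DOCKER:-docker}" pull "$image"
}
```

For example, `pull_tesseract quay` pulls from Quay, and `DOCKER=echo pull_tesseract quay` only prints the arguments that would be passed to docker.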

Dockerfile

The Dockerfile can be found at GitLab or GitHub.

Training and languages

The default image has the English training data installed from the start. The training data used is the "fast" data. It parses quicker but not at the best quality.
It's possible to add another language by invoking the train-lang script followed by the language code (ISO 639-2: eng, swe, etc.). If you wish to use fast or best, add that as an optional parameter after the language code (train-lang eng --fast); otherwise the standard data is used when no extra argument is given.
This is easily done in a derived image:

FROM jitesoft/tesseract-ocr
RUN train-lang bul --fast

The languages are downloaded from the official tesseract tessdata repositories.
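
When several languages are needed, the derived image can be generated rather than written by hand. The sketch below prints a Dockerfile with one train-lang line per language. The function name `gen_dockerfile` is hypothetical, and the `--best` flag is inferred from the description of the optional parameter; only `--fast` appears verbatim in this README.

```shell
# gen_dockerfile VARIANT LANG...
# VARIANT is fast, best, or standard (standard passes no extra argument).
# Prints a Dockerfile that installs every listed language.
gen_dockerfile() {
  variant=$1
  shift
  case "$variant" in
    fast|best) flag=" --$variant" ;;  # --best is an assumption
    standard)  flag="" ;;
    *) echo "variant must be fast, best, or standard" >&2
       return 1 ;;
  esac
  echo "FROM jitesoft/tesseract-ocr"
  for lang in "$@"; do
    echo "RUN train-lang ${lang}${flag}"
  done
}
```

The output can be piped straight into a build, for example `gen_dockerfile fast swe bul | docker build -t my-tesseract -`, where `my-tesseract` is a placeholder name.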

For a full list of supported languages check the following links:

https://github.com/tesseract-ocr/tessdata
https://github.com/tesseract-ocr/tessdata_best
https://github.com/tesseract-ocr/tessdata_fast

It is also possible to just copy a traineddata file to the /usr/local/share/tessdata (/usr/share/tessdata on alpine) directory of the container.
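
Because the tessdata directory differs between the two distros, a script that copies a traineddata file has to pick the right path. A minimal sketch, assuming the caller knows which distro the chosen image is based on; the function names and the `ubuntu`/`alpine` keys are illustrative only.

```shell
# tessdata_dir DISTRO
# Prints the tessdata directory documented above for the given distro.
tessdata_dir() {
  case "$1" in
    ubuntu) echo "/usr/local/share/tessdata" ;;
    alpine) echo "/usr/share/tessdata" ;;
    *) echo "unknown distro: $1" >&2
       return 1 ;;
  esac
}

# copy_traineddata_dockerfile IMAGE DISTRO FILE
# Prints a Dockerfile that copies FILE into the image's tessdata directory.
copy_traineddata_dockerfile() {
  dir=$(tessdata_dir "$2") || return 1
  echo "FROM $1"
  echo "COPY $3 $dir/"
}
```

The traineddata file must be inside the build context for the COPY instruction to find it.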

Example execution

docker pull jitesoft/tesseract-ocr
docker run -v /path/to/image/img.jpg:/tmp/img.jpg jitesoft/tesseract-ocr /tmp/img.jpg stdout

Use a high-DPI image for the best result. Note that a higher DPI also increases the run time.
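
The run command above can be wrapped in a function that also selects a language and passes a DPI hint. This is a sketch, not part of the image: the name `ocr_image` and the `DOCKER` override are made up, and it assumes that arguments after the image name are forwarded to tesseract, as in the example above. `-l` and `--dpi` are standard tesseract options; `--dpi` only tells tesseract the resolution of the input and does not resample the image.

```shell
# ocr_image FILE [LANG] [DPI]
# FILE should be an absolute path, since docker treats a relative
# source in -v as a named volume. LANG defaults to eng.
ocr_image() {
  file=$1
  lang=${2:-eng}
  dpi=${3:-}
  name=$(basename "$file")
  # Mount the file read-only and print the recognized text to stdout.
  set -- run --rm -v "$file:/tmp/$name:ro" jitesoft/tesseract-ocr \
    "/tmp/$name" stdout -l "$lang"
  if [ -n "$dpi" ]; then
    set -- "$@" --dpi "$dpi"
  fi
  "${DOCKER:-docker}" "$@"
}
```

For example, `ocr_image /path/to/image/img.jpg swe 300` runs OCR with the Swedish data, provided that language has been installed in the image. Setting `DOCKER=echo` prints the arguments instead of running the container.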

Image labels

This image follows the Jitesoft image label specification 1.0.0.

Licenses

The images and scripts in the repository are released under the MIT license.
Tesseract is released under the Apache License v2.

Notice: The Tesseract source has been modified with a patch (alpine/tess.patch) to allow for compilation on Alpine Linux.

Sponsors

Jitesoft images are built via GitLab CI on runners hosted by the following wonderful organisations:

Oregon State University - Open Source Lab

The companies above are not affiliated with Jitesoft or any Jitesoft Projects directly.


Sponsoring is vital for the further development and maintenance of open source.
Questions and sponsoring queries can be made by email.
If you wish to sponsor our projects, reach out to the email above or visit any of the following sites:

Open Collective
GitHub Sponsors
Patreon