Script and systemd service file that watch multiple folders for newly arriving PDF files and process them with OCR.

ScanOCR Script

An automatic OCR (Optical Character Recognition) script for newly added PDF files. It uses OCRmyPDF for text recognition and inotify-tools to detect new files.

Prerequisites

  • Linux (e.g., Ubuntu)
  • OCRmyPDF
  • inotify-tools

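Install prerequisites on Ubuntu (tesseract-ocr-deu provides the German language data for Tesseract; substitute the package for your language if needed):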

sudo apt-get -y install ocrmypdf inotify-tools tesseract-ocr-deu

Setup

  1. Clone the repository and copy files:
git clone https://github.com/efnats/scanocr.git
cd scanocr
sudo cp ./scanocr.sh /usr/local/bin
sudo chmod +x /usr/local/bin/scanocr.sh
sudo cp ./scanocr.service /etc/systemd/system/
  2. Edit /etc/systemd/system/scanocr.service and adjust the file paths to match your setup.

  3. Set up the service:

sudo systemctl daemon-reload
sudo systemctl enable scanocr.service
sudo systemctl start scanocr.service
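
Once the service has been started, it can help to confirm that it is actually running. The following is a minimal sketch, not part of the repository; the function name check_service is hypothetical, and it relies only on the standard systemctl and journalctl commands.

```shell
# Sketch: report whether a systemd unit is running and, if it is not,
# print its most recent log lines. check_service is a hypothetical helper.
check_service() {
    unit="$1"
    if systemctl is-active --quiet "$unit"; then
        echo "$unit is running"
    else
        echo "$unit is not running" >&2
        # Show the last 20 journal entries for the unit to aid debugging
        journalctl -u "$unit" -n 20 --no-pager >&2
        return 1
    fi
}
```

For example, check_service scanocr.service can be run right after the start command; journalctl -u scanocr.service -f follows the log live while a test PDF is dropped into a watched folder.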

Operation

The script monitors the configured directories for new PDF files. Each new file is renamed with a timestamp, run through OCR, and moved to a processed directory; the original is then deleted.
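
The workflow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of scanocr.sh: the function names, the timestamp format, and the use of the German language pack (-l deu) are all hypothetical choices. The sketch applies the timestamp to the output file name rather than renaming the input in place, so the rename does not trigger the watcher a second time; the output directory is assumed not to be one of the watched directories.

```shell
# Sketch only: names and timestamp format are assumptions, not taken from scanocr.sh.

# Build a timestamped file name such as 20240101-120000_scan.pdf
timestamped_name() {
    printf '%s_%s\n' "$(date +%Y%m%d-%H%M%S)" "$(basename "$1")"
}

# OCR one PDF into the processed directory, then delete the original.
process_pdf() {
    src="$1"
    out_dir="$2"
    dest="$out_dir/$(timestamped_name "$src")"
    # Remove the input only if OCR succeeded, so a failed run loses nothing
    if ocrmypdf -l deu "$src" "$dest"; then
        rm -- "$src"
    else
        echo "OCR failed for $src" >&2
        return 1
    fi
}

# Watch one or more directories; the first argument is the output directory.
# close_write and moved_to fire once a file is completely written or moved in.
watch_dirs() {
    out_dir="$1"
    shift
    inotifywait -m -q -e close_write -e moved_to --format '%w%f' "$@" |
    while IFS= read -r path; do
        case "$path" in
            *.pdf|*.PDF) process_pdf "$path" "$out_dir" ;;
        esac
    done
}
```

A hypothetical invocation would be watch_dirs /srv/scans/processed /srv/scans/inbox1 /srv/scans/inbox2, where all three paths are placeholders.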

Note on the Service File

The service file contains the directive OOMScoreAdjust=-1000, which prevents the Out of Memory (OOM) killer from targeting the scanocr service. This matters most when the service runs in an LXC container with limited RAM (e.g., 500 MB). If the system disk is fast, consider raising swap to 1 GB to provide additional virtual memory and avoid OOM situations.
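
To check that the OOM setting has actually been applied to the running service, the kernel exposes the effective value under /proc. The helper below is a small sketch; the function name oom_score_adj_of is hypothetical.

```shell
# Sketch: print the kernel's OOM score adjustment for a given process ID.
# oom_score_adj_of is a hypothetical helper; /proc/<pid>/oom_score_adj is a
# standard Linux interface.
oom_score_adj_of() {
    cat "/proc/$1/oom_score_adj"
}
```

With the service running, oom_score_adj_of "$(systemctl show -p MainPID --value scanocr.service)" should print -1000 if the directive is in effect.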
