Welcome to Pyextract! Pyextract is a powerful tool for extracting text from images using Tesseract OCR, and it's designed to work with a variety of languages.
Supports multiple languages, with the option to specify the desired language for text extraction. Option to save extracted data to a specified location. Ability to use images from the clipboard or a specific directory for text extraction. Prints the result to the console.
Before you begin, ensure you have the following installed on your system:
- Python 3.x
- Git
- Homebrew (for macOS users)
- Tesseract OCR (Installation Guide below)
- Tessdata for more language support
Clone the Repository:
git clone https://github.com/AdrianSchlegel/Pyextract.git
- Install Tesseract OCR:
- On macOS, use Homebrew:
brew install tesseract
For other operating systems, please follow the appropriate instructions to install Tesseract OCR.
- Install Tessdata:
Tessdata is required for language support in Tesseract. Install it using Homebrew:
brew install tessdata
- Configuration Edit the configuration settings according to your needs:
language: Set to 'eng' for english. For a list of available languages, see: Tesseract OCR Languages save_data: Set to False if you don't want to save the extracted data. If True, specify the save_location. save_location: The directory where extracted data will be saved. Only relevant if save_data is True. use_clipboard_for_input: Set to True to use images from the clipboard for text extraction. If False, the script will use images from the path_to_directory. path_to_directory: The directory containing images for text extraction. This is used if use_clipboard_for_input is False. print_result: Set to True to print the extracted text to the console. Usage After configuring, run the Pyextract script. If use_clipboard_for_input is True, copy an image to your clipboard, and the script will attempt to extract text from it. If it's False, the script will use images from the specified directory.
Notes It's essential to ensure that Tesseract OCR and Tessdata are correctly installed for the script to work. The language data for Tesseract needs to be compatible with the version of Tesseract you have installed.
Pyextract is a versatile tool that leverages the power of Tesseract OCR (Optical Character Recognition) to extract text from images. It's designed to be user-friendly and efficient, providing two primary modes of operation: extracting text from images on the clipboard and processing multiple images from a folder. Here's how each mode works:
The user copies an image to the clipboard. This can be any image that contains text they want to extract.
When Pyextract is executed, it checks the clipboard for any images.
If an image is found on the clipboard, Pyextract uses Tesseract OCR to analyze the image and extract the text.
The extracted text is then placed back onto the clipboard, replacing the original image. This allows the user to easily paste the extracted text into any text field or document.
If the save_data configuration is set to True, the extracted text will also be saved to the specified save_location.
The user places multiple images in a specified folder (path_to_directory). These images are the ones from which they want to extract text.
Pyextract is executed with use_clipboard_for_input set to False.
Pyextract goes through each image in the specified folder, using Tesseract OCR to extract text from each one.
For each image, Pyextract creates a corresponding .txt file containing the extracted text. These text files are typically saved in the same folder as the images or in the specified save_location if save_data is True.
If print_result is set to True, the extracted text from each image is also printed to the console.
- You have all prerequisites installed correctly
- Make sure you have chmod +x pyextract allowed
- make sure the script exists in the $PATH
Happy text extracting with Pyextract!