This project extracts and processes bank statements using OCR and several Python libraries to produce a usable Pandas dataframe. The dataframe can then be used for expenditure analysis and tracking.
Currently, this parser works only for HDFC Bank and Kotak Mahindra Bank statements.
Before you can run this project, you need to install some system dependencies and Python libraries.
On Debian/Ubuntu Linux:
- Update the package list:
  sudo apt-get update
- Install Poppler-utils:
  sudo apt-get install -y poppler-utils
- Install Tesseract-OCR:
  sudo apt-get install -y tesseract-ocr
- Install the required Python libraries:
  pip install -r requirements.txt
On Windows:
- Install Poppler:
  - Download Poppler for Windows from http://blog.alivate.com.au/poppler-windows/.
  - Extract the downloaded zip file and add the bin folder to your system PATH.
- Install Tesseract-OCR:
  - Download the Tesseract installer from https://github.com/UB-Mannheim/tesseract/wiki.
  - Run the installer and add Tesseract to your system PATH.
- Install the required Python libraries:
  pip install -r requirements.txt
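Once the installation steps are done, it can be useful to confirm that the system tools are actually reachable from Python. The snippet below is a minimal sketch, not part of this project: `find_missing_tools` and `REQUIRED_TOOLS` are hypothetical names, and it assumes the project relies on the `pdftoppm` executable that ships with Poppler (the instructions above only say that Poppler is required).

```python
import shutil

# Executables assumed to be needed (an assumption, not confirmed by this project):
# `pdftoppm` ships with Poppler, `tesseract` is the Tesseract-OCR binary.
REQUIRED_TOOLS = ("pdftoppm", "tesseract")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on the system PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All system dependencies were found on PATH.")
```

If a tool is reported as missing, re-check the PATH changes described in the installation steps.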
In your Python script, you might need to set the path to the Tesseract executable if it's not in your system PATH.
import pytesseract
# Example for Windows
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# Example for Linux (if not in default path)
# pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'
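If the same script has to run on both Windows and Linux, the path can be chosen at runtime instead of being edited by hand. The following is a rough sketch under stated assumptions: `resolve_tesseract_cmd` is a hypothetical helper, and the two fallback paths are simply the example locations shown above, which may differ on your machine.

```python
import shutil
import sys

# Fallback locations copied from the examples above; adjust to your install.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
LINUX_DEFAULT = "/usr/bin/tesseract"


def resolve_tesseract_cmd(platform=sys.platform):
    """Prefer a tesseract found on PATH, else fall back to a per-OS default."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    return WINDOWS_DEFAULT if platform.startswith("win") else LINUX_DEFAULT


try:
    import pytesseract
except ImportError:
    pytesseract = None  # pytesseract is not installed; nothing to configure

if pytesseract is not None:
    pytesseract.pytesseract.tesseract_cmd = resolve_tesseract_cmd()
```

The helper does not verify that the fallback path exists, so a wrong default would only show up when OCR is first invoked.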
Set the filePath variable to the path of the bank statement you want to process.
filePath = "samples/Statement April-Aug 2021.pdf"
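A mistyped path is an easy mistake to make here, so a small check before processing may save a confusing error later. This is an illustrative sketch only: `check_statement_path` is a hypothetical helper that is not part of this project, and it assumes statements are always supplied as PDF files.

```python
from pathlib import Path


def check_statement_path(file_path):
    """Return file_path as a Path, or raise if it is not an existing PDF file."""
    path = Path(file_path)
    # Assumption: only .pdf statements are expected.
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"Statement not found: {path}")
    return path


if __name__ == "__main__":
    try:
        filePath = str(check_statement_path("samples/Statement April-Aug 2021.pdf"))
        print("Using statement:", filePath)
    except (ValueError, FileNotFoundError) as err:
        print("Cannot process statement:", err)
```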
To generate the output, run the following command:
python3 main.py
Thank You!