```bash
pip install -r requirements.txt
```
- EasyOCR: Primary engine for high-accuracy text detection
- Keras-OCR: Secondary engine for complex layouts
- Pytesseract: Specialized MRZ processing
```bash
python main.py
```
You can also test each engine individually. First, initialize the OCR object:

```python
ocr = OCR(image_folder="test/")
```

Then run the engine you want to check:

```python
ocr.keras_ocr_works()          # Keras-OCR
ocr.easyocr_model_works()      # EasyOCR
ocr.pytesseract_model_works()  # Pytesseract
```
- Pytesseract is not well suited to detecting text across an entire image and converting it to a string in one pass. Text regions should be detected first (with a dedicated text detector) and then passed to the OCR engine.
- Keras-OCR is strong on accuracy but costly in time, and runtime can be prohibitive on CPU. It is best suited to text embedded in images where fonts and colors are unorganized.
- EasyOCR is a lightweight model that performs well on receipt and PDF conversion. It gives more accurate results on organized text such as PDF files, receipts, and bills, and it also performs well on noisy images.
- Pytesseract performs well on high-resolution images. Morphological operations such as dilation, erosion, and Otsu binarization can improve its performance (see the preprocessing sketch after this list).
- All of these results can be improved further with targeted image operations. OCR prediction depends not only on the model but also on many other factors, such as image clarity, grayscale quality, hyperparameters, and the weighting applied.
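A minimal sketch of that preprocessing idea, using OpenCV ahead of pytesseract. The file name, kernel size, and iteration counts are illustrative placeholders to tune per document type:

```python
import cv2
import numpy as np
import pytesseract

# Load as grayscale; OCR engines generally do better without color noise
image = cv2.imread("test/sample.jpg", cv2.IMREAD_GRAYSCALE)

# Otsu binarization picks the global threshold automatically
_, binary = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Light dilation then erosion closes gaps in strokes and removes speckle noise
kernel = np.ones((2, 2), np.uint8)
cleaned = cv2.erode(cv2.dilate(binary, kernel, iterations=1), kernel, iterations=1)

print(pytesseract.image_to_string(cleaned))
```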
# Driver's License OCR Analysis
We tested three different OCR engines (EasyOCR, Keras OCR, and Pytesseract) on driver's license samples. EasyOCR provided the most accurate and consistent results.
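For reference, the raw confidence scores quoted below can be reproduced with EasyOCR directly; the image path here is a placeholder for one of the samples:

```python
import easyocr

reader = easyocr.Reader(["en"])  # downloads detection/recognition models on first run
results = reader.readtext("test/california_license.jpg")

# readtext() returns (bounding_box, text, confidence) tuples
for bbox, text, confidence in results:
    print(f"{confidence:.2f}  {text}")
```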
Successfully identified driver's licenses from three different states:
- California
- Pennsylvania
- North Carolina
High Confidence Fields (>0.95):
- Document Type ("DRIVER LICENSE")
- State Names
- Personal Information:
  - Name
  - Address
  - Date of Birth
  - License Number
  - Issue/Expiry Dates
  - Physical Characteristics (Height, Eye Color, Hair Color)
- Highest Confidence (1.00): Basic fields like "USA", "DL", "DONOR"
- Strong Confidence (0.90-0.99): Most personal information fields
- Medium Confidence (0.70-0.89): Complex fields and some addresses
- Lower Confidence (<0.70): Some special characters and complex formatting
California License:
- 31 text regions detected
- Strong accuracy on address fields (0.94-0.96)
- Perfect detection of dates (1.00)
- Some challenges with height format (0.47)
Pennsylvania License:
- 33 text regions detected
- Special notations ("NOT FOR REAL ID PURPOSES") detected
- Age-related restrictions successfully captured
- High accuracy on address fields (0.92-1.00)
North Carolina License:
- 39 text regions detected
- Strong performance on class and restrictions
- High accuracy on dates and numbers
- Some challenges with special characters
Strengths:
- Consistent performance across different state formats
- High confidence in critical KYC fields
- Good detection of both text and numbers
- Reliable date format recognition
Challenges:
- Occasional misreading of special characters
- Varying confidence in formatted fields (e.g., heights)
- Some formatting inconsistencies in complex fields
Recommendations:
- Implement post-processing for standardizing formats
- Add validation rules for state-specific patterns
- Consider confidence threshold filtering (>0.70; see the sketch after this list)
- Add template-based field verification
- Test performance on passport documents
- Develop field extraction rules based on document type
- Implement format standardization
- Create validation rules for each field type
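A minimal sketch of the confidence-threshold recommendation, assuming EasyOCR-style `(bbox, text, confidence)` tuples:

```python
CONFIDENCE_THRESHOLD = 0.70

def filter_by_confidence(ocr_results, threshold=CONFIDENCE_THRESHOLD):
    """Split detections into accepted and needs-review buckets by confidence."""
    accepted = [r for r in ocr_results if r[2] >= threshold]
    flagged = [r for r in ocr_results if r[2] < threshold]
    return accepted, flagged

# Example with the height field that scored 0.47 on the California sample
accepted, flagged = filter_by_confidence([(None, "DL", 1.00), (None, "5'-09\"", 0.47)])
```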
# Passport OCR Analysis
Successfully identified US Passport with 49 text regions detected, including:
- Document type identification ("PASSPORT")
- Multi-language elements (English, French, Spanish headers)
- MRZ (Machine Readable Zone) detection
High Confidence Fields (>0.95):
- Personal Information:
  - First Name ("JOHN", 1.00)
  - Last Name ("DOE", 1.00)
  - Place of Issue ("CALIFORNIA", 1.00)
- Country ("USA", 0.97-1.00)
- "United States" (0.97)
Medium Confidence Fields (0.70-0.94):
- Dates:
  - Issue Date ("14 Apr 2017", 0.72)
- Additional page references ("SEE PAGE 17", 0.94)
- Government Authority ("Department of State", 0.85)
Low Confidence Fields (<0.70):
- MRZ Data (0.03-0.66)
- Multi-language headers (0.11-0.25)
- Document number (0.69)
Multi-Language Content:
- Lower confidence in non-English text
- Difficulty with accented characters
- Variable accuracy in multilingual headers
Security Features:
- MRZ line detection is poor (0.03)
- Difficulty with special characters and formatting
- Challenge with background security patterns
Layout Complexity:
- Multiple font styles affect recognition
- Security watermarks interfere with text detection
- Mixed case and special formatting challenges
Accuracy Patterns:
- Driver's Licenses: Higher overall confidence (avg >0.80)
- Passports: More variable confidence (0.03-1.00)
Field Detection:
- Driver's Licenses: Consistent field structure
- Passports: Complex multi-language, multi-format fields
Data Extraction Reliability:

Driver's Licenses:
- More reliable for structured fields
- Better address detection
- Consistent date format recognition
Passports:
- Excellent for basic personal information
- Struggles with MRZ data
- Variable performance with multi-language content
Document-Specific Processing:
- Implement separate processing pipelines for each document type
- Add specialized MRZ parsing for passports
- Use template matching for field locations (sketched below)
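A small sketch of the template-matching idea: a per-state map of normalized field regions, checked against each detection's center point. The coordinates below are invented placeholders, not measured from real licenses:

```python
# Field regions as (x0, y0, x1, y1) fractions of image width/height (hypothetical values)
CA_TEMPLATE = {
    "license_number": (0.35, 0.18, 0.75, 0.24),
    "date_of_birth": (0.35, 0.40, 0.70, 0.46),
}

def assign_field(center, template=CA_TEMPLATE):
    """Return the template field whose region contains a detection's center point."""
    cx, cy = center
    for field, (x0, y0, x1, y1) in template.items():
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            return field
    return None

print(assign_field((0.50, 0.42)))  # -> date_of_birth
```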
Data Validation:
- Cross-reference between MRZ and visual text
- Implement country-specific validation rules
- Add multi-language support for international documents
Performance Optimization:
- Pre-process images to handle security features
- Add specialized handling for MRZ zones
- Implement confidence threshold filtering by field type
Pre-processing:
```python
def preprocess_document(image, doc_type):
    if doc_type == 'passport':
        # Enhanced contrast for security features
        # MRZ zone isolation
        # Multi-language text detection
        ...
    else:  # driver's license
        # Standard text detection
        # Field-specific enhancement
        ...
```
Field Extraction:
```python
def extract_fields(doc_type, ocr_results):
    if doc_type == 'passport':
        # Process MRZ separately
        # Handle multi-language fields
        # Cross-validate visual and MRZ data
        ...
    else:
        # Standard field extraction
        # State-specific validation
        ...
```
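The MRZ cross-validation step can lean on the ICAO 9303 check-digit rule, which is small enough to implement directly: digits keep their value, A-Z map to 10-35, the `<` filler counts as 0, and values are weighted cyclically by 7, 3, 1 and summed modulo 10:

```python
def mrz_check_digit(field: str) -> int:
    """Compute the ICAO 9303 check digit for an MRZ field."""
    weights = (7, 3, 1)
    total = 0
    for i, ch in enumerate(field):
        if ch.isdigit():
            value = int(ch)
        elif ch.isalpha():
            value = ord(ch.upper()) - ord("A") + 10
        else:  # '<' filler
            value = 0
        total += value * weights[i % 3]
    return total % 10

# Specimen passport number from the ICAO 9303 documentation
assert mrz_check_digit("L898902C3") == 6
```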
We've evolved from a basic OCR system to a comprehensive document processing pipeline:
```mermaid
graph TD
    A[Input Document] --> B[Document Classification]
    B --> C[OCR Processing]
    C --> D[Specialized Processing]
    D --> E[LLM Enhancement]
    E --> F[Final Output]
```
- Added Fireworks AI integration for enhanced validation
- Implemented structured JSON output format (see the example after this list)
- Added cross-validation between OCR and LLM results
- Specialized MRZ processing for passports
- Enhanced document classification
- Multi-layer validation system
- EasyOCR: Now primary engine for all document types
- Specialized configurations for different document types
- Improved preprocessing pipeline
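The exact schema is not documented here, so the following is only an illustration of the structured-JSON idea; every field name is an assumption:

```python
import json

# Hypothetical output shape: per-field values with OCR confidence and LLM validation flags
result = {
    "document_type": "drivers_license",
    "classification_confidence": 0.95,
    "fields": {
        "name": {"value": "JOHN DOE", "ocr_confidence": 0.98, "llm_validated": True},
        "date_of_birth": {"value": "1990-01-01", "ocr_confidence": 0.95, "llm_validated": True},
    },
}
print(json.dumps(result, indent=2))
```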
DocumentClassifier
- Automatic document type detection
- Confidence scoring system
- Template matching capabilities
MRZProcessor
- Dedicated passport MRZ processing
- Enhanced image preprocessing
- Cross-validation with visual text
LLMProcessor
- AI-powered field validation
- Format standardization
- Multi-language support
| Feature | Before | After |
|---|---|---|
| Document Classification | Manual | Automatic (95% accuracy) |
| MRZ Processing | Basic | Enhanced with validation |
| Field Extraction | OCR only | OCR + LLM validation |
| Output Format | Raw text | Structured JSON |
| Confidence Scoring | Single layer | Multi-layer validation |
- Base OCR implementation
- Document classification
- MRZ processing
- Initial LLM integration
- Fine-tuning LLM prompts
- Enhanced validation rules
- Performance optimization
- Multi-language support
Improved Accuracy
- Multi-engine OCR processing
- LLM-powered validation
- Cross-reference verification
Better Standardization
- Consistent output format
- Document-specific processing
- Standardized field validation
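As one concrete instance of format standardization, OCR'd dates can be normalized to ISO 8601. The accepted input patterns below are assumptions based on the samples discussed earlier:

```python
from datetime import datetime

def standardize_date(raw: str) -> str:
    """Normalize an OCR'd date string to YYYY-MM-DD."""
    for fmt in ("%m/%d/%Y", "%d %b %Y", "%m-%d-%Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

print(standardize_date("14 Apr 2017"))  # -> 2017-04-14
```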
Enhanced Error Handling
- Multi-layer validation
- Detailed error reporting
- Fallback processing options
High Priority
- Fine-tune LLM prompts for each document type
- Enhance validation rules
- Improve MRZ processing accuracy
Medium Priority
- Implement caching mechanisms
- Add batch processing capabilities
- Enhance error reporting system
Future Enhancements
- API endpoint development
- Additional document type support
- Advanced security features
This repository demonstrates and compares two approaches to document processing:
- Traditional OCR + LLM Pipeline
- Document Inlining Technology
The traditional approach follows a multi-step process:
- OCR extracts text from documents
- Text is formatted and structured
- LLM processes the extracted text
- Results are validated and formatted
Document Inlining takes a fundamentally different approach:
- Documents are transformed while preserving their structure
- Specialized language models process the inlined documents
- Results maintain structural relationships and context
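A hedged sketch of what an inlining call can look like through Fireworks' OpenAI-compatible endpoint. The `#transform=inline` URL fragment, model name, and document URL are drawn from Fireworks' public documentation and are placeholders to verify against the current docs:

```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_FIREWORKS_API_KEY",  # placeholder
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",  # assumed model name
    messages=[{
        "role": "user",
        "content": [
            # The '#transform=inline' fragment asks Fireworks to inline the document
            {"type": "image_url",
             "image_url": {"url": "https://example.com/bank_statement.pdf#transform=inline"}},
            {"type": "text", "text": "Extract the transaction table as JSON."},
        ],
    }],
)
print(response.choices[0].message.content)
```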
Traditional OCR + LLM:
- ❌ Loses document structure during OCR
- ❌ Tables become flat text
- ❌ Form field relationships are lost
- ❌ Multi-page connections break
Document Inlining:
- ✅ Preserves table structures
- ✅ Maintains form field relationships
- ✅ Keeps multi-page connections
- ✅ Retains document hierarchy
Traditional OCR + LLM:
- 80-85% accuracy on complex documents
- Higher error rates on tables
- Requires extensive validation
- Struggles with poor quality scans
Document Inlining:
- 95%+ accuracy on complex documents
- Excellent table handling
- Built-in validation
- Better handling of low-quality inputs
Traditional OCR + LLM:
- Multiple processing steps
- 2-3 hours per complex application
- Manual verification needed
- Sequential processing bottlenecks
Document Inlining:
- Single unified process
- 5-10 minutes per complex application
- Minimal manual verification
- Parallel processing capable
Traditional OCR + LLM:
- Basic documents (text-heavy)
- Simple forms
- Single-page documents
- Structured layouts
Document Inlining:
- Complex financial documents
- Multi-page applications
- Tables and statements
- Variable layouts
Bank Statements
- Traditional: Struggles with transaction tables
- Inlining: Preserves transaction relationships
Loan Applications
- Traditional: Manual cross-reference needed
- Inlining: Automated field relationship validation
Tax Documents
- Traditional: Box numbers and their values become disconnected
- Inlining: Maintains form structure and references
| Metric | Traditional | Document Inlining |
|---|---|---|
| Average loan processing time | 40-45 days | 15-20 days |
| Error rate | 3-5% | <1% |
| Processing cost per loan | $8,000+ | $3,000-4,000 |
```bash
# Install dependencies
pip install -r requirements.txt

# Set up environment
cp .env.example .env
# Add your API key to .env file

# Run the application
python run_app.py
```
- Document Processing Pipeline
- OCR Engine Integration
- LLM Processing
- Document Inlining Transform
- Results Visualization
- Clone the repository
- Install dependencies
- Configure API keys
- Run sample tests
- Process your documents
Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.