A comprehensive set of code examples for extracting content from PDFs.
See also: PDF Parsing Guide
- Multiple extraction methods with different tools/libraries:
- Cloud-based: Claude 3.5 Sonnet, GPT-4 Vision, Unstructured.io
- Local: Llama 3.2 11B, Docling, PDFium
- Specialized: Camelot (tables), PDFMiner (text), PDFPlumber (mixed), pypdf, etc.
- Maintains document structure and formatting
- Handles complex PDFs with mixed content including extracting image data
- Claude & Llama: Excellent for complex PDFs with mixed content
- GPT-4 Vision: Excellent for visual content analysis
- Unstructured.io: Advanced content partitioning and classification
- Llama 3.2 11B Vision: Image-based PDF processing
- Docling: Excellent for complex PDFs with mixed content
- PDFium: High-fidelity processing using Chrome's PDF engine
- Camelot: Specialized table extraction
- PDFMiner/PDFPlumber: Basic text and layout extraction
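As a rough illustration of the basic text and layout extraction listed above, here is a minimal sketch using pdfplumber. The helper names `summarize_pages` and `extract_from_path` are hypothetical and are not taken from this repository. The sketch assumes pdfplumber is installed and relies only on `pdfplumber.open`, `pdf.pages`, `page.extract_text()` and `page.extract_tables()`.

```python
from typing import Any, Dict, List


def summarize_pages(pdf: Any) -> List[Dict[str, Any]]:
    """Collect text and tables from every page of an opened pdfplumber PDF."""
    results = []
    for number, page in enumerate(pdf.pages, start=1):
        results.append({
            "page": number,
            # Pages without a text layer may yield None or an empty string.
            "text": page.extract_text() or "",
            # Each table is a list of rows; each row is a list of cell values.
            "tables": page.extract_tables(),
        })
    return results


def extract_from_path(path: str) -> List[Dict[str, Any]]:
    """Open a PDF with pdfplumber and summarize its pages."""
    # Imported lazily so this module loads even if pdfplumber is missing.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return summarize_pages(pdf)
```

For example, `extract_from_path("input/sample-1.pdf")` would return one dictionary per page. Image-based PDFs such as sample-2.pdf are likely to come back with empty text, which is where the vision-based processors become relevant.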
langchain_ollama
langchain_huggingface
langchain_community
FAISS
python-dotenv
anthropic # Claude
openai # GPT-4 Vision
camelot-py # Table extraction
docling # Text processing
pdf2image # PDF conversion
pypdfium2 # PDFium processing
boto3 # AWS Textract
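Before running any processor, it can help to check which of these packages are actually importable. The sketch below is one possible way to do that with the standard library only. The function name `missing_packages` and the mapping are illustrative assumptions, and the mapping covers only the packages whose import name is well known.

```python
from importlib.util import find_spec
from typing import Dict, List

# pip package name -> importable module name (the two often differ).
REQUIRED_MODULES: Dict[str, str] = {
    "python-dotenv": "dotenv",
    "anthropic": "anthropic",
    "openai": "openai",
    "camelot-py": "camelot",
    "docling": "docling",
    "pdf2image": "pdf2image",
    "pypdfium2": "pypdfium2",
    "boto3": "boto3",
}


def missing_packages(required: Dict[str, str] = REQUIRED_MODULES) -> List[str]:
    """Return the pip names of packages whose module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```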
- Environment Variables
ANTHROPIC_API_KEY=your_key_here # For Claude
OPENAI_API_KEY=your_key_here # For OpenAI
UNSTRUCTURED_API_KEY=your_key_here # For Unstructured.io
- Install Dependencies
pip install -r requirements.txt
- Install Ollama & Models (for local processing)
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull required models
ollama pull llama3.1
ollama pull x/llama3.2-vision:11b
- Place PDF files in the input/ directory
- sample-1.pdf: Standard tables
- sample-2.pdf: Image-based simple tables
- sample-3.pdf: Image-based complex tables
- sample-4.pdf: Mixed content (text, tables, images)
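The setup steps above can be verified with a small preflight check. This is only a sketch: `missing_keys` and `list_input_pdfs` are hypothetical helpers, not part of this repository. It reads variables from the process environment, so if the keys live in a `.env` file, call `load_dotenv()` from python-dotenv first.

```python
import os
from pathlib import Path
from typing import Iterable, List, Mapping

# Key names taken from the Environment Variables step.
API_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "UNSTRUCTURED_API_KEY")


def missing_keys(env: Mapping[str, str] = os.environ,
                 names: Iterable[str] = API_KEYS) -> List[str]:
    """Return the names of API keys that are unset or empty."""
    return [name for name in names if not env.get(name)]


def list_input_pdfs(directory: str = "input") -> List[Path]:
    """Return the PDF files in the input directory, sorted by path."""
    folder = Path(directory)
    if not folder.is_dir():
        return []
    return sorted(p for p in folder.iterdir() if p.suffix.lower() == ".pdf")


if __name__ == "__main__":
    print("Missing API keys:", missing_keys() or "none")
    print("PDFs found:", [p.name for p in list_input_pdfs()] or "none")
```

Only the keys for the cloud processors you intend to use need to be set. The local processors do not require any of them.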
- Local LLM operations require adequate system resources
- API keys are required for cloud-based implementations
- Consider PDF complexity when choosing an implementation
- Ghostscript is required for Camelot
- Different processors suit different use cases
- Cloud: Complex documents, mixed content
- Local: Simple text, basic tables
- Specialized: Specific content types (tables, forms)
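The guidance above could be expressed as a simple selection heuristic. The sketch below is one possible reading of it, not the logic used by this repository. The function name `choose_processor`, the content-type labels and the `complex_layout` flag are all assumptions made for illustration.

```python
from typing import Set


def choose_processor(content_types: Set[str], complex_layout: bool = False) -> str:
    """Suggest a processor family: 'cloud', 'specialized' or 'local'.

    content_types uses illustrative labels: 'text', 'tables', 'images', 'forms'.
    """
    # Complex documents, or images mixed with other content, go to cloud models.
    if complex_layout or ("images" in content_types and len(content_types) > 1):
        return "cloud"
    # A document made up only of tables and/or forms suits a specialized tool.
    if content_types and content_types <= {"tables", "forms"}:
        return "specialized"
    # Simple text, possibly with basic tables, can be handled locally.
    return "local"


if __name__ == "__main__":
    print(choose_processor({"text", "tables", "images"}))
    print(choose_processor({"tables"}))
    print(choose_processor({"text", "tables"}))
```

In practice the thresholds depend on the documents at hand, so treat the rules as a starting point and adjust them after comparing processors on the sample PDFs.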