📑 Complex PDF Parsing

A comprehensive example codes for extracting content from PDFs

Also, check -> Pdf Parsing Guide

📌 Core Features

📤 Content Extraction

Multiple extraction methods with different tools/libraries:
- Cloud-based: Claude 3.5 Sonnet, GPT-4 Vision, Unstructured.io
- Local: Llama 3.2 11B, Docling, PDFium
- Specialized: Camelot (tables), PDFMiner (text), PDFPlumber (mixed), PyPdf etc
Maintains document structure and formatting
Handles complex PDFs with mixed content including extracting image data

📦 Implementation Options

1. ☁️ Cloud-Based Methods

Claude & Llama: Excellent for complex PDFs with mixed content
GPT-4 Vision: Excellent for visual content analysis
Unstructured.io: Advanced content partitioning and classification

2. 🖥️ Local Methods

Llama 3.2 11B Vision: Image-based PDF processing
Docling: Excellent for complex PDFs with mixed content
PDFium: High-fidelity processing using Chrome's PDF engine
Camelot: Specialized table extraction
PDFMiner/PDFPlumber: Basic text and layout extraction

🔗 Dependencies

📚 Core Libraries

langchain_ollama
langchain_huggingface
langchain_community
FAISS
python-dotenv

⚙️ Implementation-Specific

anthropic        # Claude
openai           # GPT-4 Vision
camelot-py      # Table extraction
docling         # Text processing
pdf2image       # PDF conversion
pypdfium2       # PDFium processing
boto3           # AWS Textract

🛠️ Setup

Environment Variables

ANTHROPIC_API_KEY=your_key_here    # For Claude
OPENAI_API_KEY=your_key_here       # For OpenAI
UNSTRUCTURED_API_KEY=your_key_here # For Unstructured.io

Install Dependencies

pip install -r requirements.txt

Install Ollama & Models (for local processing)

# Install Ollama
curl https://ollama.ai/install.sh | sh

# Pull required models
ollama pull llama3.1
ollama pull x/llama3.2-vision:11b

📈 Usage

Place PDF files in input/ directory

📄 Example Complex Pdf placed in Input folder

sample-1.pdf: Standard tables
sample-2.pdf: Image-based simple tables
sample-3.pdf: Image-based complex tables
sample-4.pdf: Mixed content (text, tables, images)

📝 Notes

System resources needed for local LLM operations
API keys required for cloud based implementations
Consider PDF complexity when choosing implementation
Ghostscript required for Camelot
Different processors suit different use cases
- Cloud: Complex documents, mixed content
- Local: Simple text, basic tables
- Specialized: Specific content types (tables, forms)

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
converted_images/llama		converted_images/llama
input		input
output		output
parser		parser
utils		utils
.env		.env
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
__init__.py		__init__.py
pdf-parsing-guide.pdf		pdf-parsing-guide.pdf
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

📑 Complex PDF Parsing

📌 Core Features

📤 Content Extraction

📦 Implementation Options

1. ☁️ Cloud-Based Methods

2. 🖥️ Local Methods

🔗 Dependencies

📚 Core Libraries

⚙️ Implementation-Specific

🛠️ Setup

📈 Usage

📄 Example Complex Pdf placed in Input folder

📝 Notes

About

Releases

Packages

Languages

License

genieincodebottle/parsemypdf

Folders and files

Latest commit

History

Repository files navigation

📑 Complex PDF Parsing

📌 Core Features

📤 Content Extraction

📦 Implementation Options

1. ☁️ Cloud-Based Methods

2. 🖥️ Local Methods

🔗 Dependencies

📚 Core Libraries

⚙️ Implementation-Specific

🛠️ Setup

📈 Usage

📄 Example Complex Pdf placed in Input folder

📝 Notes

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages