Filparser is a gRPC-based document parsing service that extracts structure and content from various document formats. It provides high-performance, scalable document analysis and supports layout-aware document chunking, producing high-quality text chunks that can be fed directly into Retrieval-Augmented Generation (RAG) systems.
Used for PDF document layout detection, with the following categories:
{0: 'title', # Title
1: 'plain text', # Text
2: 'abandon', # Includes headers, footers, page numbers, and page annotations
3: 'figure', # Image
4: 'figure_caption', # Image caption
5: 'table', # Table
6: 'table_caption', # Table caption
7: 'table_footnote', # Table footnote
8: 'isolate_formula', # Display formula (this is a layout display formula, lower priority than 14)
9: 'formula_caption', # Display formula label
13: 'inline_formula', # Inline formula
14: 'isolated_formula', # Display formula
15: 'ocr_text'} # OCR result
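As a rough illustration of how these category ids could be used downstream, the sketch below drops `abandon` regions and starts a new text chunk at each `title`. The region dictionaries (`category_id`, `text`), the helper names `drop_abandoned` and `chunk_by_title`, and the sample data are assumptions made for this example; they are not Filparser's actual output schema or chunking algorithm.

```python
from typing import Dict, List

# Category ids as listed above. Ids 10-12 do not appear in the list and are left out.
LAYOUT_CATEGORIES: Dict[int, str] = {
    0: "title",
    1: "plain text",
    2: "abandon",
    3: "figure",
    4: "figure_caption",
    5: "table",
    6: "table_caption",
    7: "table_footnote",
    8: "isolate_formula",
    9: "formula_caption",
    13: "inline_formula",
    14: "isolated_formula",
    15: "ocr_text",
}

ABANDON_ID = 2  # headers, footers, page numbers, page annotations


def drop_abandoned(regions: List[dict]) -> List[dict]:
    """Return the regions that are not labelled 'abandon', preserving order."""
    return [r for r in regions if r["category_id"] != ABANDON_ID]


def chunk_by_title(regions: List[dict]) -> List[str]:
    """Group region text into chunks, starting a new chunk at each title."""
    chunks: List[str] = []
    current: List[str] = []
    for region in drop_abandoned(regions):
        is_title = LAYOUT_CATEGORIES.get(region["category_id"]) == "title"
        if is_title and current:
            chunks.append("\n".join(current))
            current = []
        current.append(region["text"])
    if current:
        chunks.append("\n".join(current))
    return chunks


# Hypothetical detections in reading order (not real model output).
sample_regions = [
    {"category_id": 0, "text": "1 Introduction"},
    {"category_id": 1, "text": "Document parsing is useful."},
    {"category_id": 2, "text": "Page 1"},
    {"category_id": 0, "text": "2 Method"},
    {"category_id": 1, "text": "We detect layout regions."},
]

chunks = chunk_by_title(sample_regions)
```

With this sample input, the page number is discarded and two chunks result, one per titled section.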
Used for high-accuracy text-chunk recognition and extraction.
- Support for more document formats
- PDF to Markdown conversion
- Enhanced layout analysis with the LayoutReader model
# Create and activate environment
conda create -n filparser python=3.8
conda activate filparser
# Build service
sh run.sh build
# Format code
sh run.sh format
sh run.sh pdf
grpcurl \
--import-path ./ \
--proto ./file_parser.proto \
-d '{"file_path": "./2.pdf", "storage_type": "LOCAL"}' \
-emit-defaults \
--plaintext 127.0.0.1:50058 file_parser.FileParser/Parse
grpcurl \
--import-path ./ \
--proto ./file_parser.proto \
-d '{"file_path": "file/1.pdf", "storage_type": "MINIO", "minio_bucket": "test"}' \
-emit-defaults \
--plaintext 127.0.0.1:50058 file_parser.FileParser/Parse
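The two calls above differ only in the JSON request body. A minimal Python sketch for assembling the same `grpcurl` invocation programmatically might look like the following. The function name `build_parse_command` and its parameters are hypothetical helpers introduced here; the flags, proto path, address, and request fields are copied from the commands above. The check that `minio_bucket` is required for `MINIO` is an assumption of this sketch, not documented service behavior.

```python
import json
import shlex
from typing import List, Optional


def build_parse_command(
    file_path: str,
    storage_type: str = "LOCAL",
    minio_bucket: Optional[str] = None,
    address: str = "127.0.0.1:50058",
) -> List[str]:
    """Build the grpcurl argument list for a Parse request."""
    payload = {"file_path": file_path, "storage_type": storage_type}
    if storage_type == "MINIO":
        # Assumption: a bucket name must accompany MINIO requests.
        if not minio_bucket:
            raise ValueError("minio_bucket is required when storage_type is MINIO")
        payload["minio_bucket"] = minio_bucket
    return [
        "grpcurl",
        "--import-path", "./",
        "--proto", "./file_parser.proto",
        "-d", json.dumps(payload),
        "-emit-defaults",
        "--plaintext", address,
        "file_parser.FileParser/Parse",
    ]


local_cmd = build_parse_command("./2.pdf")
minio_cmd = build_parse_command("file/1.pdf", "MINIO", minio_bucket="test")

# Print shell-quoted versions that can be pasted into a terminal.
print(shlex.join(local_cmd))
print(shlex.join(minio_cmd))
```

Passing either list to `subprocess.run(cmd, check=True)` would execute the request, assuming `grpcurl` is installed and the service is listening on the given address.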
This project is open-sourced under the AGPL-3.0 license.
- PDF-Extract-Kit: PDF parsing
- LayoutLMv3: Layout detection model
- PaddleOCR: OCR model