Filparser

中文 | English

🌟 Introduction

Filparser is a gRPC-based document parsing service that extracts structure and content from various document formats. In addition to high-performance, scalable document analysis, it supports layout-aware document chunking, producing text chunks that can be fed directly to Retrieval-Augmented Generation (RAG) systems.

🚀 Models

LayoutLMv3

Used for PDF document layout detection. Each detected region is assigned one of the following category IDs:

{
 0: 'title',              # Title
 1: 'plain text',         # Text
 2: 'abandon',            # Headers, footers, page numbers, and page annotations
 3: 'figure',             # Image
 4: 'figure_caption',     # Image caption
 5: 'table',              # Table
 6: 'table_caption',      # Table caption
 7: 'table_footnote',     # Table footnote
 8: 'isolate_formula',    # Display formula found by layout detection (lower priority than 14)
 9: 'formula_caption',    # Display formula label
 13: 'inline_formula',    # Inline formula
 14: 'isolated_formula',  # Display formula
 15: 'ocr_text',          # OCR result
}
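
To illustrate how these category IDs could drive layout-aware chunking, here is a minimal sketch. It is not Filparser's actual chunking logic: the `Block` type and the `chunk_by_title` function are hypothetical names introduced for this example, and the sketch assumes blocks arrive in reading order. It starts a new chunk at every `title` block and drops `abandon` blocks.

```python
from dataclasses import dataclass
from typing import Dict, List

# Category IDs copied from the label map above.
LAYOUT_LABELS: Dict[int, str] = {
    0: "title", 1: "plain text", 2: "abandon", 3: "figure",
    4: "figure_caption", 5: "table", 6: "table_caption",
    7: "table_footnote", 8: "isolate_formula", 9: "formula_caption",
    13: "inline_formula", 14: "isolated_formula", 15: "ocr_text",
}


@dataclass
class Block:
    """Hypothetical layout block: a category ID plus its extracted text."""
    category_id: int
    text: str


def chunk_by_title(blocks: List[Block]) -> List[str]:
    """Group blocks into chunks, opening a new chunk at each title.

    Blocks labelled 'abandon' (headers, footers, page numbers) and blocks
    with unknown category IDs are skipped.
    """
    chunks: List[List[str]] = []
    for block in blocks:
        label = LAYOUT_LABELS.get(block.category_id)
        if label is None or label == "abandon":
            continue
        # A title opens a new chunk; so does the first kept block.
        if label == "title" or not chunks:
            chunks.append([])
        chunks[-1].append(block.text)
    return ["\n".join(parts) for parts in chunks]
```

A real pipeline would likely also cap chunk length and keep captions attached to their figures or tables; this sketch only shows the grouping step.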

PaddleOCR

Used for high-accuracy text-chunk recognition and extraction.

TODO

Support for more document formats
PDF to Markdown conversion
Enhanced layout analysis with the LayoutReader model

Quick Start

Environment Setup

# Create and activate environment

conda create -n filparser python=3.8

conda activate filparser

Installation

# Build service

sh run.sh build

# Format code

sh run.sh format

Running the Server

sh run.sh pdf

Testing

Local

grpcurl \
    --import-path ./ \
    --proto ./file_parser.proto \
    -d '{"file_path": "./2.pdf", "storage_type": "LOCAL"}' \
    -emit-defaults \
    --plaintext 127.0.0.1:50058 file_parser.FileParser/Parse

MinIO

grpcurl \
    --import-path ./ \
    --proto ./file_parser.proto \
    -d '{"file_path": "file/1.pdf", "storage_type": "MINIO", "minio_bucket": "test"}' \
    -emit-defaults \
    --plaintext 127.0.0.1:50058 file_parser.FileParser/Parse
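
The two invocations above differ only in the request body. The following sketch builds the same `grpcurl` argument list from Python so it can be reused in scripts. The helper name `build_grpcurl_args` and its parameters are assumptions made for this example; the field names, address, proto path, and method name are taken from the commands above. The requirement that `minio_bucket` be set for `MINIO` is inferred from those commands, not from the proto definition.

```python
import json
from typing import List, Optional


def build_grpcurl_args(
    file_path: str,
    storage_type: str = "LOCAL",
    minio_bucket: Optional[str] = None,
    address: str = "127.0.0.1:50058",
    proto: str = "./file_parser.proto",
) -> List[str]:
    """Assemble the grpcurl argument list for a Parse request (hypothetical helper)."""
    request = {"file_path": file_path, "storage_type": storage_type}
    if storage_type == "MINIO":
        # Assumption: MinIO requests need a bucket, as in the example above.
        if not minio_bucket:
            raise ValueError("minio_bucket is required when storage_type is MINIO")
        request["minio_bucket"] = minio_bucket
    return [
        "grpcurl",
        "-import-path", "./",
        "-proto", proto,
        "-d", json.dumps(request),
        "-emit-defaults",
        "-plaintext",
        address,
        "file_parser.FileParser/Parse",
    ]
```

Passing the returned list to `subprocess.run(args, check=True)` runs the request against a started server without any shell quoting of the JSON body.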

License

This project is open-sourced under the AGPL-3.0 license.

Acknowledgement

PDF-Extract-Kit: PDF parsing
LayoutLMv3: Layout detection model
PaddleOCR: OCR model
