[FEAT] Support MHTML Conversion #659

Open
spencerthayer opened this issue Dec 29, 2024 · 0 comments
Labels
enhancement New feature or request

It would be very helpful if Docling could convert MHTML files to Markdown, HTML, and JSON. MHTML (MIME HTML) is a common format for saving a web page as a single file: it bundles the page's HTML together with embedded assets such as images and stylesheets, preserving the page's structure. Supporting it would make Docling more useful to anyone working with offline web archives or extracting structured data from web-based documents.

Why It’s Useful
• Web Page Archiving: Many users save web pages as MHTML files for offline access, and extracting meaningful content from these files is a frequent need.
• Consistency with Existing Formats: Since Docling already supports HTML and other document types, extending this to MHTML would align with its existing capabilities.
• Expanding Use Cases: This feature would open up new workflows for researchers, content managers, and developers working with archived web content.

Proposed Functionality
1. Input: Allow .mhtml files as valid inputs for the convert method and CLI.
2. Conversion Process:
• Extract the HTML content from the MHTML container.
• Resolve embedded resources (e.g., images, CSS) to ensure clean and consistent output.
• Process the HTML content using the existing pipeline for HTML conversion.
3. Output: Export to Markdown, HTML, and JSON, as with the existing formats.
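
The extraction step above can be sketched with Python's standard library alone, since an MHTML file is a MIME multipart message. This is only a rough illustration under stated assumptions: `extract_html_from_mhtml` is a hypothetical helper name, not part of Docling, and the sample archive is hand-written rather than saved by a browser.

```python
import email
from email import policy


def extract_html_from_mhtml(raw: bytes) -> str:
    """Return the first text/html part of an MHTML archive (hypothetical helper)."""
    message = email.message_from_bytes(raw, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding and decodes the charset.
            return part.get_content()
    raise ValueError("no text/html part found in MHTML input")


# Tiny hand-written archive standing in for a browser-saved .mhtml file.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; type="text/html"; boundary="b1"\r\n'
    b"\r\n"
    b"--b1\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"Content-Location: https://example.com/\r\n"
    b"\r\n"
    b"<html><body><h1>Caf=C3=A9</h1></body></html>\r\n"
    b"--b1--\r\n"
)

html = extract_html_from_mhtml(SAMPLE)
print(html)
```

The resulting HTML string could then, in principle, be handed to the existing HTML pipeline; how that hand-off would look inside Docling is left open here.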

Examples

CLI:

docling path-to-file.mhtml --to md

Python API:

from docling.document_converter import DocumentConverter

source = "example.mhtml"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())

Challenges & Considerations
• Parsing Embedded Resources: Properly handling and optionally excluding embedded resources might require additional tooling or dependencies.
• Dependency Management: Parsing MIME-encapsulated data could introduce new dependencies.
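
As a rough illustration of the resource-handling concern, the sketch below inlines embedded resources as data URIs using only Python's standard `email` and `base64` modules, which suggests a basic version may not need extra dependencies. It rests on several assumptions: `inline_resources` is a hypothetical name, resources are matched by their `Content-Location` header only (no `cid:` references), and the plain string replacement is naive compared with a real HTML rewrite.

```python
import base64
import email
from email import policy


def inline_resources(raw: bytes) -> str:
    """Return the archive's HTML with embedded resources inlined as data URIs.

    Hypothetical helper: matches resources by Content-Location only.
    """
    message = email.message_from_bytes(raw, policy=policy.default)
    html = None
    resources = {}
    for part in message.walk():
        if part.is_multipart():
            continue
        ctype = part.get_content_type()
        if ctype == "text/html" and html is None:
            html = part.get_content()  # the main document
            continue
        location = part.get("Content-Location")
        if location:
            payload = part.get_payload(decode=True)  # raw resource bytes
            encoded = base64.b64encode(payload).decode("ascii")
            resources[str(location)] = f"data:{ctype};base64,{encoded}"
    if html is None:
        raise ValueError("no text/html part found in MHTML input")
    for location, data_uri in resources.items():
        # Naive substitution; a robust version would rewrite parsed attributes.
        html = html.replace(location, data_uri)
    return html


# Hand-written archive: one HTML page plus one (truncated) GIF resource.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; type="text/html"; boundary="b1"\r\n'
    b"\r\n"
    b"--b1\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Location: https://example.com/\r\n"
    b"\r\n"
    b'<html><body><img src="https://example.com/logo.gif"></body></html>\r\n'
    b"--b1\r\n"
    b"Content-Type: image/gif\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"Content-Location: https://example.com/logo.gif\r\n"
    b"\r\n"
    b"R0lGODlh\r\n"
    b"--b1--\r\n"
)

inlined = inline_resources(SAMPLE)
print(inlined)
```

Excluding resources instead of inlining them would amount to skipping the substitution loop, so an opt-out flag seems cheap to add.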

I believe this feature would significantly enhance Docling’s capabilities and broaden its appeal. Thank you for considering this request!

@spencerthayer spencerthayer added the enhancement New feature or request label Dec 29, 2024