[FEAT] Support MHTML Conversion #659

spencerthayer · 2024-12-29T18:10:05Z

It would be incredibly helpful if Docling could support the conversion of MHTML files to Markdown, HTML, and JSON. MHTML (MIME HTML) files are a common format for saving web pages and preserving their structure, including embedded assets like images and styles. Adding support for MHTML would enhance Docling’s utility, especially for users working with offline web archives or needing to extract structured data from web-based documents.

Why It’s Useful
• Web Page Archiving: Many users save web pages as MHTML files for offline access, and extracting meaningful content from these files is a frequent need.
• Consistency with Existing Formats: Since Docling already supports HTML and other document types, extending this to MHTML would align with its existing capabilities.
• Expanding Use Cases: This feature would open up new workflows for researchers, content managers, and developers working with archived web content.

Proposed Functionality
1. Input: Allow .mhtml files as valid inputs for the convert method and CLI.
2. Conversion Process:
• Extract the HTML content from the MHTML container.
• Resolve embedded resources (e.g., images, CSS) to ensure clean and consistent output.
• Process the HTML content using the existing pipeline for HTML conversion.
3. Output: Export to Markdown, HTML, and JSON, as with the input formats Docling already supports.
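
The extraction step in item 2 could be sketched with Python's standard `email` package, since an MHTML file is a MIME `multipart/related` message. The helper name `extract_html_from_mhtml` is hypothetical and not part of Docling; this is only a minimal sketch, assuming the first `text/html` part is the main document.

```python
import email
from email import policy


def extract_html_from_mhtml(raw: bytes) -> str:
    """Return the first text/html part of an MHTML archive (hypothetical helper)."""
    # policy.default yields EmailMessage objects with get_content() support
    msg = email.message_from_bytes(raw, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding (e.g. quoted-printable)
            # and decodes the bytes using the part's declared charset
            return part.get_content()
    raise ValueError("no text/html part found in MHTML input")
```

The returned string could then be handed to the existing HTML conversion pipeline unchanged.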

Examples

CLI:

docling path-to-file.mhtml --to md

Python API:

from docling.document_converter import DocumentConverter

source = "example.mhtml"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())

Challenges & Considerations
• Parsing Embedded Resources: Properly handling and optionally excluding embedded resources might require additional tooling or dependencies.
• Dependency Management: Support for MHTML could introduce new dependencies for parsing MIME-encapsulated data.
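
As a rough illustration of both points, the sketch below uses only the standard library to either inline embedded images as `data:` URIs or blank them out. The function name `resolve_embedded_images` and the `keep_images` flag are hypothetical, and the plain string replacement on `Content-Location` values is an assumption that a real implementation would likely replace with proper HTML attribute rewriting.

```python
import base64
import email
from email import policy


def resolve_embedded_images(raw: bytes, keep_images: bool = True) -> str:
    """Inline or strip images referenced by Content-Location (hypothetical helper)."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    html = None
    images = {}  # Content-Location -> (MIME type, decoded bytes)
    for part in msg.walk():
        if part.is_multipart():
            continue
        ctype = part.get_content_type()
        if html is None and ctype == "text/html":
            html = part.get_content()
            continue
        location = part.get("Content-Location")
        if location and ctype.startswith("image/"):
            images[str(location)] = (ctype, part.get_payload(decode=True))
    if html is None:
        raise ValueError("no text/html part found in MHTML input")
    for location, (ctype, data) in images.items():
        if keep_images:
            encoded = base64.b64encode(data).decode("ascii")
            replacement = f"data:{ctype};base64,{encoded}"
        else:
            replacement = ""  # leaves an empty reference behind
        # Naive textual substitution; assumes the URL appears verbatim in the HTML
        html = html.replace(location, replacement)
    return html
```

Because everything here comes from `email` and `base64`, a first version of MHTML support might not need any new third-party dependency.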

I believe this feature would significantly enhance Docling’s capabilities and broaden its appeal. Thank you for considering this request!

spencerthayer added the "enhancement" (New feature or request) label on Dec 29, 2024

cau-git assigned PeterStaar-IBM and maxmnemonic Jan 6, 2025
