It would be incredibly helpful if Docling could support the conversion of MHTML files to Markdown, HTML, and JSON. MHTML (MIME HTML) is a common format for saving a web page as a single file that preserves its structure, including embedded assets like images and styles. Adding support for MHTML would enhance Docling’s utility, especially for users who work with offline web archives or need to extract structured data from web-based documents.
Why It’s Useful
• Web Page Archiving: Many users save web pages as MHTML files for offline access, and extracting meaningful content from these files is a frequent need.
• Consistency with Existing Formats: Since Docling already supports HTML and other document types, extending this to MHTML would align with its existing capabilities.
• Expanding Use Cases: This feature would open up new workflows for researchers, content managers, and developers working with archived web content.
Proposed Functionality
1. Input: Allow .mhtml files as valid inputs for the convert method and CLI.
2. Conversion Process:
• Extract the HTML content from the MHTML container.
• Resolve embedded resources (e.g., images, CSS) to ensure clean and consistent output.
• Process the HTML content using the existing pipeline for HTML conversion.
3. Output: Export to Markdown, HTML, and JSON, as with the existing formats.
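The extraction step could be sketched roughly as follows. This is only an illustration, not Docling code: it assumes the MHTML file is a standard MIME `multipart/related` message, and `extract_html` and `SAMPLE_MHTML` are hypothetical names made up for this example. It uses only Python's standard `email` package.

```python
from email import message_from_bytes, policy

# Tiny hand-written MHTML document, used here only for illustration.
SAMPLE_MHTML = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="BOUNDARY"; type="text/html"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"Content-Location: https://example.com/\r\n"
    b"\r\n"
    b"<html><body><h1>Hello</h1><p>caf=C3=A9</p></body></html>\r\n"
    b"--BOUNDARY--\r\n"
)


def extract_html(mhtml_bytes: bytes) -> str:
    """Return the first text/html part of an MHTML archive as a string.

    Hypothetical helper: the transfer encoding (quoted-printable, base64)
    and the charset are decoded by the email package.
    """
    msg = message_from_bytes(mhtml_bytes, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            return part.get_content()
    raise ValueError("no text/html part found in MHTML input")


html = extract_html(SAMPLE_MHTML)
print(html)
```

The resulting HTML string could then be passed to the existing HTML conversion path.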
Examples
CLI:
docling path-to-file.mhtml --to md
Python API:
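from docling.document_converter import DocumentConverter

# Proposed usage once .mhtml is accepted as an input format
source = "example.mhtml"
converter = DocumentConverter()
result = converter.convert(source)

# Export the converted document as Markdown
print(result.document.export_to_markdown())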
Challenges & Considerations
• Parsing Embedded Resources: Properly handling and optionally excluding embedded resources might require additional tooling or dependencies.
• Dependency Management: Support for MHTML could introduce new dependencies for parsing MIME-encapsulated data.
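On both points, Python's standard `email` package can already parse the MIME container, so a first version might not need a new third-party dependency. Below is a rough sketch of resolving or excluding embedded images. It is an illustration under assumptions, not a proposed implementation: `inline_resources`, `keep_images`, and `SAMPLE_MHTML` are hypothetical names, and the plain string replacement of quoted references is deliberately naive (a real implementation would likely rewrite attributes with an HTML parser).

```python
import base64
from email import message_from_bytes, policy

# Hand-written sample: one HTML part that references one embedded image.
SAMPLE_MHTML = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="BOUNDARY"; type="text/html"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Location: https://example.com/\r\n"
    b"\r\n"
    b'<html><body><img src="https://example.com/logo.png"></body></html>\r\n'
    b"--BOUNDARY\r\n"
    b"Content-Type: image/png\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"Content-Location: https://example.com/logo.png\r\n"
    b"\r\n"
    b"iVBORw0KGgo=\r\n"
    b"--BOUNDARY--\r\n"
)


def inline_resources(mhtml_bytes: bytes, keep_images: bool = True) -> str:
    """Return the HTML with embedded images inlined as data URIs.

    With keep_images=False the references are blanked instead, which is
    one way to "optionally exclude" embedded resources.
    """
    msg = message_from_bytes(mhtml_bytes, policy=policy.default)
    html = None
    resources = {}  # reference as it appears in the HTML -> replacement
    for part in msg.walk():
        if part.is_multipart():
            continue
        ctype = part.get_content_type()
        if ctype == "text/html" and html is None:
            html = part.get_content()
        elif ctype.startswith("image/"):
            data = part.get_payload(decode=True)
            encoded = base64.b64encode(data).decode("ascii")
            uri = f"data:{ctype};base64,{encoded}" if keep_images else ""
            location = part["Content-Location"]
            if location:
                resources[str(location).strip()] = uri
            content_id = part["Content-ID"]
            if content_id:
                resources["cid:" + str(content_id).strip().strip("<>")] = uri
    if html is None:
        raise ValueError("no text/html part found in MHTML input")
    # Naive rewrite: only replaces references that appear in double quotes.
    for ref, uri in resources.items():
        html = html.replace(f'"{ref}"', f'"{uri}"')
    return html


with_images = inline_resources(SAMPLE_MHTML)
without_images = inline_resources(SAMPLE_MHTML, keep_images=False)
print(with_images)
```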
I believe this feature would significantly enhance Docling’s capabilities and broaden its appeal. Thank you for considering this request!