Requested feature
I just started using Docling, and I really like the results of the Hybrid Chunking mechanism (with merge_peers enabled), since it effectively captures content that is split across pages. However, it would be very useful if the chunking mechanism had an additional parameter for ignoring specific texts, such as the page_footers and page_headers (these are already captured and labelled as such by the models), while merging content. I am attaching a screenshot of the actual text and its parsed chunk for clarity, as well as an image showing the corresponding footer and header texts in the DoclingDocument.
...
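For reference, this is roughly how I invoke the chunker today (a minimal sketch; the input file name is a placeholder):

```python
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

doc = DocumentConverter().convert("report.pdf").document  # placeholder path
chunker = HybridChunker(merge_peers=True)  # merge_peers merges sibling chunks across pages

for chunk in chunker.chunk(dl_doc=doc):
    print(chunk.text)
```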
Alternatives
I tried implementing this on my own by manipulating the DoclingDocument that is returned, but I think this would be a good built-in feature, since it would let people ignore repetitive content within a document.
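Concretely, my own attempt filtered after chunking, based on the items each chunk was built from (a sketch, assuming chunk.meta.doc_items carries the source items):

```python
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
from docling_core.types.doc import DocItemLabel

IGNORE_LABELS = {DocItemLabel.PAGE_HEADER, DocItemLabel.PAGE_FOOTER}

doc = DocumentConverter().convert("report.pdf").document  # placeholder path
chunker = HybridChunker(merge_peers=True)

for chunk in chunker.chunk(dl_doc=doc):
    labels = {item.label for item in chunk.meta.doc_items}
    if labels - IGNORE_LABELS:  # keep chunks with any non-header/footer content
        print(chunk.text)
```

This can only drop whole chunks: once merge_peers has already folded a footer into a merged chunk, that footer text remains embedded in chunk.text, which is exactly why an ignore parameter inside the chunker itself would be cleaner.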
...
```python
from docling_core.types.doc import DocItemLabel

## filter: blank out page headers/footers in the converted document
# (conv_result is the result of DocumentConverter.convert())
remove_text_labels = [DocItemLabel.PAGE_HEADER, DocItemLabel.PAGE_FOOTER]
remove_label_text = []

def is_number(string):
    try:
        float(string)  # try converting to a float
        return True
    except ValueError:
        return False

# remove text with label in remove_text_labels
for index, text in enumerate(conv_result.document.texts):
    if text.label in remove_text_labels:
        if not is_number(text.orig):  # skip bare page numbers
            remove_label_text.append(text.orig)
        # conv_result.document.texts.remove(text)
        conv_result.document.texts[index].orig = ""
        conv_result.document.texts[index].text = ""

# remove duplicates
remove_label_text = list(set(remove_label_text))

# optional: also blank out other text items with the same content
# (catches headers/footers the layout model missed or mislabelled)
for index, text in enumerate(conv_result.document.texts):
    if text.orig in remove_label_text:
        # conv_result.document.texts.remove(text)
        conv_result.document.texts[index].orig = ""
        conv_result.document.texts[index].text = ""
```
I have written a temporary workaround above, and I also have some suggestions regarding the Docling document structure.
```python
import json

from docling_core.types.doc import ImageRefMode

# Export JSON format:
with (output_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
    fp.write(json.dumps(conv_result.document.export_to_dict()))

# Export Markdown format:
with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
    fp.write(conv_result.document.export_to_markdown(image_mode=ImageRefMode.REFERENCED))
```
The JSON DomTree shows that the "texts" list remains fragmented (a single sentence or text line is split across many text blocks), which suggests the tag labels are not being corrected and the error propagates into the text blocks.
In the export_to_markdown function we can see the logic that combines text blocks into a more meaningful paragraph. Should such logic stay in the Markdown export, or should it be moved into DomTree construction?
What is more, I assume the labels are tagged by a layout model similar to microsoft/layoutlmv3-base, so a good question is what label the combined text blocks should receive, since at that point we can no longer get help from the layout model.
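To make the question concrete, a construction-time merge might look something like the sketch below. This is purely illustrative: it only joins consecutive items in the flat texts list, ignores provenance and the body's child references (which real code would have to update), and the label given to the merged block is exactly the open question.

```python
from docling_core.types.doc import DocItemLabel

def merge_adjacent_texts(texts):
    """Naively join consecutive plain-text items into single blocks."""
    merged = []
    for item in texts:
        if merged and item.label == DocItemLabel.TEXT and merged[-1].label == DocItemLabel.TEXT:
            prev = merged[-1]
            prev.text = prev.text.rstrip() + " " + item.text.lstrip()
            prev.orig = prev.orig.rstrip() + " " + item.orig.lstrip()
            # open question: keep DocItemLabel.TEXT, or introduce a new label
            # for combined blocks that the layout model never emitted?
        else:
            merged.append(item)
    return merged

conv_result.document.texts = merge_adjacent_texts(conv_result.document.texts)
```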