Adding Label Filters in Hybrid Chunking #667

Open
Dhairya1007 opened this issue Dec 31, 2024 · 2 comments
Labels: chunker, enhancement (New feature or request)

Comments

@Dhairya1007

Requested feature

I just started using Docling and I really like the results of the Hybrid Chunking mechanism (with merge_peers enabled), as it effectively captures content that is split across pages. However, I think it would be really useful if the chunker had an additional parameter for ignoring specific texts, such as the page_footers and page_headers (the models already detect and label these as such), while merging the content. I am attaching a screenshot of the actual text and its parsed chunk for clarity, along with an image showing the corresponding footer and header texts in the DoclingDocument.

[Image: docling_hybrid_chunking_enhancement]

[Image: docling_texts_labels]

...

Alternatives

I tried implementing this on my own by working with the base DoclingDocument that was returned, but I thought it would be a good feature to have built in, as it would let people ignore repetitive content within a document.
...

@Dhairya1007 Dhairya1007 added the enhancement New feature or request label Dec 31, 2024
@cau-git cau-git added the chunker label Jan 6, 2025

doncat99 commented Jan 7, 2025

Is there a temporary workaround? For example, deleting the nodes directly in the document class?

[Image: WX20250107-123421@2x]


doncat99 commented Jan 7, 2025

## filter
from docling_core.types.doc import DocItemLabel

# Assumes conv_result is an existing Docling conversion result.
# Labels whose text should be blanked out
remove_text_labels = {DocItemLabel.PAGE_HEADER, DocItemLabel.PAGE_FOOTER}
# Non-numeric header/footer strings, collected for the optional second pass
remove_label_text = set()

def is_number(string):
    """Return True if the string parses as a float (e.g. a bare page number)."""
    try:
        float(string)
        return True
    except ValueError:
        return False

# Blank out every text item whose label is in remove_text_labels.
# The item is emptied rather than removed from the list.
for text in conv_result.document.texts:
    if text.label in remove_text_labels:
        if not is_number(text.orig):
            remove_label_text.add(text.orig)
        text.orig = ""
        text.text = ""

# Optional: also blank out any other item whose text matches a collected header/footer string
for text in conv_result.document.texts:
    if text.orig in remove_label_text:
        text.orig = ""
        text.text = ""

I have written a temporary workaround above.
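For anyone who wants to try the idea without a converted document at hand, here is a minimal, self-contained sketch of the same two-pass filter as a reusable function. `TextItem` and `blank_labelled_text` are hypothetical names made up for this illustration, and the labels are plain strings; none of this is part of the Docling API.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class TextItem:
    """Hypothetical stand-in for a text item with label, orig and text fields."""
    label: str
    orig: str
    text: str


def is_number(value: str) -> bool:
    """Return True if the string parses as a float (e.g. a bare page number)."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def blank_labelled_text(items: List[TextItem], labels: Set[str],
                        also_blank_repeats: bool = True) -> Set[str]:
    """Blank items whose label is in `labels`.

    Optionally also blank other items whose text equals a non-numeric
    header/footer string. Returns the set of collected strings.
    """
    collected: Set[str] = set()
    for item in items:
        if item.label in labels:
            # Skip bare numbers so that unrelated "3" items survive the second pass.
            if item.orig and not is_number(item.orig):
                collected.add(item.orig)
            item.orig = ""
            item.text = ""
    if also_blank_repeats:
        for item in items:
            if item.orig in collected:
                item.orig = ""
                item.text = ""
    return collected


# Made-up sample data for illustration only.
items = [
    TextItem("page_header", "ACME Annual Report", "ACME Annual Report"),
    TextItem("text", "Revenue grew in 2024.", "Revenue grew in 2024."),
    TextItem("page_footer", "3", "3"),
    TextItem("text", "ACME Annual Report", "ACME Annual Report"),
    TextItem("text", "3", "3"),
]
removed = blank_labelled_text(items, {"page_header", "page_footer"})
```

Blanking instead of deleting keeps the list positions stable, which is presumably why the workaround leaves the `remove` call commented out.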

I also have some suggestions about the Docling document structure.

import json
from docling_core.types.doc import ImageRefMode

# Export JSON format:
with (output_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
    fp.write(json.dumps(conv_result.document.export_to_dict()))

# Export Markdown format:
with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
    fp.write(conv_result.document.export_to_markdown(image_mode=ImageRefMode.REFERENCED))

The JSON DomTree shows that the "texts" list remains fragmented (a sentence or a single line of text is split across many text blocks), which suggests that label errors may go uncorrected and carry over into the text blocks.

The export_to_markdown function contains logic for combining text blocks into more meaningful paragraphs. Should that logic stay in the Markdown export, or should it be moved into the DomTree construction?

Moreover, I guess the labels are assigned by a layout model similar to microsoft/layoutlmv3-base. That raises a question: what should the label of the "combined text blocks" be, since the layout model can no longer help at that point?
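One possible answer to the label question, sketched under strong assumptions: if consecutive fragments are merged only when they share a label, the merged block can simply keep that label. The `Fragment` class and `merge_adjacent` function below are hypothetical and do not reflect how export_to_markdown or the DomTree construction actually work. The sketch also merges any same-label neighbours, so it would need an extra rule to keep separate paragraphs apart.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """Hypothetical stand-in for one entry in the "texts" list."""
    label: str
    text: str


def merge_adjacent(fragments: List[Fragment]) -> List[Fragment]:
    """Merge consecutive fragments that share a label; the result keeps that label."""
    merged: List[Fragment] = []
    for frag in fragments:
        if merged and merged[-1].label == frag.label:
            # Same label as the previous block: append the text to it.
            merged[-1] = Fragment(frag.label, f"{merged[-1].text} {frag.text}")
        else:
            # Different label (or first fragment): start a new block.
            merged.append(Fragment(frag.label, frag.text))
    return merged


# Made-up sample data for illustration only.
fragments = [
    Fragment("text", "The quick brown"),
    Fragment("text", "fox jumps."),
    Fragment("page_footer", "Page 3"),
    Fragment("text", "Next paragraph."),
]
merged = merge_adjacent(fragments)
```

Fragments with differing labels are never combined here, so the question of a "new" label only comes up if the merge rule is relaxed.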


4 participants