Requested feature
I just started using Docling, and I really like the results of the Hybrid Chunking mechanism (with merge_peers enabled), since it effectively captures content that is split across pages. However, it would be very useful if the chunking mechanism had an additional parameter for ignoring specific texts, such as the page_footers and page_headers (these are already captured and labelled as such by the models), while merging content. I am attaching a screenshot of the actual text and its parsed chunk for clarity, as well as an image showing the corresponding footer and header texts in the DoclingDocument.
...
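For reference, this is roughly how I invoke the chunker today (a minimal sketch; the input file name is a placeholder):

```python
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

doc = DocumentConverter().convert("report.pdf").document  # placeholder path
chunker = HybridChunker(merge_peers=True)  # merge_peers merges sibling chunks across pages

for chunk in chunker.chunk(dl_doc=doc):
    print(chunk.text)
```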
Alternatives
I tried implementing this on my own by manipulating the DoclingDocument that is returned, but I think this would be a good built-in feature, since it would let people ignore repetitive content within a document.
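Concretely, my own attempt filtered after chunking, based on the items each chunk was built from (a sketch, assuming chunk.meta.doc_items carries the source items):

```python
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
from docling_core.types.doc import DocItemLabel

IGNORE_LABELS = {DocItemLabel.PAGE_HEADER, DocItemLabel.PAGE_FOOTER}

doc = DocumentConverter().convert("report.pdf").document  # placeholder path
chunker = HybridChunker(merge_peers=True)

for chunk in chunker.chunk(dl_doc=doc):
    labels = {item.label for item in chunk.meta.doc_items}
    if labels - IGNORE_LABELS:  # keep chunks with any non-header/footer content
        print(chunk.text)
```

This can only drop whole chunks: once merge_peers has already folded a footer into a merged chunk, that footer text remains embedded in chunk.text, which is exactly why an ignore parameter inside the chunker itself would be cleaner.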
...
```python
from docling_core.types.doc import DocItemLabel

## filter: blank out page headers/footers in the converted document
# (conv_result is the result of DocumentConverter.convert())
remove_text_labels = [DocItemLabel.PAGE_HEADER, DocItemLabel.PAGE_FOOTER]
remove_label_text = []

def is_number(string):
    try:
        float(string)  # try converting to a float
        return True
    except ValueError:
        return False

# remove text with label in remove_text_labels
for index, text in enumerate(conv_result.document.texts):
    if text.label in remove_text_labels:
        if not is_number(text.orig):  # skip bare page numbers
            remove_label_text.append(text.orig)
        # conv_result.document.texts.remove(text)
        conv_result.document.texts[index].orig = ""
        conv_result.document.texts[index].text = ""

# remove duplicates
remove_label_text = list(set(remove_label_text))

# optional: also blank out other text items with the same content
# (catches headers/footers the layout model missed or mislabelled)
for index, text in enumerate(conv_result.document.texts):
    if text.orig in remove_label_text:
        # conv_result.document.texts.remove(text)
        conv_result.document.texts[index].orig = ""
        conv_result.document.texts[index].text = ""
```
I have written a temporary workaround above, and I also have some suggestions regarding the Docling document structure.
```python
import json

from docling_core.types.doc import ImageRefMode

# Export JSON format:
with (output_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
    fp.write(json.dumps(conv_result.document.export_to_dict()))

# Export Markdown format:
with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
    fp.write(conv_result.document.export_to_markdown(image_mode=ImageRefMode.REFERENCED))
```
The JSON DomTree shows that the "texts" list remains fragmented (a single sentence or text line is split across many text blocks), which suggests the tag labels are not being corrected and the error propagates into the text blocks.
In the export_to_markdown function we can see the logic that combines text blocks into a more meaningful paragraph. Should such logic stay in the Markdown export, or should it be moved into DomTree construction?
What is more, I assume the labels are tagged by a layout model similar to microsoft/layoutlmv3-base, so a good question is what label the combined text blocks should receive, since at that point we can no longer get help from the layout model.
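To make the question concrete, a construction-time merge might look something like the sketch below. This is purely illustrative: it only joins consecutive items in the flat texts list, ignores provenance and the body's child references (which real code would have to update), and the label given to the merged block is exactly the open question.

```python
from docling_core.types.doc import DocItemLabel

def merge_adjacent_texts(texts):
    """Naively join consecutive plain-text items into single blocks."""
    merged = []
    for item in texts:
        if merged and item.label == DocItemLabel.TEXT and merged[-1].label == DocItemLabel.TEXT:
            prev = merged[-1]
            prev.text = prev.text.rstrip() + " " + item.text.lstrip()
            prev.orig = prev.orig.rstrip() + " " + item.orig.lstrip()
            # open question: keep DocItemLabel.TEXT, or introduce a new label
            # for combined blocks that the layout model never emitted?
        else:
            merged.append(item)
    return merged

conv_result.document.texts = merge_adjacent_texts(conv_result.document.texts)
```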