Will this extracts the image embedded in pdf. #365

mailtoshwetha09 · 2024-11-18T14:19:00Z

I have a PDF in Japanese that I need to translate. Using HierarchicalChunker I am able to extract the chunks and translate them. The PDF has an embedded image containing Japanese text, and I do not see it in the chunks. Will this support extraction of embedded images?

SDanaan · 2024-11-19T19:17:39Z

I, too, have a pdf with images but docling is not returning the images in the result.
It is a public pdf, here is the link.
https://www.pa.gov/content/dam/copapwp-pagov/en/education/documents/instruction/assessment-and-accountability/keystone-exams/keystone-exams-item-and-scoring-samplers/2023%20keystone%20iss%20biology.pdf

It is in English. OCR works well, and when I review the chunks with this code:

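from docling_core.transforms.chunker import HierarchicalChunker

doc = result.document  # result: the conversion result returned by docling
chunks = list(HierarchicalChunker().chunk(doc))
print(chunks[63:81])  # the chunk range covering page 11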

I get the text chunks on page 11 but no images. I get the tables and the text for the pages. I had been hoping to use docling for the many pdfs that have not properly labelled their images. Will docling support extraction of these images? Is it waiting for image metadata or can it detect them? Thanks for any help.

4 replies

mailtoshwetha09 Nov 20, 2024
Author

@SDanaan I tried https://github.com/DS4SD/docling/blob/main/docs/examples/export_figures.py and was able to extract the images. I still have to explore translating the text inside the images.

SDanaan Nov 20, 2024

I, too, ran that code. The first set of images it iterates through are renderings of each entire page, which worked excellently but does not require "detecting images". Later in the code it is supposed to find images within the document, for example on each page. When I went to view the images, such as the one on page 11, they had not been extracted. I grabbed a chunk range that should have included images
(chunks[63:81])
but no images were in the chunks.

SDanaan Nov 20, 2024

I have made a PDF of just two pages
2023 keystone iss biology-pages-2-pages-1.pdf
and I have used this code to look at the pictures, whose image attribute has type NoneType:

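from docling_core.types.doc import PictureItem

# Print every picture item docling detected, with its nesting level.
for item, level in result.document.iterate_items():
  if isinstance(item, PictureItem):
    print("We have: ", item, " at level: ", level)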

The results have image=None

We have: self_ref='#/pictures/0' parent=RefItem(cref='#/body') children=[] label=<DocItemLabel.PICTURE: 'picture'> prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=189.9251708984375, t=695.8134765625, r=422.7407531738281, b=490.1776123046875, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 0))] captions=[] references=[] footnotes=[] image=None annotations=[] at level: 1
We have: self_ref='#/pictures/1' parent=RefItem(cref='#/body') children=[] label=<DocItemLabel.PICTURE: 'picture'> prov=[ProvenanceItem(page_no=2, bbox=BoundingBox(l=307.37384033203125, t=779.3055419921875, r=328.6535949707031, b=752.5662841796875, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 0))] captions=[] references=[] footnotes=[] image=None annotations=[] at level: 1
The bounding boxes seem correct but no image seems to be available
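
One possible explanation for image=None, offered as an assumption rather than a diagnosis of this particular run: docling keeps a cropped bitmap for each picture only when the PDF pipeline is asked to generate one, which the export_figures.py example linked above does through its pipeline options. A minimal sketch of that configuration follows; build_picture_converter is a hypothetical helper name, and the scale of 2.0 is an illustrative value.

```python
def build_picture_converter(images_scale: float = 2.0):
    """Return a DocumentConverter configured to keep cropped picture images."""
    # Imported inside the function so this file can be loaded without docling.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.images_scale = images_scale     # resolution multiplier for rendered images
    options.generate_picture_images = True  # keep a bitmap for each PictureItem

    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
```

If this assumption holds, converting the two-page PDF with build_picture_converter().convert(...) and rerunning the loop above should show the image field populated instead of None.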

SDanaan Nov 20, 2024

I got it working, and it is working FANTASTICALLY.
I had to make some changes to the file handling in https://github.com/DS4SD/docling/blob/main/docs/examples/export_figures.py
Specifically, I changed the file-opening call from
with CURRENT_VARIABLE.open("wb") as fp:
to
with open(CURRENT_VARIABLE, "wb") as fp:
and I cast my output_dir to a string and concatenated that with the file names. It is working very well now. Thanks for the excellent project!
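
The two spellings behave the same when the variable is a pathlib.Path; the method form fails only when the variable is a plain string, which may be what happened here (an assumption, since no traceback is shown). A minimal sketch that keeps Path objects and avoids string concatenation, with a temporary directory and a made-up file name standing in for the real output paths:

```python
import tempfile
from pathlib import Path


def write_bytes_both_ways(output_dir, filename: str, payload: bytes) -> Path:
    """Write payload twice, once per spelling, and return the target path."""
    # Path() accepts a str or a Path, so either kind of output_dir works.
    target = Path(output_dir) / filename  # "/" joins path parts portably

    # Method form, as in export_figures.py: requires a Path object.
    with target.open("wb") as fp:
        fp.write(payload)

    # Built-in form: accepts a str or a Path and behaves the same way.
    with open(target, "wb") as fp:
        fp.write(payload)

    return target


with tempfile.TemporaryDirectory() as tmp:
    # "picture-1.png" is a hypothetical file name used only for illustration.
    written = write_bytes_both_ways(tmp, "picture-1.png", b"not a real PNG")
    round_trip = written.read_bytes()
```

Wrapping output_dir in Path() once, near the top of the script, would let the original example code run unchanged.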

Mar-Lourenco · 2024-11-27T11:03:28Z

Hello. I used DocumentConverter() and export_to_dict(), and through that I am able to access the coordinates of each picture's location on the page. Do you know of a way to access the text inside the picture with docling? Thank you.
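
The picture coordinates quoted earlier in this thread use a bottom-left origin (CoordOrigin.BOTTOMLEFT), while most image libraries crop from the top left. A minimal sketch of the conversion, assuming a page height of 792 points (US Letter, not confirmed for this PDF) and a page image rendered at 2 pixels per point; to_topleft_pixels is a hypothetical helper name:

```python
from typing import Tuple


def to_topleft_pixels(
    l: float, t: float, r: float, b: float,  # names mirror the bbox fields
    page_height: float, scale: float = 1.0,
) -> Tuple[int, int, int, int]:
    """Convert a bottom-left-origin bbox in points to a top-left-origin pixel box.

    Returns (left, upper, right, lower), the order that image croppers such as
    Pillow's Image.crop expect.
    """
    left = round(l * scale)
    upper = round((page_height - t) * scale)  # flip the y axis
    right = round(r * scale)
    lower = round((page_height - b) * scale)
    return left, upper, right, lower


# Bounding box of '#/pictures/0' from the output above; 792 pt is an assumed page height.
crop_box = to_topleft_pixels(
    l=189.9251708984375, t=695.8134765625,
    r=422.7407531738281, b=490.1776123046875,
    page_height=792.0, scale=2.0,
)
print(crop_box)  # (380, 192, 845, 604)
```

A box like this could be used to crop a page image rendered at the same scale, and the crop could then be passed to a separate OCR tool of your choice.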

3 replies

cau-git Nov 27, 2024
Maintainer

Exposing in-picture text is part of a current development effort. We will announce when it is available.

Mar-Lourenco Nov 27, 2024

Thank you!

pitta-bread Nov 27, 2024

@cau-git wrote: "Exposing in-picture text is part of a current development effort. We will announce when it is available."

This would be amazing and very useful! I imagine it could be done with a small extension to the current pipeline, similar to the example, but with an OCR pass over the pictures that follows the OCR options already defined for pages. The results could be placed in the annotations of each PictureItem or in a new attribute.

Please let us know if there is an issue we can upvote to help prioritise, help further define the issue, or any other way to contribute.
