Replies: 5 comments
-
There are some mistakes in the above phrase:
If you only work with pypdfium2 functions, you needn't worry about coordinate conversion at all and can just use PDF coordinates as-is, i.e. you simply pass the result of
If you want to draw to a bitmap, then yes, you will need to translate from PDF to bitmap coordinates, either manually or via
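For the manual route, here is a minimal sketch of the translation. It assumes an unrotated page whose page box starts at (0, 0), a rectangle given as (left, bottom, right, top) in PDF points with the origin at the bottom left, and a uniform render scale. `pdf_rect_to_bitmap` is a hypothetical helper written for illustration, not a pypdfium2 function.

```python
def pdf_rect_to_bitmap(rect, page_height, scale=1.0):
    """Map a PDF rectangle to bitmap pixel coordinates (origin top left).

    rect        -- (left, bottom, right, top) in PDF points, origin bottom left
    page_height -- height of the unrotated page in PDF points
    scale       -- pixels per point used when rendering (assumed uniform)
    """
    left, bottom, right, top = rect
    # The x axis keeps its direction, the y axis is flipped:
    # the PDF "top" edge becomes the smaller pixel y value.
    x0 = left * scale
    y0 = (page_height - top) * scale
    x1 = right * scale
    y1 = (page_height - bottom) * scale
    return (x0, y0, x1, y1)


# A rectangle near the top of a US Letter page, rendered at 2 pixels per point.
box = pdf_rect_to_bitmap((72, 700, 144, 712), page_height=792, scale=2.0)
```

Rotated pages and non-zero crop box offsets need extra handling that this sketch leaves out.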
-
And an important note on
-
@mara004 First of all, I did search the open and closed issues, my bad if I missed something. Also my bad for having a typo in the post.
Bbox coordinates are required to sort the text into reading order, since it is not uncommon for extracted text to be returned out of reading sequence. Another typical example is skewed documents, where simple logic like sorting first by y-axis and then by x-axis is useless. Bboxes need to be adjusted based on the document layout before they can be sorted. Only with correct bbox coordinates can one then sort the bboxes and the words in them to get a coherent raw text output, even if the bboxes capture the words perfectly.
Additionally, if one wishes to use the bboxes and words as ground truth for ML applications, the bboxes have to be converted into a more typical coordinate system where x0/y0 is the upper left corner of an image. COCO format is an example of this. So if the document is rotated, it is much easier to rotate the bboxes first and then pass the output through code that takes care of sorting.
As far as I know, pypdfium2 currently ranks 2nd in accuracy and speed of text extraction, after PyMuPDF. Unlike PyMuPDF, which is AGPL licensed, pypdfium2 is permissively licensed and a better choice imo if the scope of the project can change over time. Just a point of view on the benefits of this library compared to other alternatives in the OSS space.
Here's an example: red boxes are drawn using the bbox values as is, blue boxes are drawn after rotation.
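To make the "rotate first, then sort" idea concrete, here is a rough sketch in plain Python. It assumes the bboxes come as (left, bottom, right, top) in unrotated PDF space, that the page rotation is a clockwise multiple of 90 degrees, and that the width and height passed in are those of the unrotated page. All three function names are hypothetical helpers, not pypdfium2 API, and the line grouping is a deliberately naive heuristic with an arbitrary tolerance.

```python
def rotate_rect(rect, page_width, page_height, rotation):
    """Map a PDF rect to (x0, y0, x1, y1) with the origin at the top left
    of the page as displayed, i.e. after applying the clockwise rotation."""
    left, bottom, right, top = rect
    w, h = page_width, page_height  # unrotated page size (assumption)
    rotation %= 360
    if rotation == 0:
        return (left, h - top, right, h - bottom)
    if rotation == 90:
        return (bottom, left, top, right)
    if rotation == 180:
        return (w - right, bottom, w - left, top)
    if rotation == 270:
        return (h - top, w - right, h - bottom, w - left)
    raise ValueError("rotation must be a multiple of 90 degrees")


def to_coco(box):
    """Convert (x0, y0, x1, y1) to the COCO bbox layout [x, y, width, height]."""
    x0, y0, x1, y1 = box
    return [x0, y0, x1 - x0, y1 - y0]


def reading_order(boxes, line_tol=5.0):
    """Naive reading order: group boxes into lines by vertical center,
    then sort each line from left to right."""
    lines = []
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        center_y = (box[1] + box[3]) / 2
        if lines and abs(center_y - lines[-1][0]) <= line_tol:
            lines[-1][1].append(box)      # same line as the previous box
        else:
            lines.append((center_y, [box]))  # start a new line
    return [b for _, line in lines for b in sorted(line, key=lambda b: b[0])]


# Example: one rect on a 612 x 792 page stored with a 90 degree rotation.
normalized = rotate_rect((72, 700, 144, 712), 612, 792, 90)
```

This does not address skewed scans or multi-column layouts, where the grouping step would need layout information first.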
-
Related issues are #228 and #214. Thanks for elaborating on the use case. Don't misunderstand me, I agree that coordinate normalization is vital for some use cases. However, I don't currently have time/interest to dive into this topic, and there are higher priority tasks ATM (mainly consolidating tests and existing helpers).
-
I'm using pypdfium2 to extract text from PDFs. First I open the file, iterate over the pages and get the rectangles that I then iterate over to get bboxes. Then I use those bboxes to get text.
The bboxes are not in the regular PDF coordinate system (top left = 0, 0), so I change those to get a more typical x0, y0, x1, y1 coordinate system.
However, if the page is rotated, the text is extracted fine but the bboxes are still rotated. I fix that with
This works fine, but I'm not sure if this is a valid fix or not. Another alternative I tried was this with pypdfium2.raw
But this does not seem to give correct rectangles for bboxes. Rotation definitely is rotating them, but the bboxes are different than those I get with the first version of the script.
What is the correct approach to extracting text from rotated documents?