Correct way to deal with rotated documents for text extraction? #235

giuqoob · 2023-07-03T14:32:52Z

I'm using pypdfium2 to extract text from PDFs. First I open the file, iterate over the pages and get the rectangles that I then iterate over to get bboxes. Then I use those bboxes to get text.

import pypdfium2 as pdfium

pdf = pdfium.PdfDocument('path/to/file')
for i in range(len(pdf)):
    page = pdf[i]
    # Text page, rotation and height belong to the page, not the document
    text_page = page.get_textpage()
    rotation = page.get_rotation()
    page_height = page.get_height()
    rects = text_page.count_rects()
    for rect in range(rects):
        left, bottom, right, top = text_page.get_rect(rect)
        text = text_page.get_text_bounded(left, bottom, right, top)

The bboxes are not in the regular PDF coordinate system (top left = 0, 0), so I change those to get a more typical x0, y0, x1, y1 coordinate system.

x0, x1 = left, right
y0, y1 = page_height - top, page_height - bottom

However, if the page is rotated, the text is extracted fine but the bboxes are still rotated. I fix that with

def rotate_bbox(left: float, bottom: float, right: float, top: float, rotation: float, pdf_width: float, pdf_height: float) -> tuple[float, float, float, float]:
    if (rotation // 90) % 2 != 0:
        x0, x1 = bottom, top
        y0, y1 = left, right
    else:
        x0, x1 = left, right
        y0, y1 = bottom, top
    if (rotation // 180) != 0:
        x0, x1 = pdf_width - x1, pdf_width - x0
    return x0, y0, x1, y1

This works fine, but I'm not sure whether it is a valid fix. An alternative I tried uses pypdfium2.raw:

from ctypes import c_int, pointer
from pypdfium2.raw import FPDF_PageToDevice


def rot_ctyp(rotation: int):
    # Map degrees to pdfium's rotation constant (0, 1, 2, 3 for 0, 90, 180, 270 degrees clockwise)
    if rotation == 0:
        return 0
    elif rotation == 90:
        return 1
    elif rotation == 180:
        return 2
    elif rotation == 270:
        return 3
    else:
        return None


def _convert_bbox(page, x0: float, y0: float, page_width: int, page_height: int, rotation: int = 0):
    start_x, start_y = 0, 0
    page_x, page_y = x0, y0
    rotate = rot_ctyp(rotation)
    size_x, size_y = page_width, page_height
    # pdfium declares device_x/device_y as int*, so use c_int
    # (c_long is 64-bit on most 64-bit non-Windows platforms)
    device_x, device_y = pointer(c_int()), pointer(c_int())
    success = FPDF_PageToDevice(page, start_x, start_y, size_x, size_y, rotate, page_x, page_y, device_x, device_y)
    if success == 1:
        return device_x[0], device_y[0]
    # Fall back to the untransformed point if the conversion failed
    return x0, y0

bbox_width = right - left
bbox_height = top - bottom
x0, y0 = _convert_bbox(page, left, bottom, page_width, page_height, rotation)
x1, y1 = x0 + bbox_width, y0 + bbox_height

But this does not seem to give correct rectangles for bboxes. Rotation definitely is rotating them, but the bboxes are different than those I get with the first version of the script.

What is the correct approach to extracting text from rotated documents?

mara004 · 2023-07-03T15:57:20Z

Maintainer

The bboxes are not in the regular PDF coordinate system (top left = 0, 0), so I change those to get a more typical x0, y0, x1, y0 coordinate system.

There are some mistakes in the above phrase:

The regular PDF coordinate system, if we can use that term, has its origin at the bottom left corner. Top left is commonly the origin of bitmaps. "regular" is difficult since the PDF coordinate system may theoretically be laid out between any opposite corners, but most PDFs comply with the convention.
Bounding boxes are given as 4 coordinates as stored in the document, with values relative to the PDF coordinate system. In PDF, rotation modifies the coordinate system, so raw PDF coordinates always refer to the non-rotated page.
Supposedly you mean x0 y0 x1 y1, not x0 y0 x1 y0.

If you only work with pypdfium2 functions, you needn't worry about coordinate conversion at all and can just use PDF coordinates as-is, i.e. you simply pass the result of get_rect() to get_text_bounded() unchanged and it will be correct.

If you want to draw to a bitmap, then yes, you will need to translate from PDF to bitmap coordinates. Either manually or via FPDF_PageToDevice(). However, this is currently your task as caller. This is mainly a bindings infrastructure project. For PDF background, refer to the specification. Note that there have been various similar reports in the past.
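
For the manual route, a minimal sketch might look like the following. It assumes the page box starts at (0, 0), that `width` and `height` are the dimensions of the unrotated page in PDF points, and that the rotation is a clockwise multiple of 90 degrees. `pdf_bbox_to_image_bbox` and its parameters are hypothetical names, not part of pypdfium2. The result keeps floating-point precision and uses a top-left origin.

```python
def pdf_bbox_to_image_bbox(left, bottom, right, top, width, height,
                           rotation=0, scale=1.0):
    """Map a PDF-space bbox (origin bottom-left, y up) to image space
    (origin top-left, y down).

    Assumptions: `width`/`height` describe the UNROTATED page, the page
    box origin is (0, 0), and `rotation` is clockwise in degrees.
    """
    def to_image(x, y):
        if rotation == 0:
            return x, height - y
        if rotation == 90:
            return y, x
        if rotation == 180:
            return width - x, y
        if rotation == 270:
            return height - y, width - x
        raise ValueError(f"unsupported rotation: {rotation}")

    # Transform two opposite corners, then re-normalize so that
    # (x0, y0) is the upper-left and (x1, y1) the lower-right corner.
    xa, ya = to_image(left, bottom)
    xb, yb = to_image(right, top)
    x0, x1 = sorted((xa, xb))
    y0, y1 = sorted((ya, yb))
    return x0 * scale, y0 * scale, x1 * scale, y1 * scale
```

If the page has a crop box whose origin is not (0, 0), its offsets would need to be subtracted from the input coordinates first.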

mara004 · 2023-07-03T16:03:59Z

Maintainer

And an important note on FPDF_PageToDevice(): the rotation parameter is only for additional rotation on rendering, not page rotation, i.e. you should pass 0 here.

giuqoob · 2023-07-03T16:30:36Z

Author

@mara004 First of all, I did search around for open and closed issues, my bad if I missed something. Also my bad for having a typo in the post.

bbox coordinates are required to sort the text into reading order - it is not uncommon that extracted text is returned out of reading sequence. Another typical example is skewed documents where simple logic like sort first by y-axis and then by x-axis is useless. bboxes need to be adjusted based on information of the document layout, before they can be sorted.

Only with correct bbox coordinates can one sort the bboxes and the words in them to get a coherent raw text output, even if the bboxes do capture the words perfectly. Additionally, if one wishes to use the bboxes and words as ground truth for ML applications, the bboxes have to be converted into a more typical coordinate system where x0/y0 is the upper left corner of an image. COCO format is an example of this.
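
As an illustration of that target format: COCO stores a box as `[x, y, width, height]`, with the origin at the top-left corner of the image. A small sketch, assuming the box has already been converted to top-left-origin `x0, y0, x1, y1` pixel coordinates (`to_coco_bbox` is a hypothetical helper name):

```python
def to_coco_bbox(x0, y0, x1, y1):
    """Convert corner coordinates (top-left origin) to a COCO-style
    [x, y, width, height] list.

    Assumption: (x0, y0) is the upper-left and (x1, y1) the
    lower-right corner, already in image pixels.
    """
    return [x0, y0, x1 - x0, y1 - y0]
```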

So if the document is rotated, it is much easier to rotate the bboxes first and then pass the output through code that takes care of sorting.
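
To make the sorting step concrete, here is a rough sketch of one possible approach, not a definitive one: group boxes into lines by the vertical distance between their centers, then order each line left to right. It assumes boxes are `(x0, y0, x1, y1, text)` tuples with a top-left origin and an upright, unskewed, single-column layout. `sort_reading_order` and `line_tolerance` are hypothetical names.

```python
def sort_reading_order(boxes, line_tolerance=0.5):
    """Order (x0, y0, x1, y1, text) boxes top-to-bottom, left-to-right.

    Assumption: top-left origin, upright single-column layout.
    Two boxes share a line if their vertical centers differ by at most
    `line_tolerance` times the taller of the two heights.
    """
    lines = []
    # Walk the boxes from top to bottom by vertical center.
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        center = (box[1] + box[3]) / 2
        height = box[3] - box[1]
        last = lines[-1] if lines else None
        if last and abs(center - last["center"]) <= line_tolerance * max(height, last["height"]):
            last["boxes"].append(box)
        else:
            lines.append({"center": center, "height": height, "boxes": [box]})

    ordered = []
    for line in lines:
        # Within a line, read left to right.
        ordered.extend(sorted(line["boxes"], key=lambda b: b[0]))
    return ordered
```

Skewed or multi-column pages would need a layout-aware grouping step before this kind of sort.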

As far as I know, pypdfium2 currently ranks 2nd in terms of accuracy and speed of text extraction after PyMuPDF. Unlike PyMuPDF which is AGPL licensed, pypdfium2 is actually free & open source and a better choice imo if the scope of the project can change over time. Just a point of view as to the benefits of this library compared to other alternatives in the OSS space.

Here's an example: Red boxes are drawn by using bbox values as is, blue is after rotation.

mara004 · 2023-07-03T17:46:22Z

Maintainer

First of all, I did search around for open and closed issues

Related issues are #228 and #214.

Thanks for elaborating on the use case. Don't misunderstand me, I agree that coordinate normalization is vital for some use cases. However, I don't currently have time/interest to dive into this topic, and there are higher priority tasks ATM (mainly consolidating tests and existing helpers).

FPDF_PageToDevice() can already translate coordinates, but the issue we had in #228 is that it only provides int output, resulting in a loss of precision, which is a problem if you do not target a fixed-size bitmap.
I think the best solution would be if pdfium could provide a new set of functions to normalize coordinates bitmap-independently with floating-point output, maybe even with configurable origin (bottom left or top left).
