How to extract character with coordinates from PDF using pypdfium2 ? #214

pranayamhatre175 · 2023-04-25T13:41:25Z

Hi,

I am trying to extract each character and its coordinates in the following way:

1. I get the index by iterating up to textpage.count_chars().
2. To get the coordinates, I use textpage.get_charbox(index).
3. To get the character, I use textpage.get_text_range(index, count=1).

For some PDFs, I am getting wrong coordinates for characters using the above method. Is this the correct way to extract characters and their coordinates? If yes, why am I getting wrong coordinates for characters in some PDFs?

If this is the wrong way to extract characters and coordinates, could you please point me to the right method?

mara004 · 2023-04-25T15:14:41Z

That sounds like correct usage. Concerning point 3, if you only want to extract single chars, you can also use FPDFText_GetUnicode() from pypdfium2.raw (raw pdfium API).
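
For reference, the approach described in the question can be sketched roughly as below. This is only a minimal illustration: `iter_chars` and `CharInfo` are hypothetical names (not part of pypdfium2), and the function relies only on the `count_chars()`, `get_charbox()` and `get_text_range()` methods mentioned in this thread.

```python
from typing import Iterator, NamedTuple, Tuple


class CharInfo(NamedTuple):
    """Hypothetical container for one extracted character."""
    index: int
    char: str
    box: Tuple[float, float, float, float]  # (left, bottom, right, top), PDF units


def iter_chars(textpage) -> Iterator[CharInfo]:
    """Yield every character of a text page together with its char box.

    `textpage` is expected to behave like a pypdfium2 text page, i.e. to
    provide count_chars(), get_text_range() and get_charbox().
    """
    for index in range(textpage.count_chars()):
        char = textpage.get_text_range(index=index, count=1)
        box = tuple(textpage.get_charbox(index))
        yield CharInfo(index, char, box)


# Usage against a real file (requires pypdfium2; "example.pdf" is a placeholder):
#   import pypdfium2 as pdfium
#   pdf = pdfium.PdfDocument("example.pdf")
#   textpage = pdf[0].get_textpage()
#   for info in iter_chars(textpage):
#       print(info.index, repr(info.char), info.box)
```

The boxes yielded here are still in PDF canvas units, not bitmap pixels.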

Can you share a minimal reproducible example where you believe get_charbox() returns wrong coordinates?

In general, get_charbox() returns values for (left, bottom, right, top) in PDF canvas units.
Note that translating PDF coordinates into bitmap coordinates (and vice versa) takes more than just moving the origin (bottom left <-> top left): a lot more needs to be taken into account (e.g. rotation, non-zero origin, ...). Read what the PDF spec says about the coordinate system. I recommend using FPDF_PageToDevice() / FPDF_DeviceToPage() for that operation.
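
To illustrate why moving the origin alone is not enough, here is a simplified pure-Python sketch of the conversion. It is not a replacement for FPDF_PageToDevice(). It assumes that `page_box` is the page box used for rendering, given as (left, bottom, right, top) in PDF units, that `rotation` is the page's clockwise rotation in multiples of 90°, and that the bitmap is a uniformly scaled render of the whole page. The function names are hypothetical, and any other factors are ignored.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def pdf_point_to_bitmap(x: float, y: float, page_box: Box,
                        rotation: int = 0, scale: float = 1.0) -> Tuple[float, float]:
    """Map a PDF point to bitmap coordinates (origin top left, y pointing down)."""
    x0, y0, x1, y1 = page_box
    rotation %= 360
    if rotation == 0:
        dx, dy = x - x0, y1 - y
    elif rotation == 90:    # page shown rotated 90 degrees clockwise
        dx, dy = y - y0, x - x0
    elif rotation == 180:
        dx, dy = x1 - x, y - y0
    elif rotation == 270:
        dx, dy = y1 - y, x1 - x
    else:
        raise ValueError("rotation must be a multiple of 90")
    return dx * scale, dy * scale


def pdf_box_to_bitmap(box: Box, page_box: Box,
                      rotation: int = 0, scale: float = 1.0) -> Box:
    """Convert (left, bottom, right, top) in PDF units to
    (left, top, right, bottom) in bitmap pixels."""
    left, bottom, right, top = box
    ax, ay = pdf_point_to_bitmap(left, bottom, page_box, rotation, scale)
    bx, by = pdf_point_to_bitmap(right, top, page_box, rotation, scale)
    # Rotation may swap which corner ends up where, so re-sort the extremes.
    return (min(ax, bx), min(ay, by), max(ax, bx), max(ay, by))
```

For example, on a 100 x 200 page with its origin at (0, 0), the box (10, 20, 30, 40) ends up near the bottom left of an unrotated render, but near the top left once a 90° rotation is applied.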

mara004 · 2023-04-26T16:12:24Z

Hmm, I browsed pdfium's bug tracker a bit, and by chance found the following two bug reports that seem related:
https://bugs.chromium.org/p/pdfium/issues/detail?id=1821
https://bugs.chromium.org/p/pdfium/issues/detail?id=1562

So possibly this might be an actual bug in pdfium. However, as mentioned above, a test file + reproducing code would be important.

pranayamhatre175 · 2023-05-04T15:26:55Z

Hi,
Sorry for the late reply. I have tested the PDFs in which I found the above issue with your suggested solution, i.e. FPDFText_GetUnicode(). The problem is resolved for those PDFs, but I am still testing this solution.

Is there any way to get words, sentences, and their coordinates using pypdfium2? I have tried a certain approach, but it does not work for all PDFs: for some it works, for some it does not.

One more issue I am facing: I have a PDF with 0 orientation, but the coordinates I get from get_charbox() are rotated counterclockwise by 90 degrees. Do you know what the issue could be? When I check the rotation using pypdfium2, it says 90 degrees.

mara004 · 2023-05-04T16:51:44Z

> I have tested the PDFs in which I found the above issue with your suggested solution, i.e. FPDFText_GetUnicode(). The problem is resolved for those PDFs, but I am still testing this solution.

That's strange: textpage.get_text_range(index, count=1) should provide the same result as FPDFText_GetUnicode(). I meant this merely as a stylistic improvement, and I believe it's impossible for it to fix your problem.

> Is there any way to get words, sentences, and their coordinates using pypdfium2? I have tried a certain approach, but it does not work for all PDFs: for some it works, for some it does not.

Well, there's the rectangle API (count_rects(), get_rect()), but to my understanding it may return anything between a few chars and a whole sentence, so it only groups chars by proximity, without a particular target.
Unfortunately, pdfium does not provide specific APIs to get words/sentences yet, but we recently got a similar request in #210, which resulted in a lengthy feature request upstream (https://crbug.com/pdfium/2025).
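
A rough sketch of how the rectangle API could be used is shown below. `iter_text_rects` is a hypothetical helper name, and it assumes the text page also offers get_text_bounded(left, bottom, right, top) to fetch the text inside a rectangle, which may depend on the pypdfium2 version. As said above, the resulting groups are not guaranteed to be words or sentences.

```python
from typing import Iterator, Tuple

Box = Tuple[float, float, float, float]


def iter_text_rects(textpage) -> Iterator[Tuple[str, Box]]:
    """Yield (text, box) for each text rectangle reported by the text page.

    Boxes are (left, bottom, right, top) in PDF canvas units.
    """
    # count_rects() has to be called before get_rect() can be used.
    n_rects = textpage.count_rects()
    for i in range(n_rects):
        left, bottom, right, top = textpage.get_rect(i)
        text = textpage.get_text_bounded(
            left=left, bottom=bottom, right=right, top=top
        )
        yield text, (left, bottom, right, top)


# Usage against a real file (requires pypdfium2; "example.pdf" is a placeholder):
#   import pypdfium2 as pdfium
#   textpage = pdfium.PdfDocument("example.pdf")[0].get_textpage()
#   for text, box in iter_text_rects(textpage):
#       print(repr(text), box)
```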

> One more issue I am facing: I have a PDF with 0 orientation, but the coordinates I get from get_charbox() are rotated counterclockwise by 90 degrees. Do you know what the issue could be? When I check the rotation using pypdfium2, it says 90 degrees.

Not sure if I'm interpreting your wording correctly, but I think the following: In PDF, rotation modifies the coordinate system, so coordinates refer to the non-rotated page. If you have a page defining rotation 90° and intend to get visual coordinates, you need to apply that rotation to the char box. This is part of the PDF <-> bitmap coordinate conversion mentioned above.

(Note that rotation is only a structural value that does not tell how the content is actually oriented, i.e. you can get a page with rotation 90° that nevertheless looks upright visually.)

mara004 · 2023-05-21T11:25:47Z

Since there has been no response for 2 weeks, I'm closing this for now.
@pranayamhatre175 Feel free to continue on this thread if there are any questions left.

mara004 added the needsinfo ("Additional information from the reporter is required") and question ("A user needs help or further information") labels on Apr 25, 2023

mara004 added the pdfium label ("This issue may be caused by (or related to) pdfium itself") on Apr 26, 2023

mara004 closed this as not planned (won't fix, can't repro, duplicate, stale) on May 21, 2023

samshelley mentioned this issue Jun 19, 2023

Need coordinate conversion help #228

Closed

mara004 mentioned this issue Jul 3, 2023

Correct way to deal with rotated documents for text extraction? #234

Closed
