How to extract character with coordinates from PDF using pypdfium2 ? #214

Closed
pranayamhatre175 opened this issue Apr 25, 2023 · 6 comments
Labels
needsinfo (Additional information from the reporter is required), pdfium (This issue may be caused by (or related to) pdfium itself), question (A user needs help or further information)

Comments

@pranayamhatre175

pranayamhatre175 commented Apr 25, 2023

Hi,

I am trying to extract each character and its coordinates as follows:

  1. I obtain each index by iterating up to textpage.count_chars().
  2. To get the coordinates, I call textpage.get_charbox(index).
  3. To get the character, I call textpage.get_text_range(index, count=1).

For some PDFs, this method gives me wrong coordinates for characters. Is this the correct way to extract characters and their coordinates? If so, why am I getting wrong coordinates in some PDFs?

If this is the wrong approach, could you please point me to the right method?

@mara004
Member

mara004 commented Apr 25, 2023

That sounds like correct usage. Concerning point 3, if you only want to extract single chars, you can also use FPDFText_GetUnicode() from pypdfium2.raw (raw pdfium API).
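
For reference, the approach described above can be sketched roughly as follows. This is a minimal sketch, not an official recipe: `iter_chars` and `char_via_raw` are hypothetical helper names, and `char_via_raw` assumes a pypdfium2 version that provides `pypdfium2.raw` and exposes the underlying handle as `textpage.raw`.

```python
def iter_chars(textpage):
    """Yield (char, (left, bottom, right, top)) for every char on a text page.

    `textpage` is expected to behave like a pypdfium2 text page, i.e. to
    offer count_chars(), get_text_range() and get_charbox().
    """
    for index in range(textpage.count_chars()):
        char = textpage.get_text_range(index=index, count=1)
        box = textpage.get_charbox(index)
        yield char, box


def char_via_raw(textpage, index):
    """Look up a single char through the raw pdfium API instead.

    Assumption: `textpage.raw` is the underlying FPDF_TEXTPAGE handle.
    """
    import pypdfium2.raw as pdfium_c  # imported lazily, only needed here

    return chr(pdfium_c.FPDFText_GetUnicode(textpage.raw, index))


# Usage (needs pypdfium2 and a real file; "sample.pdf" is a placeholder):
#   import pypdfium2 as pdfium
#   pdf = pdfium.PdfDocument("sample.pdf")
#   textpage = pdf[0].get_textpage()
#   for char, box in iter_chars(textpage):
#       print(repr(char), box)
```

Both helpers should yield the same characters; only the boxes come from get_charbox().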

Can you share a minimal reproducible example where you believe get_charbox() returns wrong coordinates?

In general, get_charbox() returns values for (left, bottom, right, top) in PDF canvas units.
Note that translating PDF coordinates into bitmap coordinates (and vice versa) takes more than moving the origin (bottom left <-> top left): rotation, a non-zero page box origin, and other factors also need to be taken into account. See the section on coordinate systems in the PDF specification. I recommend using FPDF_PageToDevice() / FPDF_DeviceToPage() for that operation.
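
To illustrate the simplest part of that conversion, here is a minimal sketch that covers only the origin flip, a non-zero page box origin, and a uniform render scale. It assumes a page rotation of 0, so it is not a replacement for FPDF_PageToDevice(). `charbox_to_bitmap` and its parameters are hypothetical names, not part of pypdfium2.

```python
def charbox_to_bitmap(box, page_box, scale=1.0):
    """Convert a PDF-space box to bitmap pixel coordinates.

    box:      (left, bottom, right, top), e.g. from get_charbox()
    page_box: (left, bottom, right, top) of the rendered page area,
              e.g. the crop box (its origin need not be (0, 0))
    scale:    pixels per PDF canvas unit used when rendering

    Returns (left, top, right, bottom) with a top-left origin.
    Assumption: the page rotation is 0.
    """
    left, bottom, right, top = box
    page_left, _page_bottom, _page_right, page_top = page_box
    return (
        (left - page_left) * scale,   # shift by the page box origin
        (page_top - top) * scale,     # flip the y axis: PDF top becomes pixel top
        (right - page_left) * scale,
        (page_top - bottom) * scale,
    )


# A char near the top of a US Letter page, rendered at scale 2:
print(charbox_to_bitmap((72, 700, 80, 712), (0, 0, 612, 792), scale=2))
```

The same char on a page whose box starts at (10, 20) instead of (0, 0) maps to the same pixels only if that offset is subtracted first, which is the "non-zero origin" case mentioned above.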

mara004 added the needsinfo and question labels on Apr 25, 2023
@mara004
Member

mara004 commented Apr 26, 2023

Hmm, I browsed pdfium's bug tracker a bit, and by chance found the following two bug reports that seem related:
https://bugs.chromium.org/p/pdfium/issues/detail?id=1821
https://bugs.chromium.org/p/pdfium/issues/detail?id=1562

So this might be an actual bug in pdfium. However, as mentioned above, a test file and reproducing code would be important.

mara004 added the pdfium label on Apr 26, 2023
@mara004

This comment was marked as outdated.

@pranayamhatre175
Author

Hi,
Sorry for the late reply. I have tested the PDFs in which I found the above issue with the solution you suggested, i.e. FPDFText_GetUnicode(). The problem is resolved for those PDFs, but I am still testing this solution.

Is there any way to get words and sentences along with their coordinates using pypdfium2? I have tried a few approaches, but none works for all PDFs: they work for some and fail for others.

I am facing one more issue: I have a PDF with 0 orientation, but the coordinates I get from get_charbox() are rotated counterclockwise by 90 degrees. Do you know what the issue might be? When I check the rotation using pypdfium2, it reports 90 degrees.

@mara004
Member

mara004 commented May 4, 2023

> I have tested the PDFs in which I found the above issue with the solution you suggested, i.e. FPDFText_GetUnicode(). The problem is resolved for those PDFs, but I am still testing this solution.

That's strange: textpage.get_text_range(index, count=1) should provide the same result as FPDFText_GetUnicode(). I meant this merely as a stylistic improvement and don't believe it can fix your problem.

> Is there any way to get words and sentences along with their coordinates using pypdfium2? I have tried a few approaches, but none works for all PDFs: they work for some and fail for others.

Well, there's the rectangle API (count_rects(), get_rect()), but to my understanding it may return anything between a few chars and a whole sentence, so it only groups chars by proximity, without targeting a particular unit such as words or sentences.
Unfortunately, pdfium does not provide specific APIs to get words/sentences yet, but we recently got a similar request in #210, which resulted in a lengthy feature request upstream (https://crbug.com/pdfium/2025).
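
Until such an API exists, words can be approximated on the caller's side. The sketch below shows two options under stated assumptions: `iter_rects` walks the rectangle API mentioned above and reads the text inside each rectangle with get_text_bounded(), and `group_words` is a plain heuristic that splits a sequence of (char, box) pairs on whitespace and merges the boxes. Both function names are hypothetical, and the heuristic only works if the text layer actually contains whitespace characters between words.

```python
def iter_rects(textpage):
    """Yield (text, (left, bottom, right, top)) for each text rectangle.

    `textpage` is expected to behave like a pypdfium2 text page.
    count_rects() is called first, then get_rect() for each index.
    """
    for index in range(textpage.count_rects()):
        left, bottom, right, top = textpage.get_rect(index)
        text = textpage.get_text_bounded(
            left=left, bottom=bottom, right=right, top=top
        )
        yield text, (left, bottom, right, top)


def union_boxes(boxes):
    """Smallest (left, bottom, right, top) box enclosing all given boxes."""
    lefts, bottoms, rights, tops = zip(*boxes)
    return (min(lefts), min(bottoms), max(rights), max(tops))


def group_words(chars):
    """Group (char, box) pairs into (word, box) pairs, splitting on whitespace.

    Heuristic only: this is not a pdfium feature, and it ignores line
    breaks without whitespace, hyphenation, and rotated text.
    """
    words = []
    text, boxes = "", []
    for char, box in chars:
        if char.isspace():
            if text:
                words.append((text, union_boxes(boxes)))
            text, boxes = "", []
        else:
            text += char
            boxes.append(box)
    if text:  # flush the last word
        words.append((text, union_boxes(boxes)))
    return words
```

The input to `group_words` can be any iterable of (char, box) pairs, for example one built from get_text_range() and get_charbox() per index.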

> I am facing one more issue: I have a PDF with 0 orientation, but the coordinates I get from get_charbox() are rotated counterclockwise by 90 degrees. Do you know what the issue might be? When I check the rotation using pypdfium2, it reports 90 degrees.

I'm not sure I'm interpreting your wording correctly, but here is what I think: in PDF, page rotation transforms the coordinate system for display, so the coordinates you get refer to the non-rotated page. If you have a page defining a rotation of 90° and intend to get visual coordinates, you need to apply that rotation to the char box. This is part of the PDF <-> bitmap coordinate conversion mentioned above.

(Note that rotation is only a structural value that does not tell how the content is actually oriented, i.e. you can get a page with rotation 90° that nevertheless looks upright.)
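
As a rough illustration of applying the rotation to a char box, the sketch below maps a box from the non-rotated page to the page as displayed. It is a sketch under assumptions, not pdfium's implementation: `rotate_box` is a hypothetical helper, the rotation is taken as clockwise in multiples of 90° (as the PDF rotation entry is defined), the page box is assumed to start at (0, 0), and width and height are those of the non-rotated page.

```python
def rotate_box(box, rotation, width, height):
    """Map (left, bottom, right, top) from the non-rotated page to the
    page as displayed, keeping a bottom-left origin.

    rotation:      clockwise page rotation in degrees (0, 90, 180 or 270)
    width, height: size of the non-rotated page in PDF canvas units
    Assumption:    the page box starts at (0, 0).
    """
    def rotate_point(x, y):
        if rotation == 0:
            return x, y
        if rotation == 90:
            return y, width - x
        if rotation == 180:
            return width - x, height - y
        if rotation == 270:
            return height - y, x
        raise ValueError("rotation must be 0, 90, 180 or 270")

    left, bottom, right, top = box
    x0, y0 = rotate_point(left, bottom)
    x1, y1 = rotate_point(right, top)
    # Rotation changes which corner is bottom-left, so sort the values again.
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


# A 40x20 box on a 200x100 page becomes a 20x40 box at rotation 90:
print(rotate_box((10, 20, 50, 40), 90, 200, 100))
```

With rotation 90 the box from the lower left of the non-rotated page ends up near the upper left of the displayed page, which matches a page being turned clockwise.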

@mara004
Member

mara004 commented May 21, 2023

Since there has been no response for 2 weeks, I'm closing this for now.
@pranayamhatre175 Feel free to continue on this thread if there are any questions left.
