How to extract character with coordinates from PDF using pypdfium2 ? #214
Comments
That sounds like correct usage. Concerning point 3, if you only want to extract single chars, you can also use Can you share a minimal reproducible example where you believe In general,
Hmm, I browsed pdfium's bug tracker a bit, and by chance found the following two bug reports that seem related: So possibly this might be an actual bug in pdfium. However, as mentioned above, a test file + reproducing code would be important.
Hi, is there any way to get words, sentences, and their coordinates using pypdfium2? I have tried a certain way, but it does not work for all PDFs: for some it works, for some it does not. One more issue I am facing: I have a PDF with 0 orientation, but I am getting coordinates rotated counterclockwise by 90 degrees when using get_charbox(). Do you know what the issue could be? I checked the rotation with pypdfium2, and it reports 90 degrees.
That's strange,
Well, there's the rectangle API (
Not sure if I'm interpreting your wording correctly, but I think the following: In PDF, rotation modifies the coordinate system, so coordinates refer to the non-rotated page. If you have a page defining rotation 90° and intend to get visual coordinates, you need to apply that rotation to the char box. This belongs to the pdf <-> bitmap coordinate conversion mentioned above. (Note that rotation is only a structural value that does not tell how the content is actually oriented, i.e. you can get a page with rotation 90° but visually it may look upright.)
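To illustrate what "applying the rotation to the char box" could look like, here is a minimal sketch in plain Python. It is not part of pypdfium2: `rotate_box` is a hypothetical helper, and it assumes that the box is given as (left, bottom, right, top) in the non-rotated page space, that the page origin is at the bottom-left corner (0, 0) with no crop box offset, that the width and height passed in are those of the non-rotated page, and that rotation is clockwise in multiples of 90°, as in the PDF convention.

```python
def rotate_box(box, page_width, page_height, rotation):
    """Map a (left, bottom, right, top) box from non-rotated page space
    to the space of the page as displayed with a clockwise rotation.

    page_width / page_height are the dimensions of the NON-rotated page.
    The result still uses a bottom-left origin.
    """
    rotation %= 360

    def rotate_point(x, y):
        if rotation == 0:
            return x, y
        if rotation == 90:
            # displayed page is page_height wide and page_width tall
            return y, page_width - x
        if rotation == 180:
            return page_width - x, page_height - y
        if rotation == 270:
            return page_height - y, x
        raise ValueError("rotation must be a multiple of 90 degrees")

    left, bottom, right, top = box
    x0, y0 = rotate_point(left, bottom)
    x1, y1 = rotate_point(right, top)
    # Corners may swap roles after rotating, so re-normalize the box.
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


# Example: a 100 x 200 page (non-rotated size) displayed with 90° rotation.
print(rotate_box((10, 20, 30, 40), 100, 200, 90))  # (20, 70, 40, 90)
```

The result is still in PDF-style coordinates (y grows upwards). Converting to bitmap pixels would additionally need a scale factor and a vertical flip, which are left out here.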
Since there has been no response for 2 weeks, I'm closing this for now.
Hi,
I am trying to extract characters and their coordinates in the following way:
For some PDFs I am getting wrong character coordinates with the above method. Am I using the correct way to extract characters and coordinates? If yes, why am I getting wrong coordinates for characters in some PDFs?
If I am using the wrong way to extract characters and coordinates, could you please provide the right method?