QiqqaOCR: The Sorax PDF render library seems to take longer and longer, the higher the page number to render is. #136

Closed
GerHobbelt opened this issue Nov 5, 2019 · 2 comments
Labels
🦸‍♀️enhancement🦸‍♂️ New feature or request 🕵investigate Needs further analysis to find the root cause. ⛷performance Anything that's related to UX: speed of response; I/O speed, etc. 👮wontfix This will not be worked on

Comments

@GerHobbelt
Collaborator

The Sorax PDF render library seems to take longer and longer to render a page, the higher that page's number in the PDF is.

Smells like we're facing O(n^2) performance behaviour here, thanks to the way QiqqaOCR works? It renders one page per invocation in SINGLE mode, so Sorax's apparent O(n) cost per page adds up to roughly n^2/2 over a whole document, which makes it an O(n^2) performance hog at the Qiqqa level.
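
To spell out the arithmetic behind that estimate: assume, as hypothesised above and not yet measured, that rendering page k costs about c·k for some unknown constant c. Summing over all n pages, each rendered in its own invocation, gives:

```latex
T_{\text{total}}(n) \;=\; \sum_{k=1}^{n} c\,k \;=\; c\,\frac{n(n+1)}{2} \;\approx\; \tfrac{1}{2}\,c\,n^{2} \;\in\; O(n^{2})
```

For a 100-page range that is 5050·c in total, against 100·c if the per-page cost were constant.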

Would Sorax render costs drop if we simply grabbed all pages in the PDF at once and dumped them to image files, to be Tesseract-OCR'd in another process?
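
A toy cost model makes the question concrete. It is only a sketch under an unverified assumption: that a per-page invocation pays a cost proportional to the page number just to reach the page (Sorax is closed source, so this is a guess), while a single pass pays that cost once per page. The function names and the `seek_cost` / `render_cost` parameters are hypothetical and are not part of Sorax or QiqqaOCR.

```python
def per_page_invocations(n_pages, seek_cost=1.0, render_cost=1.0):
    """Total cost when every page is rendered by its own invocation.

    ASSUMPTION: reaching page k costs seek_cost * k in each invocation.
    """
    total = 0.0
    for page in range(1, n_pages + 1):
        total += seek_cost * page + render_cost
    return total


def single_pass(n_pages, seek_cost=1.0, render_cost=1.0):
    """Total cost when one invocation walks the document once,
    paying one seek step and one render per page."""
    return n_pages * (seek_cost + render_cost)


if __name__ == "__main__":
    # Compare both strategies for a few document sizes.
    for n in (10, 100, 1000):
        print(n, per_page_invocations(n), single_pass(n))
```

If the assumption holds, the gap between the two totals grows linearly with the page count, so a batch dump would pay off most on long documents. If Sorax's slowdown has another cause, this model says nothing about it.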

@GerHobbelt GerHobbelt changed the title from "The Sorax PDF render library seems to take longer and longer, the higher the page number to render in the PDF is." to "QiqqaOCR: The Sorax PDF render library seems to take longer and longer, the higher the page number to render is." Nov 5, 2019
@GerHobbelt GerHobbelt added ⛷performance Anything that's related to UX: speed of response; I/O speed, etc. 🕵investigate Needs further analysis to find the root cause. 🦸‍♀️enhancement🦸‍♂️ New feature or request labels Nov 5, 2019
@GerHobbelt GerHobbelt added this to the Our Glorious Future milestone Nov 5, 2019
@GerHobbelt
Collaborator Author

See also the comments on #135: check the command-line parameters listed there.

Sample test command line for QiqqaOCR, from my local dev machine, to observe this behaviour:

SINGLE "D:\Qiqqa\base\INTRANET_EF52564A-831D-42F2-B956-815CF0418C08\documents\9\90336D7E2FC44F01BE9A53E58CEB9E60B8A85.pdf" 1-100 "C:\Users\Ger\AppData\Local\Temp\\TempFile.4cb114d4-3461-4844-8345-e710bf1cb62b.txt" "" "" NOKILL

Note the page range (1-100) in that command line: page ranges have been available for debugging QiqqaOCR since commit SHA-1: a758657
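
With a page range available, one way to check the "cost grows with page number" suspicion is to time each page separately and fit a straight line through the samples. The sketch below shows only the measurement and the fit. `render_page` is a hypothetical callable that you would have to supply yourself, for example a wrapper that runs QiqqaOCR on a single page; the sketch does not model the actual invocation.

```python
import time


def time_pages(render_page, pages):
    """Call render_page(page) for each page; return [(page, seconds)]."""
    samples = []
    for page in pages:
        start = time.perf_counter()
        render_page(page)  # hypothetical: whatever renders exactly one page
        samples.append((page, time.perf_counter() - start))
    return samples


def fit_line(samples):
    """Least-squares fit of seconds = slope * page + intercept."""
    n = len(samples)
    mean_x = sum(p for p, _ in samples) / n
    mean_y = sum(t for _, t in samples) / n
    sxx = sum((p - mean_x) ** 2 for p, _ in samples)
    if sxx == 0:
        raise ValueError("need samples from at least two different pages")
    sxy = sum((p - mean_x) * (t - mean_y) for p, t in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

A slope clearly above zero would support the linear per-page cost, and with it the quadratic total. A slope near zero would point to a different cause.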

@GerHobbelt GerHobbelt added the 👮wontfix This will not be worked on label Apr 27, 2020
@GerHobbelt
Collaborator Author

Won't fix this one.

I'm rather more intent on kicking out Sorax and going with MuPDF (and MuPDFSharp), as that's open source instead of closed source and so easier to dive into and debug. Sorax, as far as I am concerned, is a dead end that's only needed until I have plugged in MuPDF for page rendering.

MuPDF has its own speed problems: they are different, but they do exist: https://bugs.ghostscript.com/show_bug.cgi?id=701945#c2
