QiqqaOCR: The Sorax PDF render library seems to take longer and longer, the higher the page number to render is. #136

GerHobbelt · 2019-11-05T20:41:07Z

The Sorax PDF render library seems to take longer and longer, the higher the page number to render in the PDF is.

Smells like we're facing O(n^2) performance behaviour here, thanks to the way QiqqaOCR works: one page per invocation in SINGLE mode, so Sorax's apparent O(n) cost per page adds up to roughly 1/2 n^2 over the whole document, i.e. an O(n^2) performance hog at the Qiqqa level.
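To make the arithmetic behind that estimate explicit, here is a toy cost model. It is not a measurement of Sorax: it simply assumes that rendering page k in a fresh invocation costs k units, which is the "apparent O(n)" behaviour guessed at above. The function names are made up for illustration.

```python
# Toy cost model, not a measurement: assume rendering page k in its own
# invocation costs k units (the assumed "apparent O(n)" per-page behaviour).


def per_page_invocation_cost(n_pages: int) -> int:
    """Total cost when each of the pages 1..n is rendered in its own invocation."""
    return sum(k for k in range(1, n_pages + 1))


def closed_form(n_pages: int) -> int:
    """Same total via the arithmetic series n(n+1)/2, i.e. roughly n^2/2."""
    return n_pages * (n_pages + 1) // 2


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # Both columns agree; a 10x longer document costs about 100x more.
        print(n, per_page_invocation_cost(n), closed_form(n))
```

Under that assumption a 100-page PDF costs 5050 units and a 1000-page PDF costs 500500, so ten times the pages means roughly a hundred times the total render time.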

Would Sorax render costs drop if we simply grabbed all pages in the PDF at once and dumped them to image files, to be Tesseract-OCR'd in another process?

GerHobbelt · 2019-11-05T20:53:50Z

See also #135 comments: check the commandline parameters listed there.

Local dev machine sample test commandline for QiqqaOCR to observe this behaviour:

SINGLE "D:\Qiqqa\base\INTRANET_EF52564A-831D-42F2-B956-815CF0418C08\documents\9\90336D7E2FC44F01BE9A53E58CEB9E60B8A85.pdf" 1-100 "C:\Users\Ger\AppData\Local\Temp\\TempFile.4cb114d4-3461-4844-8345-e710bf1cb62b.txt" "" "" NOKILL

Note the page range in that commandline: this is available for debugging QiqqaOCR since commit SHA-1: a758657
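To check whether the time per invocation really grows with the page number, a small timing harness along these lines could be used. This is a sketch under assumptions: the executable path, the helper names `build_args` and `time_ranges`, and any page range other than `1-100` are hypothetical; only the argument order is copied from the sample commandline above.

```python
import subprocess
import time
from typing import Callable, Iterable, List, Tuple


def build_args(exe: str, pdf: str, page_range: str, out_txt: str) -> List[str]:
    # Argument order copied from the sample commandline above:
    #   SINGLE <pdf> <page range> <output txt> "" "" NOKILL
    # `exe` is an assumed path to the QiqqaOCR executable.
    return [exe, "SINGLE", pdf, page_range, out_txt, "", "", "NOKILL"]


def time_ranges(
    exe: str,
    pdf: str,
    out_txt: str,
    page_ranges: Iterable[str],
    run: Callable[[List[str]], object] = subprocess.run,
) -> List[Tuple[str, float]]:
    """Run one invocation per page range and record wall-clock seconds for each."""
    timings = []
    for page_range in page_ranges:
        start = time.perf_counter()
        run(build_args(exe, pdf, page_range, out_txt))
        timings.append((page_range, time.perf_counter() - start))
    return timings
```

Feeding it ranges that start at increasingly high page numbers and comparing the recorded times should show whether the cost per invocation climbs with the page number. The `run` parameter is only there so the harness can be exercised without the real executable.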

GerHobbelt · 2020-04-27T17:42:28Z

Won't fix this one.

I'm rather more intent on kicking out Sorax and going with MuPDF (and MuPDFSharp) as that's open source instead of closed source, so easier to dive in and debug. Sorax, as far as I am concerned, is a dead end that's needed until I have plugged in MuPDF for page rendering.

MuPDF has speed problems of its own, though they are different ones: https://bugs.ghostscript.com/show_bug.cgi?id=701945#c2

GerHobbelt changed the title ~~The Sorax PDF render library seems to take longer and longer, the higher the page number to render in the PDF is.~~ QiqqaOCR: The Sorax PDF render library seems to take longer and longer, the higher the page number to render is. Nov 5, 2019

GerHobbelt added the ⛷performance, 🕵investigate and 🦸‍♀️enhancement🦸‍♂️ labels Nov 5, 2019

GerHobbelt added this to the Our Glorious Future milestone Nov 5, 2019

GerHobbelt added the 👮wontfix label Apr 27, 2020

GerHobbelt closed this as completed Apr 27, 2020

GerHobbelt mentioned this issue Apr 27, 2020

Kick out Sorax closed source commercial library which is not updated unless I put down $$$ for them. #209

Closed

This was referenced Feb 6, 2021

"unable to open database file", intranet library, is read only? #257

Open

Cant highlight text #295

Closed

GerHobbelt mentioned this issue Feb 27, 2021

Qiqqa error pops up "unexpected problem in qiqqa" v83.0.7656.6401 - I sent you zipped logs to email #304

Open
