PDF OCR errors to investigate... #25

Closed
GerHobbelt opened this issue Aug 6, 2019 · 2 comments
Labels
🐛bug Something isn't working 🕵investigate Needs further analysis to find the root cause.

Comments

@GerHobbelt
Collaborator

I've seen all sorts of things go wrong with PDFs, but these should be dealt with more or less sensibly (as far as you can deal sensibly with a garbage input).

At least Qiqqa & QiqqaOCR MUST NOT crash, lock up or otherwise b0rk on bad input PDFs.
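A minimal sketch of the "must not crash or lock up" requirement, assuming the OCR step runs as a separate process (as the log below suggests QiqqaOCR does). It is written in Python purely for illustration; Qiqqa itself is C#, and `run_ocr_safely` and `OcrResult` are hypothetical names, not part of the Qiqqa code base. The idea is that a timeout, a failed launch or a non-zero exit code each becomes an ordinary result value the caller can log and skip past:

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class OcrResult:
    """Outcome of one external OCR run (hypothetical type, for illustration only)."""
    ok: bool
    reason: str
    stdout: str = ""
    stderr: str = ""


def run_ocr_safely(command, timeout_seconds=60.0):
    """Run an external command and always return an OcrResult.

    Bad input must not hang or crash the caller, so every failure mode
    is mapped to a result value instead of an exception.
    """
    try:
        done = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process when the timeout expires.
        return OcrResult(False, "timeout")
    except OSError as exc:
        # The executable is missing or could not be started.
        return OcrResult(False, f"could not start: {exc}")

    if done.returncode != 0:
        return OcrResult(False, f"exit code {done.returncode}", done.stdout, done.stderr)
    return OcrResult(True, "ok", done.stdout, done.stderr)


if __name__ == "__main__":
    # Stand-in for a tool that fails on a bad PDF.
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_ocr_safely(failing).reason)
```

With a wrapper along these lines, a failure like the one in the log would be recorded against the affected page range while the rest of the library keeps processing.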

Debug log extract:

20190804.204323 ERROR [PDFTextExtractor] There was a problem while running OCR with parameters: GROUP "D:\Qiqqa\base\Guest\documents\9\97F3C6565FB76E8DF535D150CB43D08EC2E62517.pdf" 301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320 "C:\Users\Ger\AppData\Local\Temp\\TempFile.dd76b1ad-93ef-44fe-93e0-f5d0fa5f25ec.txt" "" ""
20190804.204323 INFO  [PDFTextExtractor] Parameters: GROUP "D:\Qiqqa\base\Guest\documents\9\97F3C6565FB76E8DF535D150CB43D08EC2E62517.pdf" 301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320 "C:\Users\Ger\AppData\Local\Temp\\TempFile.dd76b1ad-93ef-44fe-93e0-f5d0fa5f25ec.txt" "" ""
20190804.204323 INFO  [PDFTextExtractor] --- Standard output:
20190804.204322 INFO  [Main] Logging initialised
20190804.204322 INFO  [Main] Starting the text extract thread
20190804.204323 INFO  [Main] Both text extract and OCR have exited, so exiting
20190804.204323 ERROR [Main] There was an error in QiqqaOCR:
--- Parameters ---
GROUP D:\Qiqqa\base\Guest\documents\9\97F3C6565FB76E8DF535D150CB43D08EC2E62517.pdf 301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320 C:\Users\Ger\AppData\Local\Temp\\TempFile.dd76b1ad-93ef-44fe-93e0-f5d0fa5f25ec.txt   
--- Exception ---
System.Exception: We have no wordlist to write!
   at QiqqaOCR.TextExtractEngine.MainEntry(String[] args, Boolean no_kill) in W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\qiqqa\QiqqaOCR\TextExtractEngine.cs:line 111
   at QiqqaOCR.Program.Main(String[] args) in W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\qiqqa\QiqqaOCR\Program.cs:line 46


System.Exception: We have no wordlist to write!
   at QiqqaOCR.TextExtractEngine.MainEntry(String[] args, Boolean no_kill) in W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\qiqqa\QiqqaOCR\TextExtractEngine.cs:line 111
   at QiqqaOCR.Program.Main(String[] args) in W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\qiqqa\QiqqaOCR\Program.cs:line 46

--- Standard error:


@GerHobbelt
Collaborator Author

Related: #74

@GerHobbelt added the 🐛bug, 🤔question and 🕵investigate labels and removed the 🤔question label on Oct 4, 2019
@GerHobbelt added this to the v82 milestone on Oct 9, 2019
@GerHobbelt
Collaborator Author

The last few weeks have seen a lot of activity to make the code more robust. That part of this issue is therefore considered fixed.

The crashes that still happen are due to corrupted PDFs and other PDF-library-related issues. The remainder of this issue is tracked in #95.

Closing this as the work has been done in the dev repo for the v82 release: https://github.com/GerHobbelt/qiqqa-open-source/tree/v82-build
