qiqqa crashing when generating autotags #283

Open
quissicks opened this issue Jan 3, 2021 · 7 comments
Labels
🐛bug Something isn't working

@quissicks

quissicks commented Jan 3, 2021

Happy New Year to the qiqqa community! I have a very large library. I am running version V82.0.7579.33985. It crashes when I try to generate autotags.

@GerHobbelt
Collaborator

GerHobbelt commented Jan 3, 2021 via email

@GerHobbelt
Collaborator

Hi Chris,

Finally took time to inspect your logfiles earlier today. Still going through them as there's some other stuff in there that hints at other trouble. Anyway, we'll get to that.

From what I can see in the logfiles, the root cause is (with high probability) the auto tag processing (and not something happening in the background that "just happens at the same time"). The out-of-memory failure happens inside the LuceneNET library code while that code is busy updating the search index with the new autotags which are attached to each document. (The LuceneNET search index processes all PDF document texts plus all PDF text-based metadata: tags, BibTeX, title, etc.)

Thank you very much for sending the bundled logfiles; I'll bother you with a few more requests if that's okay:

Aside

It's not related to this issue, but I noticed a bunch of PDFs producing 'irregular' log output during OCR/text background processing for the search index updates, which translates to:

  • you seem to have several PDFs in your collection which would be good to have as test cases for further Qiqqa PDF/OCR work; these would then end up in the large github test set repository at https://github.com/GerHobbelt/Evil-PDF-Library-for-Qiqqa
  • some of those 'irregular' PDFs produce a 'nil' output, which means Qiqqa has been unable to extract any text from them or no text from a limited set of pages (which can legally happen when those pages only carry graphics, such as charts, photos or schematics)
  • at the time of writing there was one (1) very unexpected out-of-page-range request "which should never have happened": it's not harmful, but it indicates a PDF apparently triggering some faulty internal behaviour that I haven't seen before and that needs looking into.

I'd like to have a look at those PDFs when time allows, if that's okay.


Back to the issue at hand

The short of it is that I don't have a quick fix for the problem right now.

Memory management in .NET applications isn't easy stuff; I'm considering how to tackle this sooner than my intended end result: Qiqqa in 64 bit with upgraded libraries. (#289, section "How much .NET memory is gobbled up by the Lucene search databases in current Qiqqa?")

From what I can see so far, the problem is caused by all the LuceneNET activity resulting from the set of AutoTags discovered and assigned to the documents. 🤔 Thinking about how to approach this problem and reduce the memory pressure in the application.

Current questions for you (@quissicks)

  • What's the total number of documents in your libraries?

    No need to be exact down to the last item; a range like 'between 40K and 42K documents' will do. I'm wondering whether my own libraries are sufficiently large to be useful for testing the issue you're experiencing, or whether I need to build a larger library to help inspect memory pressure in .NET.

  • The request for particular PDFs will follow later as a single batch to keep that separate.

@GerHobbelt GerHobbelt added this to the v82 milestone Jan 10, 2021
GerHobbelt added a commit to GerHobbelt/qiqqa-open-source that referenced this issue Jan 13, 2021
…en* exactly these out-of-bounds requests occur - as this was discovered in customer log files during problem analysis of jimmejardine#283
@GerHobbelt
Collaborator

@quissicks : Hi Chris,

There's a new (test) release published at https://github.com/GerHobbelt/qiqqa-open-source/releases/tag/v83.0.7649.30836 ; see description there. You can simply install it over your existing Qiqqa; if you want to revert to another Qiqqa version, you can install that version over the new one without trouble.

See also the last comment at #288 (the other issue this release is targeting) and the screenshot of the startup dialog there: that dialog is not meant for your situation, so this is only for your awareness. In your case I'm particularly interested in the new log files.

Regrettably I haven't been able to do anything serious about memory reduction yet. I have a few observations, also from my own testing, but it's pretty tough to pinpoint the culprits. Well, technically it's more accurate to say that the culprits are easily found in a memory profiler, but the big hurdle is coming up with ways to alleviate the memory pressure there: it's all the documents, which load their metadata into memory at the first "opportunity" where it is needed (e.g. when analyzing metadata in the background for auto-tagging, checking the indexing, etc.), and then Qiqqa isn't smart about it and doesn't know how to, say, "throw away" this data when the acute need for it has gone.

Plus there's the curious observation in my own tests that there are apparently more PDF document 'instances' in memory than I have PDF documents in all the libraries, so that's another ho-hum-hum to research. That one has to be tested with a very small library (or set of libraries) to see if I can reproduce the 'too many' situation there and find out where it originates; doing that in a huge library is too cumbersome.
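To give an idea of how the 'more instances than documents' suspicion could be checked on a small library, here is a minimal sketch. It is written in Python purely for illustration (Qiqqa itself is .NET), and every name in it (`PdfDocumentRecord`, `fingerprint`, `live_count`, `duplicated_fingerprints`) is hypothetical, not taken from the Qiqqa code base. The idea is to register each instance through a weak reference, so the bookkeeping itself never keeps anything alive, and then compare the number of live instances against the number of documents you expect.

```python
import gc
import weakref


class PdfDocumentRecord:
    """Hypothetical stand-in for an in-memory PDF document object."""

    # Weak references do not keep instances alive, so this set only ever
    # contains objects that some other part of the program still holds.
    _live = weakref.WeakSet()

    def __init__(self, fingerprint: str) -> None:
        self.fingerprint = fingerprint
        PdfDocumentRecord._live.add(self)

    @classmethod
    def live_count(cls) -> int:
        """Number of instances that are still reachable right now."""
        gc.collect()  # clear out unreachable reference cycles first
        return len(cls._live)

    @classmethod
    def duplicated_fingerprints(cls) -> dict:
        """Map each fingerprint held by more than one live instance to its count."""
        gc.collect()
        counts = {}
        for doc in list(cls._live):
            counts[doc.fingerprint] = counts.get(doc.fingerprint, 0) + 1
        return {fp: n for fp, n in counts.items() if n > 1}
```

If `live_count()` exceeds the number of documents in the library, `duplicated_fingerprints()` shows which documents have been instantiated more than once, which is a starting point for finding the code path that creates the extra copies.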

Anyway, this is just so you get a bit of a feel for what's been seen and know that work is being done. I cannot predict results yet, as I'm still in the 'finding out what's going on exactly' phase, while also realizing that there's some serious refactoring required if I must detect high memory pressure and 'discard' old-ish metadata. That metadata isn't timestamped yet, as these are all persistent stores, not 'caches' in the usual sense, where stuff comes in, gets a timestamp that's tracked and refreshed based on usage, and is then killed off when the cached stuff 'expires'.
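For the curious, the 'cache in the usual sense' described above could look roughly like the sketch below. This is Python for illustration only, not Qiqqa code: `ExpiringCache`, `loader`, `ttl_seconds` and `evict_expired` are all made-up names, and the assumption is that anything evicted can always be reloaded from the persistent store.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ExpiringCache:
    """Keeps values in memory only while they are being used (hypothetical sketch)."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader      # reloads a value from the persistent store
        self._ttl = ttl_seconds    # how long an untouched entry may stay in memory
        self._clock = clock        # injectable so the behaviour can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        entry = self._entries.get(key)
        # Load on first use, or again after the entry has been evicted.
        value = self._loader(key) if entry is None else entry[1]
        # Every access refreshes the timestamp.
        self._entries[key] = (self._clock(), value)
        return value

    def evict_expired(self) -> int:
        """Drop entries not touched within the TTL; return how many were dropped."""
        now = self._clock()
        stale = [k for k, (stamp, _) in self._entries.items()
                 if now - stamp > self._ttl]
        for k in stale:
            del self._entries[k]
        return len(stale)

    def __len__(self) -> int:
        return len(self._entries)
```

In this sketch `evict_expired()` would be called periodically or when memory pressure is detected; how to detect that pressure reliably is a separate question and is not shown here.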

No matter, ignore if that's too geeky for you 😅

Have a go at the new version if you like; I'd be happy to see another set of logfiles. Thanks!

By The Way

Apologies for any 'rough edges' in the new release; I pushed it out so it's here today and not, say, Friday or later. Real life and all that jazz. Ciao!

@quissicks
Author

quissicks commented Jan 13, 2021 via email

@GerHobbelt
Collaborator

Quick heads up: new release to try: https://github.com/GerHobbelt/qiqqa-open-source/releases/tag/v83.0.7655.37537

Please report anything you observe with the new release. Thanks!

@quissicks
Author

quissicks commented Jan 16, 2021 via email

@GerHobbelt
Collaborator

Quick heads up: hotfix release to try: https://github.com/GerHobbelt/qiqqa-open-source/releases/tag/v83.0.7656.6401 (which fixes a known issue in the previous release, https://github.com/GerHobbelt/qiqqa-open-source/releases/tag/v83.0.7655.37537)

Please report anything you observe with the new release. Thanks!

GerHobbelt added a commit to GerHobbelt/qiqqa-open-source that referenced this issue Jan 17, 2021
…I thread; the lib uses COM under the hood, which requires a working and accessible Windows message pipe, something which only the UI thread can provide.

- littered the code with WPFDoEvents UI/not-UI assertions, which caught the above scenario in a Dispose() for a page image render. And that was the hint needed to progress a little further towards stability: it was SORAX which caused a *lot* of the out-of-memory failures due to crazy COM/WPF/UI failures, even for smaller libraries under test.
- fix a bit of an odd crash in the Lucene flush/cleanup during shutdown, where Lucene kept busy with 'optimizing the index' while a quick application termination was happening in the background, resulting in a lockup and then a crash.
- this MAY be a fix for the reported "number of documents reported not matching reality": added update/refresh code to update the library list panel when PDF documents are added in the background via FolderWatcher or other means (async library loading). WARNING: this code is still incomplete/buggy!
- most UI assertions have been covered now. Keeping them anyway as this is hairy stuff and should be tested more.

Addresses (but is not guaranteed to fix) jimmejardine#290, jimmejardine#283, jimmejardine#281, jimmejardine#280, jimmejardine#243
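
The UI/not-UI assertions mentioned in the commit message above follow a common pattern: remember which thread is the UI thread, then have thread-sensitive code check at its entry point that it is running where it should. A minimal sketch of that pattern follows, in Python for illustration only. The `ThreadAffinity` class and its methods are hypothetical and are not the WPFDoEvents API.

```python
import threading
from typing import Optional


class ThreadAffinity:
    """Remembers which thread counts as the 'UI thread' and checks callers against it."""

    def __init__(self, ui_thread: Optional[threading.Thread] = None) -> None:
        # Default to the thread that creates this object.
        self._ui_ident = (ui_thread or threading.current_thread()).ident

    def on_ui_thread(self) -> bool:
        return threading.get_ident() == self._ui_ident

    def assert_ui_thread(self) -> None:
        """Guard for code that needs the UI thread (e.g. anything relying on its message loop)."""
        if not self.on_ui_thread():
            raise AssertionError("must be called on the UI thread")

    def assert_not_ui_thread(self) -> None:
        """Guard for slow background work that must never block the UI."""
        if self.on_ui_thread():
            raise AssertionError("must not be called on the UI thread")
```

Placing such a guard at the top of a cleanup or render routine makes a wrong-thread call fail immediately at the call site, rather than surfacing later as an unrelated-looking failure.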