validate is slow or runs out of memory when validating a bundle #826

Closed
kbowley-asu opened this issue Feb 12, 2024 · 22 comments · Fixed by #845

@kbowley-asu

Checked for duplicates

No - I haven't checked

🐛 Describe the bug

When I attempt to validate the labels for a bundle of ~95k items using --skip-content-validation, validation gets slower and slower the longer it runs. We needed to write a custom script to launch validate in order to keep it from running out of heap memory.

🕵️ Expected behavior

I expected validate to run in a predictable amount of time and to not run out of heap memory while doing so.

📜 To Reproduce

  1. Have a large dataset being prepared for release
  2. Run validate with the rule pds4.bundle and --skip-content-validation (because ain't nobody got time for that); a representative invocation is sketched after these steps
  3. Watch the world burn... or at least get depressed while staring at deadlines
    ...
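A representative invocation, assuming the usual -R/--rule and -t/--target options (the target path here is illustrative):

```console
validate -R pds4.bundle --skip-content-validation -t /path/to/bundle/bundle_label.xml
```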

🖥 Environment Info

  • Version of this software: 3.4.1
  • Operating system: Linux; Java: OpenJDK 11.0.21
    ...

📚 Version of Software Used

No response

🩺 Test Data / Additional context

No response

🦄 Related requirements

No response

⚙️ Engineering Details

No response

@lylehuber

lylehuber commented Feb 13, 2024

Same problem here. Even running it in the background, it hogs all the CPU. (Though in my case I had not turned off content validation and was not using pds4.bundle.)

@jordanpadams
Member

@kbowley-asu @lylehuber apologies for the performance degradation. The latest version of validate included several updates that greatly improve referential integrity checking, but they may have negatively impacted performance. We will do some performance benchmarking and hopefully find and resolve the issue. In the meantime, you may want to increase the Java memory allocation in the validate or validate.bat launcher scripts. See the "java.lang.OutOfMemoryError" section of our Common Errors documentation for how to update the memory: https://nasa-pds.github.io/validate/operate/errors.html
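For example (an illustrative sketch of the launcher edit; keep whatever else is already on the java line in your copy, and pick heap limits that fit your machine):

```console
"%JAVA_HOME%\bin\java" -Xms2048m -Xmx8192m ...
```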

@gravatite

For reference, we've been tracking rough numbers as validation progresses by tailing the report file and occasionally noting the wall time. Here's a plot of the slowdown we're seeing as it progresses through the 95k products in the bundle.
[Plot: product validation rate vs. products validated, showing throughput dropping as more products are processed]

@lylehuber

BTW, I had already boosted the Java memory allocation to `"%JAVA_HOME%\bin\java" -Xms2048m -Xmx4096m ...`

@al-niessner
Contributor

@jordanpadams

Do we have a profiling tool? I still have a 12,000-label bundle that may show where data is being stored and what is causing the slowdown (most likely a list that is iterated over and over, getting slower as the array grows; the kind of pattern sketched below).
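Something like this is what I have in mind (hypothetical names, not validate's actual code): a collection that grows with every product and is re-scanned for each new one, giving O(n²) work overall, so throughput falls as the run proceeds.

```java
import java.util.ArrayList;
import java.util.List;

public class GrowingListSketch {
    private final List<String> seen = new ArrayList<>();

    public void checkProduct(String lidvid) {
        // Linear scan over an ever-growing list: product n pays for the
        // n-1 products before it, so the per-product cost keeps rising.
        if (seen.contains(lidvid)) {
            System.out.println("duplicate identifier: " + lidvid);
        }
        seen.add(lidvid); // the list never shrinks, so memory grows too
    }

    public static void main(String[] args) {
        GrowingListSketch sketch = new GrowingListSketch();
        for (int i = 0; i < 95_000; i++) {
            sketch.checkProduct("urn:nasa:pds:example:product_" + i);
        }
    }
}
```

If something of this shape is the culprit, swapping the list for a HashSet (constant-time membership checks) would flatten the curve without changing behavior.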

@jordanpadams
Member

@alexdunnjpl I believe you have been doing some profiling on harvest/registry-mgr? Is there a tool you are using?

@alexdunnjpl
Contributor

@jordanpadams that was on sweepers and was (amenable to being) a bit more primitive than proper profiling.

@al-niessner I'd recommend looking at JetBrains' built-in profiling tools if you don't have something else in mind.

https://blog.jetbrains.com/idea/2021/05/get-started-with-profiling-in-intellij-idea/
https://www.jetbrains.com/help/idea/profiler-intro.html (this summary page links to the actual info)

@jordanpadams
Member

@al-niessner another one from @nutjob4life: https://visualvm.github.io/
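As a lightweight complement to a full profiler, the JDK's own tools can show which classes are piling up while validate runs; for example (the pid being whatever jps reports for the running validate JVM):

```console
jps -l                          # find the pid of the running validate JVM
jcmd <pid> GC.class_histogram   # class histogram: instance counts and bytes per class
```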

@al-niessner
Contributor

Found two memory leaks. One is fixed, and fixing it greatly improves validate's performance in terms of both memory and CPU (the slowdown as more items are processed). The second leak is more insidious and will take a while to root out, as it is part of Xerces or, more likely, our incorrect use of it. It is linear growth directly proportional to the size and number of XML files processed. Discovering how to make the now-dead objects available to the garbage collector is going to take some time. I will note that the remaining leak eats up about 1 MB per XML file processed for my dataset, but it does not cause processing time to slow as it accumulates.

@jordanpadams
Member

@al-niessner great news! if you want to toss up a PR for the first leak, we can start with that.

@al-niessner
Contributor

It turns out, not too surprisingly, that they are related. LabelValidator was the source of the first problem: after running the schematron check it needed to call reset() on the schematron checker, otherwise the checker remembered its past work, taking longer (doing more) on each label and accumulating cruft. The second is that one or more of the various LabelValidator or LocationValidator instances is holding the billions of document fragments in memory, because the validators are singletons. I am trying to break that up now so that the garbage collector can do its job.
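For reference, the reset-after-use pattern looks roughly like this. The sketch uses the JAXP javax.xml.validation.Validator (which also exposes reset()) rather than validate's actual schematron checker, and the schema path is hypothetical:

```java
import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class ResetAfterUseSketch {
    public static void main(String[] args) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new File("PDS4_PDS_1J00.xsd")); // hypothetical schema path
        Validator validator = schema.newValidator();       // cached and reused across labels

        for (String label : args) {
            validator.validate(new StreamSource(new File(label)));
            // Without this, a reused validator carries state from one label to
            // the next; reset() returns it to its freshly created state.
            validator.reset();
        }
    }
}
```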

@al-niessner
Contributor

al-niessner commented Feb 29, 2024

@jordanpadams

Okay, I finally ran this to ground, and the answer is not a happy one. It seems the schematron tools from Apache really want to be released rather than cached. All of the memory leaks are tied to the cached schematron validators: despite calling reset() on them, they still hoard tens of thousands of document fragments (held through weak references) as they process labels. Since LocationValidator and LabelValidationRule tag-team to keep those references alive, it may be necessary to stop caching and reusing the schematron validators. I will do a second sweep to make sure I have found all the places the validators are called and add more reset() calls, but more likely we will need to remove the caching. For those lurking: the caching exists to avoid downloading the schema every time, so removing it may speed things up by fixing the memory problem but slow things down because the network gets slaughtered. Can cover in detail during today's breakout if desired.

Maybe we download the schematron into memory as a byte array and then feed it to a freshly created validator on every iteration; that keeps the number of downloads down while letting each validator disgorge its document fragments.
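Roughly that shape (hypothetical names and plumbing; validate's real schematron handling differs, and the sketch uses XSD validation to stand in for schematron): each schema URL is downloaded once into a byte[] cache, and a throwaway validator is built from the cached bytes for every label, so whatever fragments it accumulates become garbage as soon as it goes out of scope.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class CachedBytesFreshValidatorSketch {
    // One download per schema URL for the whole run.
    private final Map<String, byte[]> schemaBytes = new HashMap<>();

    private byte[] fetch(String url) throws Exception {
        byte[] cached = schemaBytes.get(url);
        if (cached == null) {
            try (InputStream in = new URL(url).openStream()) {
                cached = in.readAllBytes();
            }
            schemaBytes.put(url, cached);
        }
        return cached;
    }

    public void validateLabel(File label, String schemaUrl) throws Exception {
        // A brand-new validator per label, built from the cached bytes, so its
        // accumulated document fragments are collectable after each call.
        Validator validator = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new ByteArrayInputStream(fetch(schemaUrl))))
                .newValidator();
        validator.validate(new StreamSource(label));
    }
}
```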

@lylehuber

I always run against a local copy of the schema and schematron files, so the millions of files being validated do not have to go out across the internet to find their schemas.

@gravatite

In our case, less than 2 MiB of .xsd and .sch files are needed (we already cache them on the local filesystem so that our pipeline's validation of individual products as they are made, and any reprocessing efforts, do not annoy the PDS web servers with a ton of requests). Trading 2 MiB of RAM for a cache, to avoid the leak/slowdown that makes our bundle validation take 25 GiB of RAM and run for more than 40 hours, certainly gets our vote. 🙃

@al-niessner
Contributor

@jordanpadams

I was able to move to caching the document and recreating the schematron validators on the fly. Yesterday, at 300 labels in, memory usage was 300 MB. Now it is 50 MB. Calling it done.

@jordanpadams
Member

@al-niessner awesome! how was the speed with recreating schematron validators? hopefully not significantly impacted?

@gravatite

Is there a pre-release build somewhere that we could try out on the ShadowCam bundle?

@al-niessner
Contributor

@jordanpadams

Not too bad, as best I can tell. I do not have a baseline, nor did I write anything down, but yesterday it seemed to be about 400 labels every 15 minutes; now it is about 300. I am still chugging through the 12,000 labels in my test bundle and am at 650 after 30 minutes. Some labels go faster than others, so it is hard to say precisely. I do have one earlier data point for the same bundle, though I do not remember what flags I used for that run (maybe content validation was ignored), so it will only give me a rough estimate. I may have to restart shortly because this laptop needs to be geo-relocated, but it can run all night after it moves.

At this point, the best remaining way to speed things up is to validate labels in parallel. Doing that will be a giant job given how many singletons are used to maintain global state, but it should still be doable in a couple of months. We probably cannot do it and preserve the log order, but we might be able to keep the output for different labels from interleaving; a rough sketch of that shape is below.
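Roughly what I have in mind (hypothetical names, not validate's code): a fixed thread pool, one validator per worker thread since JAXP Validator instances are not thread safe, and each label's result printed as one unit so labels never interleave even though the overall ordering is no longer deterministic.

```java
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class ParallelLabelSketch {
    public static void main(String[] args) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new File("PDS4_PDS_1J00.xsd")); // hypothetical schema path

        // Schema objects are shareable; Validator is not thread safe,
        // so each worker thread gets its own instance.
        ThreadLocal<Validator> validators = ThreadLocal.withInitial(schema::newValidator);

        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (String label : args) {
            pool.submit(() -> {
                String result;
                try {
                    validators.get().validate(new StreamSource(new File(label)));
                    result = "PASS: " + label;
                } catch (Exception e) {
                    result = "FAIL: " + label + " - " + e.getMessage();
                }
                // Emit each label's result as a single unit so output never
                // interleaves, even though completion order is nondeterministic.
                synchronized (System.out) {
                    System.out.println(result);
                }
            });
        }
        pool.shutdown();
    }
}
```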

@al-niessner
Contributor

At 1000 labels in 50 minutes and still 50 MB.

@jordanpadams
Member

@al-niessner awesome. thanks for tracking this down. this will do for now.

Per the parallel execution: the problem we ran into when we tried this many years ago is that the Java XML libraries do not appear to be thread safe, so we would need to get creative in figuring all of that out.

@al-niessner
Contributor

4500 labels (135 min) and 65 MB. There is some growth from saving data in the TargetRegistrar for later checks, such as verifying that all references are covered. Processing time should be roughly constant based on my measurements: the run started at about 20 labels/min, but this reading works out to more like 33/min (4500 labels / 135 min), so there must have been some quicker labels in there.

@jordanpadams
Member

Skipping I&T. This is very hard to test, and developers have done rigorous regression testing here.
