validate is slow or runs out of memory when validating a bundle #826

Closed
kbowley-asu opened this issue Feb 12, 2024 · 22 comments · Fixed by #845

@kbowley-asu

Checked for duplicates

No - I haven't checked

🐛 Describe the bug

When I attempt to validate the labels for a bundle of ~95k items using --skip-content-validation, validation gets slower and slower the longer it runs. We needed to write a custom script to launch validate in order to keep it from running out of heap memory.

🕵️ Expected behavior

I expected validate to run in a predictable amount of time and to not run out of heap memory while doing so.

📜 To Reproduce

  1. Have a large dataset being prepared for release
  2. Run validate with the rule pds4.bundle and --skip-content-validation (because ain't nobody got time for that); a representative invocation is sketched after these steps
  3. Watch the world burn... or at least get depressed while staring at deadlines
    ...
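A representative invocation, assuming the usual -R/--rule and -t/--target options (the target path here is illustrative):

```console
validate -R pds4.bundle --skip-content-validation -t /path/to/bundle/bundle_label.xml
```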

🖥 Environment Info

  • Version of this software: 3.4.1
  • Operating system: Linux; Java: OpenJDK 11.0.21
    ...

📚 Version of Software Used

No response

🩺 Test Data / Additional context

No response

🦄 Related requirements

No response

⚙️ Engineering Details

No response

@lylehuber

lylehuber commented Feb 13, 2024

Same problem here. Even running it in the background, it hogs all the CPU. (Though in my case I had not turned off content validation and was not using pds4.bundle.)

@jordanpadams
Member

@kbowley-asu @lylehuber apologies for the performance degradation. The latest version of validate included several updates that greatly improve referential integrity checking, but they may have negatively impacted performance. We will do some performance benchmarking and hopefully find and resolve the issue. In the meantime, you may want to increase the Java memory allocation in the validate or validate.bat launcher scripts. See the "java.lang.OutOfMemoryError" section of our Common Errors documentation for how to update the memory: https://nasa-pds.github.io/validate/operate/errors.html
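For example (an illustrative sketch of the launcher edit; keep whatever else is already on the java line in your copy, and pick heap limits that fit your machine):

```console
"%JAVA_HOME%\bin\java" -Xms2048m -Xmx8192m ...
```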

@gravatite

For reference, we've been tracking rough numbers as validation progresses by tailing the report file and occasionally noting the wall time. Here's a plot of the slowdown we're seeing as it progresses through the 95k products in the bundle.
[Plot: product validation rate vs. products validated, showing throughput dropping as more products are processed]

@lylehuber

BTW, I had already boosted the Java memory allocation to `"%JAVA_HOME%\bin\java" -Xms2048m -Xmx4096m ...`

@al-niessner
Contributor

@jordanpadams

Do we have a profiling tool? I still have a 12,000-label bundle that may show where data is being stored and what is causing the slowdown (most likely a list that is iterated over and over, getting slower as the array grows; the kind of pattern sketched below).
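Something like this is what I have in mind (hypothetical names, not validate's actual code): a collection that grows with every product and is re-scanned for each new one, giving O(n²) work overall, so throughput falls as the run proceeds.

```java
import java.util.ArrayList;
import java.util.List;

public class GrowingListSketch {
    private final List<String> seen = new ArrayList<>();

    public void checkProduct(String lidvid) {
        // Linear scan over an ever-growing list: product n pays for the
        // n-1 products before it, so the per-product cost keeps rising.
        if (seen.contains(lidvid)) {
            System.out.println("duplicate identifier: " + lidvid);
        }
        seen.add(lidvid); // the list never shrinks, so memory grows too
    }

    public static void main(String[] args) {
        GrowingListSketch sketch = new GrowingListSketch();
        for (int i = 0; i < 95_000; i++) {
            sketch.checkProduct("urn:nasa:pds:example:product_" + i);
        }
    }
}
```

If something of this shape is the culprit, swapping the list for a HashSet (constant-time membership checks) would flatten the curve without changing behavior.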

@jordanpadams
Member

@alexdunnjpl I believe you have been doing some profiling on harvest/registry-mgr? Is there a tool you are using?

@alexdunnjpl
Contributor

@jordanpadams that was on sweepers and was (amenable to being) a bit more primitive than proper profiling.

@al-niessner I'd recommend looking at JetBrains' built-in profiling tools if you don't have something else in mind.

https://blog.jetbrains.com/idea/2021/05/get-started-with-profiling-in-intellij-idea/
https://www.jetbrains.com/help/idea/profiler-intro.html (this summary page links to the actual info)

@jordanpadams
Member

@al-niessner another one from @nutjob4life: https://visualvm.github.io/
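As a lightweight complement to a full profiler, the JDK's own tools can show which classes are piling up while validate runs; for example (the pid being whatever jps reports for the running validate JVM):

```console
jps -l                          # find the pid of the running validate JVM
jcmd <pid> GC.class_histogram   # class histogram: instance counts and bytes per class
```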

@al-niessner
Contributor

Found two memory leaks. One is fixed, and fixing it greatly improves validate's performance in terms of both memory and CPU (the slowdown as more items are processed). The second leak is more insidious and will take a while to root out, as it is part of Xerces or, more likely, our incorrect use of it. It is linear growth directly proportional to the size and number of XML files processed. Discovering how to make the now-dead objects available to the garbage collector is going to take some time. I will note that the remaining leak eats up about 1 MB per XML file processed for my dataset, but it does not cause processing time to slow as it accumulates.

@jordanpadams
Member

@al-niessner great news! if you want to toss up a PR for the first leak, we can start with that.

@al-niessner
Contributor

It turns out, not too surprisingly, that they are related. LabelValidator was the source of the first problem: after running the schematron check it needed to call reset() on the schematron checker, otherwise the checker remembered its past work, taking longer (doing more) on each label and accumulating cruft. The second is that one or more of the various LabelValidator or LocationValidator instances is holding the billions of document fragments in memory, because the validators are singletons. I am trying to break that up now so that the garbage collector can do its job.
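For reference, the reset-after-use pattern looks roughly like this. The sketch uses the JAXP javax.xml.validation.Validator (which also exposes reset()) rather than validate's actual schematron checker, and the schema path is hypothetical:

```java
import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class ResetAfterUseSketch {
    public static void main(String[] args) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new File("PDS4_PDS_1J00.xsd")); // hypothetical schema path
        Validator validator = schema.newValidator();       // cached and reused across labels

        for (String label : args) {
            validator.validate(new StreamSource(new File(label)));
            // Without this, a reused validator carries state from one label to
            // the next; reset() returns it to its freshly created state.
            validator.reset();
        }
    }
}
```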

@al-niessner
Contributor

al-niessner commented Feb 29, 2024

@jordanpadams

Okay, I finally ran this to ground, and the answer is not a happy one. It seems the schematron tools from Apache really want to be released rather than cached. All of the memory leaks are tied to the cached schematron validators: despite calling reset() on them, they still hoard tens of thousands of document fragments (held through weak references) as they process labels. Since LocationValidator and LabelValidationRule tag-team to keep those references alive, it may be necessary to stop caching and reusing the schematron validators. I will do a second sweep to make sure I have found all the places the validators are called and add more reset() calls, but more likely we will need to remove the caching. For those lurking: the caching exists to avoid downloading the schema every time, so removing it may speed things up by fixing the memory problem but slow things down because the network gets slaughtered. Can cover in detail during today's breakout if desired.

Maybe we download the schematron into memory as a byte array and then feed it to a freshly created validator on every iteration; that keeps the number of downloads down while letting each validator disgorge its document fragments.
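Roughly that shape (hypothetical names and plumbing; validate's real schematron handling differs, and the sketch uses XSD validation to stand in for schematron): each schema URL is downloaded once into a byte[] cache, and a throwaway validator is built from the cached bytes for every label, so whatever fragments it accumulates become garbage as soon as it goes out of scope.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class CachedBytesFreshValidatorSketch {
    // One download per schema URL for the whole run.
    private final Map<String, byte[]> schemaBytes = new HashMap<>();

    private byte[] fetch(String url) throws Exception {
        byte[] cached = schemaBytes.get(url);
        if (cached == null) {
            try (InputStream in = new URL(url).openStream()) {
                cached = in.readAllBytes();
            }
            schemaBytes.put(url, cached);
        }
        return cached;
    }

    public void validateLabel(File label, String schemaUrl) throws Exception {
        // A brand-new validator per label, built from the cached bytes, so its
        // accumulated document fragments are collectable after each call.
        Validator validator = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new ByteArrayInputStream(fetch(schemaUrl))))
                .newValidator();
        validator.validate(new StreamSource(label));
    }
}
```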

@lylehuber

I always run against a local copy of the schema and schematron files, so the millions of files being validated do not have to go out across the internet to find their schemas.

@gravatite

In our case, less than 2 MiB of .xsd and .sch files are needed (we already cache them on the local filesystem so that our pipeline's validation of individual products as they are made, and any reprocessing efforts, do not annoy the PDS web servers with a ton of requests). Trading 2 MiB of RAM for a cache, to avoid the leak/slowdown that makes our bundle validation take 25 GiB of RAM and run for more than 40 hours, certainly gets our vote. 🙃

@al-niessner
Contributor

@jordanpadams

I was able to move to caching the document and recreating the schematron validators on the fly. Yesterday, at 300 labels in, memory usage was 300 MB. Now it is 50 MB. Calling it done.

@jordanpadams
Member

@al-niessner awesome! how was the speed with recreating schematron validators? hopefully not significantly impacted?

@gravatite

Is there a pre-release build somewhere that we could try out on the ShadowCam bundle?

@al-niessner
Contributor

@jordanpadams

Not too bad, as best I can tell. I do not have a baseline, nor did I write anything down, but yesterday it seemed to be about 400 labels every 15 minutes; now it is about 300. I am still chugging through the 12,000 labels in my test bundle and am at 650 after 30 minutes. Some labels go faster than others, so it is hard to say precisely. I do have one earlier data point for the same bundle, though I do not remember what flags I used for that run (maybe content validation was ignored), so it will only give me a rough estimate. I may have to restart shortly because this laptop needs to be geo-relocated, but it can run all night after it moves.

At this point, the best remaining way to speed things up is to validate labels in parallel. Doing that will be a giant job given how many singletons are used to maintain global state, but it should still be doable in a couple of months. We probably cannot do it and preserve the log order, but we might be able to keep the output for different labels from interleaving; a rough sketch of that shape is below.
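Roughly what I have in mind (hypothetical names, not validate's code): a fixed thread pool, one validator per worker thread since JAXP Validator instances are not thread safe, and each label's result printed as one unit so labels never interleave even though the overall ordering is no longer deterministic.

```java
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class ParallelLabelSketch {
    public static void main(String[] args) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new File("PDS4_PDS_1J00.xsd")); // hypothetical schema path

        // Schema objects are shareable; Validator is not thread safe,
        // so each worker thread gets its own instance.
        ThreadLocal<Validator> validators = ThreadLocal.withInitial(schema::newValidator);

        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (String label : args) {
            pool.submit(() -> {
                String result;
                try {
                    validators.get().validate(new StreamSource(new File(label)));
                    result = "PASS: " + label;
                } catch (Exception e) {
                    result = "FAIL: " + label + " - " + e.getMessage();
                }
                // Emit each label's result as a single unit so output never
                // interleaves, even though completion order is nondeterministic.
                synchronized (System.out) {
                    System.out.println(result);
                }
            });
        }
        pool.shutdown();
    }
}
```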

@al-niessner
Contributor

At 1000 labels in 50 minutes and still 50 MB.

@jordanpadams
Member

@al-niessner awesome. thanks for tracking this down. this will do for now.

Per the parallel execution: the problem we ran into when we tried this many years ago is that the Java XML libraries do not appear to be thread safe, so we would need to get creative in figuring all of that out.

@al-niessner
Contributor

4500 labels (135 min) and 65 MB. There is some growth from saving data in the TargetRegistrar for later checks, such as verifying that all references are covered. Processing time should be roughly constant based on my measurements: the run started at about 20 labels/min, but this reading works out to more like 33/min (4500 labels / 135 min), so there must have been some quicker labels in there.

@jordanpadams
Member

Skipping I&T. This is very hard to test, and developers have done rigorous regression testing here.
