validate is slow or runs out of memory when validating a bundle #826
Comments
Same problem here. Even running it in the background, it hogs all the CPU. (Except I hadn't turned off content validation but wasn't using pds4.bundle.)
@kbowley-asu @lylehuber apologies for the performance degradation. We had several updates in the latest version of validate that greatly improve functionality for referential integrity checking, but they may have negatively impacted performance. We will do some performance benchmarking and hopefully find and resolve the issue. In the meantime, you may want to increase the Java memory allocation in the validate launcher script.
BTW, I had already boosted the Java memory allocation to `"%JAVA_HOME%\bin\java" -Xms2048m -Xmx4096m ...`
Do we have a profiling tool? I still have the 12000-label bundle, which may show where data is being stored and what is causing the slowdown (most likely a list that is iterated over and over, slowing as the array gets big).
@alexdunnjpl I believe you have been doing some profiling on harvest/registry-mgr? Is there a tool you are using?
@jordanpadams that was on sweepers and was (amenable to being) a bit more primitive than proper profiling. @al-niessner I'd recommend looking at JetBrains' built-in profiling tools if you don't have something else in mind. https://blog.jetbrains.com/idea/2021/05/get-started-with-profiling-in-intellij-idea/
@al-niessner another one from @nutjob4life: https://visualvm.github.io/
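If a full profiler feels like overkill, a crude alternative is to log heap usage as labels are processed and watch whether it grows with the file count. This is only an illustrative sketch (not something used in this thread), with a hypothetical `HeapLogger` helper name:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Illustrative "poor-man's profiling": print heap usage every N labels to see
// whether memory grows in proportion to the number of files processed.
public final class HeapLogger {
    private static final MemoryMXBean MEMORY = ManagementFactory.getMemoryMXBean();

    public static void log(int labelsProcessed) {
        long usedMb = MEMORY.getHeapMemoryUsage().getUsed() / (1024 * 1024);
        System.err.printf("labels=%d heapUsedMB=%d%n", labelsProcessed, usedMb);
    }
}
```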
Found two memory leaks. One is fixed and greatly improves the performance of validate in terms of memory and CPU (it was slowing down as more items were processed). The second leak is more insidious and will take a while to root out, as it is part of Xerces or, more likely, our incorrect use of it. It is a linear growth directly proportional to the size and number of XML files processed. Discovering how to make the now-dead objects available to the garbage collector is going to take some time. I will note that the remaining leak eats up about 1 MB per XML file processed for my dataset but does not cause the processing time to slow down as it accumulates.
@al-niessner great news! if you want to toss up a PR for the first leak, we can start with that.
Turns out, without much surprise, that they are related. LabelValidator was the source of the first problem: after running the schematron check it needed to call reset() on the schematron checker, otherwise the checker remembered its past, taking longer (doing more) and accumulating cruft. The second is that one or more of the various LabelValidator or LocationValidator instances is holding the billions of document fragments in memory because the validators are singletons. Trying to break that now so that the garbage collector can do its job.
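For context, here is a minimal, hypothetical sketch of the reset-after-use pattern described above, using the plain `javax.xml.validation` API as a stand-in for validate's own schematron checker classes (the real class and method names differ):

```java
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

// A cached checker that is reused across labels must be reset() between runs,
// otherwise it keeps accumulating state from previously processed documents.
public class CachedChecker {
    private final Validator validator;

    public CachedChecker(StreamSource schemaSource) throws Exception {
        Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(schemaSource);
        this.validator = schema.newValidator();
    }

    public void check(StreamSource label) throws Exception {
        try {
            validator.validate(label);
        } finally {
            // Clear per-document state so the cached instance does not slow
            // down and hoard memory as more labels are processed.
            validator.reset();
        }
    }
}
```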
Okay, finally ran this to ground and the answer is not a happy one. It seems that the schematron tools from Apache really want to be released rather than cached. All the memory leaks are related to the cached schematron validators. Despite calling reset() on them, they still hoard (maybe horde too) tens of thousands of document fragments, held through weak references, as they process. Since LocationValidator and LabelValidationRule tag-team to keep these references active, it may be necessary to not cache or reuse the schematron validators. I will do a second sweep to make sure I found all of the places where the validators are called and add more reset() calls, but it is more likely that we will need to remove the caching. To those lurking: the caching is there to prevent downloading the schema each time, which means fixing the memory leak may speed things up but slow them down again because the network gets slaughtered. Can cover in detail during today's breakout if desired. Maybe we download the schematron into memory as a byte array, then feed it to new validators on every iteration to reduce the number of downloads and let the validator disgorge its document fragments.
I always run against a local copy of the schema and schematron files so that I don't have to worry about millions of files being validated each needing to find schemas across the internet.
In our case, it's less than 2 MiB of xsd & sch files needed (we already cache them on the local filesystem to avoid a ton of web requests when our pipeline validates individual products as they're being made, to keep any reprocessing efforts from annoying the PDS web servers). Trading 2 MiB of RAM for cache to avoid the leak/slowdown that makes our bundle validate take 25 GiB of RAM and run for >40 hours certainly gets our vote. 🙃
Was able to move to caching the document and recreating schematron validators on the fly. Yesterday, at 300 labels, memory usage was 300 MB. Now it is 50 MB. Calling it done.
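For anyone curious about the shape of this fix, here is a minimal, hypothetical sketch of the pattern being described, assuming the schematron has already been compiled to XSLT and is run through plain JAXP; the actual classes in validate differ:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;

// Cache only the compiled schematron (or its raw bytes), and create a
// throwaway Transformer per label so nothing it holds (document fragments,
// etc.) outlives the label being checked.
public class SchematronCache {
    private final Templates compiled; // thread-safe, holds no per-run state

    public SchematronCache(byte[] schematronXslt) throws Exception {
        this.compiled = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new ByteArrayInputStream(schematronXslt)));
    }

    public DOMResult check(StreamSource label) throws Exception {
        DOMResult report = new DOMResult();
        // New Transformer per label: cheap to create from Templates, and
        // eligible for garbage collection as soon as this method returns.
        compiled.newTransformer().transform(label, report);
        return report;
    }

    // Download once, keep the bytes, avoid hammering the schema servers.
    public static byte[] fetch(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return in.readAllBytes();
        }
    }
}
```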
@al-niessner awesome! how was the speed with recreating schematron validators? hopefully not significantly impacted?
Is there a pre-release build somewhere that we could try out on the ShadowCam bundle?
Not too bad as best I can tell. I do not have a baseline, nor did I write anything down, but yesterday it seemed to be 400 labels every 15 minutes; now it is about 300. I am still chugging through the 12000 in my test bundle and am at 650 after 30 minutes. Some go faster than others, so it is hard to really say. I have one fixed data point here, but I do not remember what flags I used for that run (maybe it ignored content); it is the same bundle, though, so it will give me some kind of estimate. I may have to restart it shortly because this laptop needs to be geo-relocated, but it will run all night after it moves again. At this point, the best you can do to speed things up is process labels in parallel. Doing that will be a giant problem given how many singletons there are maintaining global state. Still, it would be doable in a couple of months. Probably cannot do it and maintain the log order, but might be able to keep labels from interleaving.
At 1000 labels in 50 minutes and still 50 MB.
@al-niessner awesome. thanks for tracking this down. this will do for now. per the parallel execution, the problem we ran into when we tried this many years ago was that the Java XML libraries do not appear to be thread-safe, so we would need to get creative on figuring all that out.
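One common workaround for that thread-safety issue (purely illustrative, not anything in validate today): give each worker thread its own parser via ThreadLocal, since the JAXP factories and builders are not thread-safe but per-thread instances are fine. A minimal sketch, with hypothetical class names:

```java
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Each worker thread owns its own DocumentBuilder, sidestepping the
// thread-safety problem without global locks.
public class ParallelLabelParser {
    private static final ThreadLocal<DocumentBuilder> BUILDER =
            ThreadLocal.withInitial(() -> {
                try {
                    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                    factory.setNamespaceAware(true);
                    return factory.newDocumentBuilder();
                } catch (Exception e) {
                    throw new IllegalStateException(e);
                }
            });

    public static void parseAll(List<Path> labels, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Path label : labels) {
            pool.submit(() -> {
                try {
                    // Each thread parses with its own builder; validation would
                    // also need per-thread (non-singleton) validator state.
                    Document doc = BUILDER.get().parse(label.toFile());
                } catch (Exception e) {
                    System.err.println(label + ": " + e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }
}
```

Keeping the log readable would still require buffering each label's messages and flushing them only when that label finishes, as noted above.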
4500 labels (135 min) and 65 MB. There is some growth from saving data in the TargetRegistrar for later checks, like verifying that all references are covered. Processing time should be constant based on my measurements: it started at 20 labels/min but this reading is more like 30/min, so there must have been some quicker ones.
Skipping I&T. This is very hard to test, and developers have done rigorous regression testing here. |
Checked for duplicates
No - I haven't checked
🐛 Describe the bug
When validating the labels for a bundle with ~95k items using --skip-content-validation, validate gets slower and slower the longer it runs. We needed to make a custom script to start validate in order to not run out of heap memory.
🕵️ Expected behavior
I expected validate to run in a predictable amount of time and not run out of heap memory while doing it.
📜 To Reproduce
...
🖥 Environment Info
...
📚 Version of Software Used
No response
🩺 Test Data / Additional context
No response
🦄 Related requirements
No response
⚙️ Engineering Details
No response