Validate not retrying on sch load failure #903
Comments
Oddly, I reran the same thing on a different machine today (pdsimg-int1) and did not have the same errors. So I don't know if it has something to do with that machine (which is admittedly configured oddly) or if something on the server cleared up since last night. Nevertheless, the request would be to retry the load a couple of times before quitting. I just have no idea how you'd reproduce the issue for testing.
I verified in the code that validate does not retry. Given the inordinate number of successes except when the host site is down, I'm not sure retry is of great value here. The actual failure was the schematron host killing the connection partway through the handshake. I get that trying again might just work out. Then again, the inordinate number of successes suggests that failures are just that and will likely do the same on a repeat. The inordinate success rate also suggests that some layer below validate itself is already robustly working around the most common snags, just not these failures. It would be pretty easy to do retries. If we set it at 3 and it still fails, do we then make it 5? Then 17? When do retries become a failure, especially given, as I have already pointed out, the inordinate success rate? Just for those lurking, I claim the inordinate success rate because the unit tests load tens to hundreds of schematron files from the network hosts. Those unit tests have run thousands of times now. They are so reliable at loading that we expect, not hope, that they download without a problem. The once or twice when the host site was down were the obvious exceptions.
@al-niessner let's do 3 retries. I agree this may not fix the issue, but there is a chance it will, considering the issue was encountered running validate in parallel, versus our tests, which are synchronous.
Certainly 3 would eliminate most if not all transient errors. If it still fails after that, it's probably a hard failure.
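A minimal sketch of what the 3-retry behavior could look like, assuming a hypothetical loadOnce helper that stands in for validate's existing single-shot schematron download (the real method names and error handling in the validate code base may differ):

import java.io.IOException;
import java.net.URL;

// Hypothetical wrapper around the existing schematron loader; names are
// illustrative, not the actual validate API.
public final class RetryingSchematronLoader {

    private static final int MAX_ATTEMPTS = 3;

    public static byte[] loadWithRetry(URL schematronUrl) throws IOException {
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return loadOnce(schematronUrl);      // existing single-shot load
            } catch (IOException e) {                // broken pipe, handshake reset, ...
                lastFailure = e;
                try {
                    Thread.sleep(1000L * attempt);   // simple linear backoff between attempts
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw lastFailure;                           // hard failure after 3 attempts
    }

    private static byte[] loadOnce(URL url) throws IOException {
        try (var in = url.openStream()) {
            return in.readAllBytes();
        }
    }
}

The backoff keeps the three attempts from hammering a host that just dropped the connection; a transient handshake failure usually clears within a second or two, while a genuinely down host still fails quickly.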
Skipping I&T. Sporadic errors that are hard to test, and low severity.
Checked for duplicates
No - I haven't checked
🐛 Describe the bug
I was trying to run validate on MSAM2, using a wrapper around "parallel" to run one sol (directory) at a time. I kept getting fatal errors loading the schematron files. Not every run failed, so it's not a permissions/access/syntax issue, but most of them did.
[rgd@pdsimg-int1 msam2]$ grep FATAL_ERROR run_val.log | wc
855 5402 86785
[rgd@pdsimg-int1 msam2]$ grep FATAL_ERROR run_val.log | sort | uniq
FATAL_ERROR [error.label.schematron] Cannot read schematron from URL https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.sch
FATAL_ERROR [error.label.unresolvable_resource] Broken pipe
FATAL_ERROR [error.label.unresolvable_resource] Received close_notify during handshake
FATAL_ERROR [error.label.unresolvable_resource] Remote host terminated the handshake
[rgd@pdsimg-int1 msam2]$
It should have been running about 16 at once via parallel.
Note that it proceeded to validate the files anyway... but without the PDS core sch file (this is related to a prior issue of mine where it silently fails to tell you if a sch is missing from a local dir ... this is not a local dir, but same effect).
🕵️ Expected behavior
I expected it to work, of course. ;-} But more realistically, it would be good to retry failures like this a few times before reporting on it and giving up.
📜 To Reproduce
Here are the invoking scripts:
[rgd@pdsimg-int1 msam2]$ more run_val.csh
#!/bin/csh
rm jobs.txt
echo `pwd`/annex_ehlmann_caltech_msl_msam2/calibration >>jobs.txt
echo `pwd`/annex_ehlmann_caltech_msl_msam2/document >>jobs.txt
echo `pwd`/annex_ehlmann_caltech_msl_msam2/miscellaneous >>jobs.txt
find `pwd`/annex_ehlmann_caltech_msl_msam2/*/sol -maxdepth 1 -mindepth 1 | sort >>jobs.txt
cat jobs.txt | parallel -j 200% --joblog run_val.joblog "./doit.csh {}" >& run_val.log
[rgd@pdsimg-int1 msam2]$ more doit.csh
#!/bin/csh
set path = (/mnt/pdsdata/scratch/rgd/msam2/jdk-17.0.11/bin/ $path)
echo "Validating $1"
/mnt/pdsdata/scratch/rgd/msam2/validate-3.4.1/bin/validate -target $1
I don't know how reproducible this is. I had it occur on two separate days. But I do not know if it is dependent on the machine on which I was running, or the specifics of the data.
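One way to try to reproduce the load failures independently of the MSAM2 data (a sketch only, not something that has been run as part of this report) is to fetch the same schematron URL from many threads at once, roughly mimicking the ~16 concurrent validate processes that parallel was launching, and watch for the same broken-pipe/handshake errors:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Rough reproduction attempt: 16 concurrent fetches of the same schematron,
// looking for handshake/broken-pipe failures. Not part of validate itself.
public class SchematronLoadStress {
    public static void main(String[] args) throws Exception {
        var url = URI.create("https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.sch");
        var client = HttpClient.newHttpClient();
        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (int i = 0; i < 16; i++) {
            pool.submit(() -> {
                try {
                    HttpResponse<String> r = client.send(
                        HttpRequest.newBuilder(url).build(),
                        HttpResponse.BodyHandlers.ofString());
                    System.out.println("status " + r.statusCode());
                } catch (Exception e) {
                    System.out.println("FAILED: " + e);  // the kind of error seen in run_val.log
                }
            });
        }
        pool.shutdown();
    }
}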
🖥 Environment Info
I was using the JPL-IMG on-prem machine "pdsimg-analytics".
[rgd@pdsimg-analytics msam2]$ uname -a
Linux pdsimg-analytics 3.10.0-1160.80.1.el7.x86_64 #1 SMP Sat Oct 8 18:13:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
As you can see from the command lines above, it's JDK 17.0.11 and Validate 3.4.1.
📚 Version of Software Used
validate 3.4.1
🩺 Test Data / Additional context
No response
🦄 Related requirements
🦄 #xyz
⚙️ Engineering Details
No response
🎉 Integration & Test
No response