Validate not retrying on sch load failure #903

Closed
rgdeen opened this issue May 21, 2024 · 5 comments · Fixed by #907

@rgdeen

rgdeen commented May 21, 2024

Checked for duplicates

No - I haven't checked

🐛 Describe the bug

I was trying to run validate on MSAM2, using a wrapper around "parallel" to run one sol (directory) at a time. I kept getting fatal errors loading the schematron files. Not every run failed, so it's not a permissions/access/syntax issue, but most of them did.

[rgd@pdsimg-int1 msam2]$ grep FATAL_ERROR run_val.log | wc
855 5402 86785
[rgd@pdsimg-int1 msam2]$ grep FATAL_ERROR run_val.log | sort | uniq
FATAL_ERROR [error.label.schematron] Cannot read schematron from URL https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.sch
FATAL_ERROR [error.label.unresolvable_resource] Broken pipe
FATAL_ERROR [error.label.unresolvable_resource] Received close_notify during handshake
FATAL_ERROR [error.label.unresolvable_resource] Remote host terminated the handshake
[rgd@pdsimg-int1 msam2]$

It should have been running about 16 at once via parallel.

Note that it proceeded to validate the files anyway... but without the PDS core sch file (this is related to a prior issue of mine, where validate silently doesn't tell you if a sch is missing from a local dir ... this is not a local dir, but it has the same effect).

🕵️ Expected behavior

I expected it to work, of course. ;-} But more realistically, it would be good to retry failures like this a few times before reporting them and giving up.
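
For illustration only, here is a minimal sketch of the kind of retry I mean, in Java since validate is a Java tool. The class name, helper name, attempt count, and backoff are all assumptions on my part, not validate's actual code:

import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SchematronFetchSketch {

    // Hypothetical helper: try the download a few times before reporting FATAL_ERROR.
    static InputStream readWithRetry(String url, int maxAttempts)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                HttpResponse<InputStream> response =
                        client.send(request, HttpResponse.BodyHandlers.ofInputStream());
                if (response.statusCode() == 200) {
                    return response.body();
                }
                last = new IOException("HTTP " + response.statusCode() + " from " + url);
            } catch (IOException e) {              // broken pipe, TLS handshake failures, etc.
                last = e;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(1000L * attempt);     // simple linear backoff between attempts
            }
        }
        throw last;                                // only now surface the fatal error
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = readWithRetry(
                "https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.sch", 3)) {
            System.out.println("Loaded " + in.readAllBytes().length + " bytes");
        }
    }
}

The point is just that transient network failures (broken pipe, handshake resets) get a second or third chance before validate falls back to running without the core schematron.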

📜 To Reproduce

Here are the invoking scripts:

[rgd@pdsimg-int1 msam2]$ more run_val.csh
#!/bin/csh

rm jobs.txt
echo `pwd`/annex_ehlmann_caltech_msl_msam2/calibration >>jobs.txt
echo `pwd`/annex_ehlmann_caltech_msl_msam2/document >>jobs.txt
echo `pwd`/annex_ehlmann_caltech_msl_msam2/miscellaneous >>jobs.txt
find `pwd`/annex_ehlmann_caltech_msl_msam2/*/sol -maxdepth 1 -mindepth 1 | sort >>jobs.txt

cat jobs.txt | parallel -j 200% --joblog run_val.joblog "./doit.csh {}" >& run_val.log

[rgd@pdsimg-int1 msam2]$ more doit.csh
#!/bin/csh

set path = (/mnt/pdsdata/scratch/rgd/msam2/jdk-17.0.11/bin/ $path)

echo "Validating $1"

/mnt/pdsdata/scratch/rgd/msam2/validate-3.4.1/bin/validate -target $1


I don't know how reproducible this is. I had it occur on two separate days. But I do not know if it is dependent on the machine on which I was running, or the specifics of the data.

🖥 Environment Info

I was using the JPL-IMG on-prem machine "pdsimg-analytics".

[rgd@pdsimg-analytics msam2]$ uname -a
Linux pdsimg-analytics 3.10.0-1160.80.1.el7.x86_64 #1 SMP Sat Oct 8 18:13:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

As you can see from the command lines above, it's JDK 17.0.11 and Validate 3.4.1.

📚 Version of Software Used

validate 3.4.1

🩺 Test Data / Additional context

No response

🦄 Related requirements

🦄 #xyz

⚙️ Engineering Details

No response

🎉 Integration & Test

No response

@rgdeen
Author

rgdeen commented May 22, 2024

Oddly, I reran the same thing on a different machine today (pdsimg-int1) and did not have the same errors. So I don't know if it has something to do with that machine (which is admittedly configured weird) or if something on the server cleared up since last night.

Nevertheless, the request would be to retry the load a couple times before quitting. I just have no idea how you'd reproduce the issue for testing.

@al-niessner
Contributor

@jordanpadams

I verified in the code that validate does not retry. Given the inordinate number of successes except when the host site is down, I am not sure a retry is of great value here. The actual failure was the schematron host killing the connection partway through the handshake. I get that if you try again, things might just work out. Then again, the inordinate number of successes suggests that failures are just that and will likely do the same on a repeat. The inordinate success rate also suggests that some lower level below validate itself is robustly avoiding the most common snags, just not outright failures.

It would be pretty easy to do retries. But if we set it at 3 and it still fails, do we then make it 5? Then 17? When do retries become a failure, especially given, as I have already pointed out, the inordinate success rate?

Just for those lurking, I claim an inordinate success rate because the unit testing loads tens to hundreds of schematron files from the network hosts. Those unit tests have run thousands of times now. They are so reliable at loading that we expect, not hope, that they download without a problem. The once or twice when the host site was down were the obvious exceptions.

@jordanpadams
Member

@al-niessner let's do 3 retries. I agree this may not fix the issue, but there is a chance, considering the issue was encountered running in parallel, versus our tests, which are synchronous.
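
For anyone picking this up, here is a sketch of what a bounded, jittered retry could look like. The class name, delays, and jitter values below are illustrative assumptions, not the actual change in #907:

import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Hypothetical retry policy: 3 attempts with jittered exponential backoff, so that
// parallel workers that fail at the same time do not all retry at the same instant.
public final class RetryPolicy {

    private static final int MAX_ATTEMPTS = 3;      // per the decision above
    private static final long BASE_DELAY_MS = 500;  // assumed starting delay

    public static <T> T call(Supplier<T> action) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return action.get();                 // success: hand the result back
            } catch (RuntimeException e) {
                last = e;                            // remember the failure
                if (attempt == MAX_ATTEMPTS) {
                    break;                           // out of attempts, give up
                }
                // Exponential backoff (500 ms, 1 s, ...) plus up to 250 ms of random jitter.
                long delay = BASE_DELAY_MS * (1L << (attempt - 1))
                        + ThreadLocalRandom.current().nextLong(250);
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw last;                                  // only now surface the fatal error
    }
}

A call site could wrap whatever method currently performs the download, e.g. RetryPolicy.call(() -> loadSchematron(url)), where loadSchematron is a hypothetical stand-in for that method. The jitter matters mainly because ~16 parallel jobs that fail together would otherwise all retry at the same instant.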

@rgdeen
Author

rgdeen commented May 23, 2024

Certainly 3 would eliminate most, if not all, transient errors. If it still fails after that, it's probably a hard failure.

@jordanpadams
Member

Skipping I&T. These are sporadic errors that are hard to test, and the severity is low.

Projects
Status: 🏁 Done
4 participants