Validate not retrying on sch load failure #903

Closed
rgdeen opened this issue May 21, 2024 · 5 comments · Fixed by #907

@rgdeen

rgdeen commented May 21, 2024

Checked for duplicates

No - I haven't checked

🐛 Describe the bug

I was trying to run validate on MSAM2, using a wrapper around "parallel" to run one sol (directory) at a time. I kept getting fatal errors loading the schematron files. Not every run failed, so it's not a permissions/access/syntax issue, but most of them did.

[rgd@pdsimg-int1 msam2]$ grep FATAL_ERROR run_val.log | wc
855 5402 86785
[rgd@pdsimg-int1 msam2]$ grep FATAL_ERROR run_val.log | sort | uniq
FATAL_ERROR [error.label.schematron] Cannot read schematron from URL https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.sch
FATAL_ERROR [error.label.unresolvable_resource] Broken pipe
FATAL_ERROR [error.label.unresolvable_resource] Received close_notify during handshake
FATAL_ERROR [error.label.unresolvable_resource] Remote host terminated the handshake
[rgd@pdsimg-int1 msam2]$

It should have been running about 16 at once via parallel.

Note that it proceeded to validate the files anyway... but without the PDS core sch file (this is related to a prior issue of mine, where validate silently doesn't tell you if a sch is missing from a local dir ... this is not a local dir, but it has the same effect).

🕵️ Expected behavior

I expected it to work, of course. ;-} But more realistically, it would be good to retry failures like this a few times before reporting them and giving up.
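
For illustration only, here is a minimal sketch of the kind of retry I mean, in Java since validate is a Java tool. The class name, helper name, attempt count, and backoff are all assumptions on my part, not validate's actual code:

import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SchematronFetchSketch {

    // Hypothetical helper: try the download a few times before reporting FATAL_ERROR.
    static InputStream readWithRetry(String url, int maxAttempts)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                HttpResponse<InputStream> response =
                        client.send(request, HttpResponse.BodyHandlers.ofInputStream());
                if (response.statusCode() == 200) {
                    return response.body();
                }
                last = new IOException("HTTP " + response.statusCode() + " from " + url);
            } catch (IOException e) {              // broken pipe, TLS handshake failures, etc.
                last = e;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(1000L * attempt);     // simple linear backoff between attempts
            }
        }
        throw last;                                // only now surface the fatal error
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = readWithRetry(
                "https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.sch", 3)) {
            System.out.println("Loaded " + in.readAllBytes().length + " bytes");
        }
    }
}

The point is just that transient network failures (broken pipe, handshake resets) get a second or third chance before validate falls back to running without the core schematron.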

📜 To Reproduce

Here are the invoking scripts:

[rgd@pdsimg-int1 msam2]$ more run_val.csh
#!/bin/csh

rm jobs.txt
echo `pwd`/annex_ehlmann_caltech_msl_msam2/calibration >>jobs.txt
echo `pwd`/annex_ehlmann_caltech_msl_msam2/document >>jobs.txt
echo `pwd`/annex_ehlmann_caltech_msl_msam2/miscellaneous >>jobs.txt
find `pwd`/annex_ehlmann_caltech_msl_msam2/*/sol -maxdepth 1 -mindepth 1 | sort >>jobs.txt

cat jobs.txt | parallel -j 200% --joblog run_val.joblog "./doit.csh {}" >& run_val.log

[rgd@pdsimg-int1 msam2]$ more doit.csh
#!/bin/csh

set path = (/mnt/pdsdata/scratch/rgd/msam2/jdk-17.0.11/bin/ $path)

echo "Validating $1"

/mnt/pdsdata/scratch/rgd/msam2/validate-3.4.1/bin/validate -target $1


I don't know how reproducible this is. I had it occur on two separate days. But I do not know if it is dependent on the machine on which I was running, or the specifics of the data.

🖥 Environment Info

I was using the JPL-IMG on-prem machine "pdsimg-analytics".

[rgd@pdsimg-analytics msam2]$ uname -a
Linux pdsimg-analytics 3.10.0-1160.80.1.el7.x86_64 #1 SMP Sat Oct 8 18:13:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

As you can see from the command lines above, it's JDK 17.0.11 and Validate 3.4.1.

📚 Version of Software Used

validate 3.4.1

🩺 Test Data / Additional context

No response

🦄 Related requirements

🦄 #xyz

⚙️ Engineering Details

No response

🎉 Integration & Test

No response

@rgdeen
Author

rgdeen commented May 22, 2024

Oddly, I reran the same thing on a different machine today (pdsimg-int1) and did not have the same errors. So I don't know if it has something to do with that machine (which is admittedly configured weird) or if something on the server cleared up since last night.

Nevertheless, the request would be to retry the load a couple times before quitting. I just have no idea how you'd reproduce the issue for testing.

@al-niessner
Contributor

@jordanpadams

I verified in the code that validate does not retry. Given the inordinate number of successes except when the host site is down, I am not sure a retry is of great value here. The actual failure was the schematron host killing the connection partway through the handshake. I get that if you try again, things might just work out. Then again, the inordinate number of successes suggests that failures are just that and will likely do the same on a repeat. The inordinate success rate also suggests that some lower level below validate itself is robustly avoiding the most common snags, just not outright failures.

It would be pretty easy to do retries. But if we set it at 3 and it still fails, do we then make it 5? Then 17? When do retries become a failure, especially given, as I have already pointed out, the inordinate success rate?

Just for those lurking, I claim an inordinate success rate because the unit testing loads tens to hundreds of schematron files from the network hosts. Those unit tests have run thousands of times now. They are so reliable at loading that we expect, not hope, that they download without a problem. The once or twice when the host site was down were the obvious exceptions.

@jordanpadams
Member

@al-niessner let's do 3 retries. I agree this may not fix the issue, but there is a chance, considering the issue was encountered running in parallel, versus our tests, which are synchronous.
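
For anyone picking this up, here is a sketch of what a bounded, jittered retry could look like. The class name, delays, and jitter values below are illustrative assumptions, not the actual change in #907:

import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Hypothetical retry policy: 3 attempts with jittered exponential backoff, so that
// parallel workers that fail at the same time do not all retry at the same instant.
public final class RetryPolicy {

    private static final int MAX_ATTEMPTS = 3;      // per the decision above
    private static final long BASE_DELAY_MS = 500;  // assumed starting delay

    public static <T> T call(Supplier<T> action) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return action.get();                 // success: hand the result back
            } catch (RuntimeException e) {
                last = e;                            // remember the failure
                if (attempt == MAX_ATTEMPTS) {
                    break;                           // out of attempts, give up
                }
                // Exponential backoff (500 ms, 1 s, ...) plus up to 250 ms of random jitter.
                long delay = BASE_DELAY_MS * (1L << (attempt - 1))
                        + ThreadLocalRandom.current().nextLong(250);
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw last;                                  // only now surface the fatal error
    }
}

A call site could wrap whatever method currently performs the download, e.g. RetryPolicy.call(() -> loadSchematron(url)), where loadSchematron is a hypothetical stand-in for that method. The jitter matters mainly because ~16 parallel jobs that fail together would otherwise all retry at the same instant.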

@rgdeen
Author

rgdeen commented May 23, 2024

Certainly 3 would eliminate most, if not all, transient errors. If it still fails after that, it's probably a hard failure.

@jordanpadams
Member

Skipping I&T. These are sporadic errors that are hard to test, and the severity is low.

Projects
Status: 🏁 Done
4 participants