Add remaining reproduction scripts #34
Currently running
Relevant output from
Seeing Globus errors in
There was still a Globus error, so I tried to update the authentications using the script at E3SM-Project/zstash#302 (comment) (I believe doing a small zstash transfer would have had the same effect). This seems to be the recurring issue where the Globus consents sometimes need to be entered manually.
Still running into a Globus error. Attempting the fix at E3SM-Project/zstash#322 (comment)
I'm not seeing the same error now, and I see the auth code prompt in the output file. Re-running the fix from E3SM-Project/zstash#302 (comment)
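For reference, a consent reset along these lines (a sketch, not necessarily what the linked zstash fixes do) assumes zstash caches its Globus native-app tokens in ~/.globus-native-apps.cfg; deleting that file should force fresh consent prompts on the next command that touches Globus. The endpoint and archive path are placeholders.

```bash
# Assumption: zstash keeps its cached Globus tokens/consents in this file;
# removing it forces the auth-code prompts to appear again.
rm -f ~/.globus-native-apps.cfg

# Any small zstash operation that touches Globus should then trigger the
# consent flow; the endpoint UUID and path below are placeholders.
zstash ls --hpss=globus://<endpoint-uuid>/path/to/archive
```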
I added the following lines:
That gave:
So, I went to that directory and ran that command. It prompted me twice for auth codes. I then commented out those lines and re-ran the testing script.
From #32 (comment): Line count test passed, checksum test failed (3):
Line count test failed, checksum test failed (5):
From running
Line count test passed, checksum test failed (8):
So, all scripts now pass the line count test. We have 8 checksum failures to debug.
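To make the two test categories concrete, here is a minimal sketch of what each check amounts to, assuming the original and reproduced atmosphere logs are available locally (file names are placeholders, not the checker's real paths):

```bash
# Line count test (sketch): the reproduced log should reach the expected length.
zcat original/atm.log.gz | wc -l
zcat reproduction/atm.log.gz | wc -l

# Checksum test (sketch): a hash over the agreed-upon 10-day window of output
# must match the expected "10 day checksum" recorded in the docs table.
md5sum reproduction/atm_10day_window.txt
```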
Follow-up for the table in #32 (comment):
It appears the files are missing because the script couldn't find the reproduction scripts. And that is because I didn't run
Latest results:
Both line count test and checksum test failed (5):
Line count test passed, checksum test failed (1):
Both line count test and checksum test passed (2):
@golaz Update on reproduction scripts: #32 added 11 reproduction scripts to E3SM Data docs. Based on the above results, we can add 2 more. Still, there are 6 reproduction scripts that are failing:
Then, there are the 52 remaining simulations listed on https://e3sm-project.github.io/e3sm_data_docs/_build/html/v2/reproducing_simulations.html that don't seem to have been included in my original script for generating reproduction scripts:
Re-running
2 hours seemed sufficient for
Re: the remaining simulations. The following 9 simulations should also have reproduction scripts. The others listed above are either extra simulations or ones generated on Cori, which is no longer available.
The test script failed early again due to a Globus issue. It appears to have been fixed with:
Summary of remaining reproduction scripts to add:
From #34 (comment):
That's 13 more scripts. Also, as noted in #34 (comment), these 2 are good to add already:
That was apparently sufficient for 0101, but not 0301. Doubling 0301's walltime to 8 hours. 0101, like 0201, now fails only the checksum test.
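For the walltime change, the edit is just to the batch header of the reproduction script; a hedged example, with the script name and the original time value as placeholders:

```bash
# Double the requested walltime from 4 to 8 hours in the SBATCH header.
sed -i 's/#SBATCH --time=04:00:00/#SBATCH --time=08:00:00/' <reproduction_script>.sh
```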
It can be confusing keeping track of all the directories and scripts. For reference:
0301 finished. Again, checksum test failed.
I'm still not seeing why those
Coming back to this after a while:
Which reproduction scripts are ready to be included?
I ran
The output is:
That means these 2 reproduction scripts are ready to be added to https://docs.e3sm.org/e3sm_data_docs/_build/html/v2/reproducing_simulations.html:
That matches the last note in #34 (comment).
Where do those expected checksums come from?
They're listed in the "10 day checksum" column of https://docs.e3sm.org/e3sm_data_docs/_build/html/v2/reproducing_simulations.html. But where do those come from? They're taken from the original simulation pages (specifically the "sanity checks" sections), which are linked from https://acme-climate.atlassian.net/wiki/spaces/ED/pages/2766340117/V2+Simulation+Planning. (The corresponding v3 page is linked on https://docs.e3sm.org/running-e3sm-guide/guide-long-term-archiving/#4-document).
Full summary of remaining scripts
(Update on #34 (comment))
*listed on https://docs.e3sm.org/e3sm_data_docs/_build/html/v2/reproducing_simulations.html
On Perlmutter
*In
Running:
Got
Logged into Globus with NERSC credentials. Authenticated with LCRC credentials for the LCRC endpoint in the file manager. Rerunning
Still got
Got
Got
This is because
Now, we can try to find the expected checksum:
Added
to
It looks like
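A sketch of that step, under the assumption that the 10 day checksum is an md5 hash over the portion of the atmosphere log covering the first 10 simulated days (file names and the line selection are placeholders; the result checker defines the real rule):

```bash
# Decompress the extracted original log, keeping the .gz file, then hash the
# slice covering the first 10 simulated days.
gunzip -k atm.log.<case>.gz
head -n <lines_covering_10_days> atm.log.<case> | md5sum
```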
Note to self -- the checksum may be matching up because the result checker is calculating the checksum on the newly extracted
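In other words, the suspicion is that the checker may be hashing the wrong file; the two commands below (paths hypothetical) would then trivially agree, because both point at data that came out of the archive rather than out of the new run:

```bash
md5sum extracted_from_archive/atm.log.gz   # the original run's log, pulled from HPSS
md5sum reproduction_run/atm.log.gz         # the log the checker should be hashing instead
```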
Why does the checksum match even though the number of days differs?
We actually needed to run
And that hasn't been run yet because
What's the status of all scripts?
*listed as added on the diff for this PR: https://github.com/E3SM-Project/e3sm_data_docs/pull/34/files (this implies
So, we can see that all simulations in set (2) need to have
Sets (2b) and (2c) are effectively one set now that permissions have been opened up. We will probably drop the set in (2a) since it's not archived on HPSS. I just ran
I'm now running
Confirming reproduction script generation
Added:
Modified:
(Recall that
Thus, we've now generated the reproduction scripts for the 8 simulations of sets (2b-c).
Testing new scripts
We now have the actual checksums for these 8 reproduction scripts, to include in
Extracting simulations as follows (as part of the process to find the expected checksums):
So, for the 3
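For the extraction step itself, the pattern is a plain zstash extract limited to the log files needed for the checksum; the working directory and HPSS path below are placeholders:

```bash
# Run from an empty working directory; zstash writes the extracted files there.
mkdir -p /scratch/<user>/extract_<case> && cd /scratch/<user>/extract_<case>
zstash extract --hpss=<hpss_path_to_case_archive> "*atm.log*"
```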
To run the tests with the original scripts, we'll first need to find those. They are located in this repo. In theory, we'd just need to set |
I copied down
But I'm getting
Apparently I have no
I believe setting
Debugging mismatched checksums
I thought perhaps the
However, I ran
So, we still don't have a clear answer on why the checksums aren't matching up.
Generating expected checksums
From
Now,
This is our newly generated expected checksum. Now, we can check against the reproduction script's checksum. We add this line:
to
That shows:
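The comparison itself then comes down to two md5 strings; a minimal sketch with placeholder file names:

```bash
# Compare the newly generated expected checksum against the reproduction's.
expected=$(md5sum original_10day_slice.txt | awk '{print $1}')
actual=$(md5sum reproduction_10day_slice.txt | awk '{print $1}')
if [ "$expected" = "$actual" ]; then echo "Checksum test passed"; else echo "Checksum test failed"; fi
```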
I wrote #39 to explain, in detail, the complicated process of producing reproduction scripts. Using the steps there, I'm able to produce a clearer picture of the status of the remaining reproduction scripts.
Milestones:
Milestone (5) is independent of (1-4). All 5 must be completed to get to the point where we can check (6).
Note that initial conditions (milestone 3) might not actually be necessary if a simulation used another simulation's initial conditions. I now notice on https://docs.e3sm.org/e3sm_data_docs/_build/html/v2/reproducing_simulations.html that
The
Failing the tests -- milestone (6)
4 are failing the checksum test (and passing the line count test). It is very unclear why this is the case.
1 is failing both the checksum test and the line count test. It seems this is because
Need to find an expected checksum -- milestone (5)
4 need to have an expected checksum calculated by running the original script's test from scratch over a 10-day period.
This will require running the original run script. It's possible we can get away without fetching the code again, but if we need to (e.g., a different branch is used), that would mean budgeting an hour per script for that. Need
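For the cases where the code does need to be fetched again, most of the hour-per-script budget is the clone plus submodules; a hedged sketch, with the branch or tag taken from the original run script:

```bash
# --recursive pulls E3SM's submodules, which is most of the clone time.
git clone --recursive -b <branch-or-tag> https://github.com/E3SM-Project/E3SM.git E3SM_<branch-or-tag>
```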
I tried running
Relatedly, it appears my simultaneous running of the modified original run script of
The re-running of
I originally couldn't compute the expected checksum for
Running
All the newer scripts pass the tests. At this point, we can add the following 7 reproduction scripts officially:
These 4 reproduction scripts need to be debugged to get the checksum tests passing:
It may be good at this point to merge in the 7 working reproduction scripts (and also update the table on the website) and address the remaining 4 in another pull request.
Going to merge the 7 working reproduction scripts.
Add remaining reproduction scripts. Follow-up to #32, #33. Resolves #23.