[Bug]: No jobs completing successfully on zppy 2.4.0rc1 (#561)
Comments
Example errors:
I'm essentially certain the size was in gigabytes before... I think the data did indeed get scrubbed.
@xylar @chengzhuzhang This needs to be resolved before we can release. @chengzhuzhang I suppose this presents a good opportunity to convert the tests over to testing on v3 data rather than the older v2 simulation data. That said, it would probably be quicker to restore the v2 data to Chrysalis by transferring it from the HPSS archive.
@forsyth2, it is not going to work to store testing data in your own scratch space. I store very small test runs under diagnostics/mpas_analysis and they get synced to other machines. Many GB is likely too much, though. We would need to arrange with @rljacob a space that doesn't get scrubbed but also doesn't get synced for larger testing data.
@xylar Oh, I see. Well, actually, syncing would be quite useful because we want to run the tests on Chrysalis, Compy, and Perlmutter.
@forsyth2, how much data are we talking about?
I had the same question. Right now we can just Globus-transfer one copy over from NERSC disk. In the future, we may need to keep the testing data somewhere that can be exempted from scrubbing.
@xylar So the data is also on Perlmutter for testing there:
So, 24 terabytes.
So, not the gigabyte range, but certainly more than the 17M left post-scrubbing on LCRC.
That is about 20 times the size of the diagnostics folder, so unfortunately that's not acceptable. I think you both will need to come up with another space to use and figure out how to handle scrubbing.
@forsyth2, can you find a smaller dataset to use for testing or use a smaller subset of the
@xylar I think it is less critical to support these testing data in
Oh I got the size ordering (terabyte > gigabyte) mixed up. Yes, terabyte is quite large, I see.
Yes, I suppose this is feasible, since we only test on ~10 years of data. It just seemed complicated to try to delete the data associated with the remainder of the time period. I figured we had the space, so why not just have it all? But I see now, we do not in fact have the space.
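As a hypothetical sketch of how a ~10-year subset could be selected before deleting the rest: the filename pattern and year range below are assumptions for illustration, not something specified in this thread.

```python
import re

def select_years(filenames, start, end):
    """Keep only files whose 4-digit model year falls in [start, end]."""
    kept = []
    for name in filenames:
        # Assumed naming convention, e.g. <case>.eam.h0.1850-01.nc
        m = re.search(r"\.(\d{4})-\d{2}", name)
        if m and start <= int(m.group(1)) <= end:
            kept.append(name)
    return kept

files = [
    "v2.LR.historical_0201.eam.h0.1850-01.nc",
    "v2.LR.historical_0201.eam.h0.1859-12.nc",
    "v2.LR.historical_0201.eam.h0.1900-06.nc",
]
# Keep only the first decade for the reduced test dataset.
print(select_years(files, 1850, 1859))
```

Anything not returned by the filter would be a candidate for deletion (or for staying only on HPSS), shrinking the on-disk footprint to roughly the tested decade.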
24 TB is quite a lot of space, so, no, I don't think we have that much space to spare in general.
Ok, I'll get through testing of this RC and then work on reducing the test data size.
Ok, I got the data transferred back to Chrysalis (6 hours on Globus) and ran the tests successfully. I'll plan to reduce the testing data size when we do v3 (#552). We shouldn't be testing on v2 for too much longer anyway.
What happened?
No jobs are completing successfully on zppy 2.4.0rc1. My best guess is that simulation input data was accidentally removed in the most recent LCRC scrubbing (March 6, targeting data 2 years or older).

After I got extra variables to plot for global time series (#400), I wanted to make sure the original plots still worked. So, I made a cfg with the original simulation input and var list. I encountered a "Missing input files" error on the ts task. I was thinking it was related to the nc files appearing in quotes for some reason (see #400 (comment)). But then I looked at the MPAS error, which was FileNotFoundError: [Errno 2] No such file or directory: '/lcrc/group/e3sm/ac.forsyth2//E3SMv2/v2.LR.historical_0201/run/mpaso_in'. That seemed odd, since I had changed nothing whatsoever with MPAS Analysis. I then checked out a branch identical to main and ran the complete_run test. Nothing passed.

What machine were you running on?
Chrysalis
Environment
Dev environment: zppy_dev_20240322
What command did you run?
Copy your cfg file
What jobs are failing?
What stack trace are you encountering?
No response
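Failures like the mpaso_in FileNotFoundError above could be surfaced before any jobs are submitted with a pre-flight existence check. This is a hypothetical sketch, not part of zppy; the helper name and the required-path list are illustrative, with the real list coming from the cfg.

```python
from pathlib import Path

def missing_inputs(paths):
    """Return the subset of required input paths that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]

# Illustrative required inputs; in practice these would be derived from the cfg.
required = [
    "/lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201/run/mpaso_in",
]

missing = missing_inputs(required)
if missing:
    print(f"Refusing to submit jobs; {len(missing)} missing input(s):")
    for p in missing:
        print(f"  {p}")
```

Running such a check first would distinguish "the data was scrubbed" from a genuine regression in the release candidate, instead of every task failing partway through.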