Fix shared library error in tc_analysis #180

Closed
forsyth2 opened this issue Jan 14, 2022 · 18 comments · Fixed by #631
Assignees
Labels
Non-reproducible bug Bug that can't be reproduced consistently

Comments

@forsyth2
Collaborator

tc_analysis fails with GenerateConnectivityFile: error while loading shared libraries: libnetcdf.so.11: cannot open shared object file: No such file or directory when multiple year sets are run simultaneously.

@forsyth2 forsyth2 added the semver: bug Bug fix (will increment patch version) label Jan 14, 2022
@forsyth2 forsyth2 self-assigned this Jan 14, 2022
@forsyth2
Collaborator Author

There appears to be a related concurrency/parallelism issue with E3SM Diags using TC Analysis: socket.gaierror: [Errno -2] Name or service not known [...] During handling of the above exception, another exception occurred: [...] urllib.error.URLError: <urlopen error [Errno -2] Name or service not known>

@forsyth2
Collaborator Author

I got the same error (GenerateConnectivityFile: error while loading shared libraries: libnetcdf.so.11: cannot open shared object file: No such file or directory) while running integration tests for #187, which includes the code from #169 that made tc_analysis sequential. That implies the parallel runs weren't actually causing the problem.

@forsyth2
Collaborator Author

Also note that if you set

[tc_analysis]
active = True
years = "1:20:20", "1:50:50",

and tc_analysis_0001-0020.status doesn't exist but tc_analysis_0001-0050.status does, then zppy -c <config> will only run tc_analysis_0001-0020. If you then delete tc_analysis_0001-0050.status and then rerun zppy -c <config>, then tc_analysis_0001-0050 will run in parallel with tc_analysis_0001-0020 (the former will not depend on the latter finishing).
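The behavior described above can be illustrated with a minimal sketch. This is not zppy's actual code: the function name `tasks_to_submit`, the temporary directory, and the status file contents (`WAITING`, `RUNNING`) are placeholders chosen for illustration. The only assumption taken from the comment is that a task is skipped when its `.status` file already exists.

```python
from pathlib import Path
import tempfile


def tasks_to_submit(scripts_dir, task_names):
    """Return the tasks that have no <task>.status file in scripts_dir.

    Hypothetical helper: it only models "skip a task whose status file exists".
    """
    scripts_dir = Path(scripts_dir)
    return [
        name
        for name in task_names
        if not (scripts_dir / f"{name}.status").exists()
    ]


tasks = ["tc_analysis_0001-0020", "tc_analysis_0001-0050"]

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # Starting point from the comment: only the 0001-0050 status file exists.
    (tmp / "tc_analysis_0001-0050.status").write_text("WAITING\n")
    first_run = tasks_to_submit(tmp, tasks)

    # The first run leaves a status file behind for 0001-0020 (placeholder content).
    (tmp / "tc_analysis_0001-0020.status").write_text("RUNNING\n")

    # Deleting the stale 0001-0050 status file and rerunning submits only that
    # task, so nothing ties it to the still-running 0001-0020 task.
    (tmp / "tc_analysis_0001-0050.status").unlink()
    second_run = tasks_to_submit(tmp, tasks)

print(first_run)
print(second_run)
```

In this model the second invocation never sees tc_analysis_0001-0020 as a task to submit, which is one plausible reading of why no dependency on it is recorded.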

@chengzhuzhang
Collaborator

Hey Ryan, I can also try troubleshooting, can you provide information on how to reproduce the problem?

@forsyth2
Collaborator Author

@chengzhuzhang Thanks! The issue is that the error appears somewhat randomly. My concern is that I have forced tc_analysis to run sequentially when running in parallel may not have been the problem at all.

How you can test/debug:

  • Create config file:
[default]
input = <any E3SM simulation input>
input_subdir = archive/atm/hist                            
output = <output>
case = <case>
www = <www>
partition = debug
ref_start_yr = 1979
ref_final_yr = 2016
environment_commands = "source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh"

[tc_analysis]
active = True
years = "1:20:20" # Can make this `years = "1:20:20", "1:50:50",` to experiment with running multiple tasks.
scratch = "/lcrc/globalscratch/<your scratch dir>"
  • Load the zppy dev environment, or just run source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.6.0rc4_chrysalis.sh
  • zppy -c <config file>
  • See if error occurs in <output>/post/scripts/tc_analysis_0001-0020 log file.
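Checking the log by hand gets tedious over repeated runs, so the last step could be scripted roughly as follows. This is a sketch under assumptions: the helper name `find_shared_library_errors` is made up, the glob pattern simply matches any file whose name starts with the task name, and the `.o12345` log suffix in the demo is hypothetical.

```python
from pathlib import Path
import tempfile

ERROR_MARKER = "error while loading shared libraries"


def find_shared_library_errors(scripts_dir, task="tc_analysis_0001-0020"):
    """Return (file name, line) pairs for log lines containing the loader error."""
    hits = []
    for path in sorted(Path(scripts_dir).glob(f"{task}*")):
        if not path.is_file():
            continue
        for line in path.read_text(errors="replace").splitlines():
            if ERROR_MARKER in line:
                hits.append((path.name, line.strip()))
    return hits


# Demo against a throwaway directory; on a real run, pass <output>/post/scripts.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "tc_analysis_0001-0020.o12345"  # hypothetical log name
    log.write_text(
        "starting\n"
        "GenerateConnectivityFile: error while loading shared libraries: "
        "libnetcdf.so.11: cannot open shared object file: "
        "No such file or directory\n"
    )
    hits = find_shared_library_errors(tmp)

for name, line in hits:
    print(f"{name}: {line}")
```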

If you want the full list of steps I did, here they are:

  • Do pre-integration-tests zppy run on the branch for Fix ylim #187. The code required tc_analysis to run sequentially
  • tc_analysis_0001-0020 failed with GenerateConnectivityFile: error while loading shared libraries: libnetcdf.so.11: cannot open shared object file: No such file or directory (despite not running in parallel!)
  • Due to that failure, tc_analysis_0001-0050 didn't run (nor did either of the subsequent e3sm_diags tasks)
  • I forgot to delete the status files for those jobs (they were waiting for a dependency that failed)
  • I resubmitted the jobs and only tc_analysis_0001-0020 reran.
  • I deleted the status files I forgot about, and reran zppy. tc_analysis_0001-0050 began immediately even though tc_analysis_0001-0020 was still running.
  • Both tc_analysis tasks finished successfully despite running in parallel
  • I'm not sure, but I believe this means the error is random and not actually related to running in parallel. This is not great because it means our best work-around (short of finding/fixing the bug) is to tell users to just re-run if they encounter the error.

@chengzhuzhang
Collaborator

chengzhuzhang commented Jan 28, 2022

I got an error parsing years: ValueError: Error interpreting years 1:10:10" # Can make this years = "1:20:20`..
My bad, I should remove the comment.
But after that I got:

zppy -c tc_analysis.cfg 
Validation results={'default': True, 'tc_analysis': {'qos': True, 'nodes': True, 'walltime': True, 'years': False, 'input_files': True}, 'climo': True, 'ts': True, 'e3sm_diags': True, 'e3sm_diags_vs_model': True, 'amwg': True, 'mpas_analysis': True, 'global_time_series': True}
Traceback (most recent call last):
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.6.0rc4_nompi/bin/zppy", line 10, in <module>
    sys.exit(main())
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.6.0rc4_nompi/lib/python3.9/site-packages/zppy/__main__.py", line 51, in main
    _validate_config(config)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.6.0rc4_nompi/lib/python3.9/site-packages/zppy/__main__.py", line 125, in _validate_config
    raise Exception("Configuration file validation failed")
Exception: Configuration file validation failed
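The `'years': False` entry in the validation results above suggests the `years` value is checked against a strict pattern. The sketch below shows how such a check would reject a value with a trailing comment. It is not zppy's parser: `parse_year_set` is a hypothetical function, and the `begin:end:step` interpretation is an assumption inferred from task names such as tc_analysis_0001-0020.

```python
import re


def parse_year_set(spec):
    """Parse 'begin:end:step' into a list of (first_year, last_year) tuples.

    Assumed format, e.g. "1:20:20" -> [(1, 20)]. Anything that is not exactly
    three colon-separated integers is rejected.
    """
    match = re.fullmatch(r"\s*(\d+):(\d+):(\d+)\s*", spec)
    if match is None:
        raise ValueError(f"Error interpreting years {spec!r}")
    begin, end, step = (int(group) for group in match.groups())
    return [
        (year, year + step - 1)
        for year in range(begin, end + 1, step)
        if year + step - 1 <= end
    ]


print(parse_year_set("1:20:20"))
print(parse_year_set("1985:2004:20"))

# A value that still carries the inline comment does not match the pattern.
try:
    parse_year_set('1:20:20" # Can make this')
except ValueError as err:
    print(err)
```

Under this reading, moving the comment onto its own line (or deleting it) leaves a value the strict pattern accepts.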

@forsyth2
Collaborator Author

@chengzhuzhang Can you point me to your config file?

@chengzhuzhang
Collaborator

Yes, /home/ac.zhang40/test_zppy/tc_analysis.cfg on Chrysalis.

@forsyth2
Collaborator Author

Thanks, it says /home/ac.zhang40/test_zppy/: Permission denied though

@chengzhuzhang
Collaborator

Hey Ryan, after correcting my .cfg file, I tried testing different parameters:
one set of years vs. two sets of years,
and the latest e3sm_unified vs. the latest e3sm_unified release candidate.
But I wasn't able to reproduce the error.
For the integration test, does it write intermediate files to scratch? I recall there were occasional hiccups on Cori when I tried to write intermediate files to CFS. Does this error happen all the time or randomly?

@forsyth2
Collaborator Author

Thanks, I will try a few more runs. My guess is the error occurs randomly... which makes debugging a challenge.

Yes it writes intermediate files on scratch. So, do you think writing on scratch is the problem or is that a different problem?

@forsyth2
Collaborator Author

I have run the following config file in four configurations: each combination of (running with zppy dev environment / running with E3SM Unified 1.6.0rc4 environment) x (one year-set / multiple year-sets). However, I could not replicate the error in any of the 4 configurations.

[default]
input = /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
output = /lcrc/group/e3sm/ac.forsyth2/zppy_development/issue_180.v2.LR.historical_0201
case = v2.LR.historical_0201
www = /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_development/issue180/
partition = debug
environment_commands = "source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.6.0rc4_chrysalis.sh"
input_subdir = archive/atm/hist
ref_start_yr = 1979
ref_final_yr = 2016

[tc_analysis]
active = True
years = "1985:2004:20",
scratch = "/lcrc/globalscratch/ac.forsyth2"

@chengzhuzhang
Collaborator

I think writing to scratch has always worked for me on Cori and Chrysalis. Since the error can't be replicated at the moment, I guess we will have to keep an eye out for now.

@xylar
Contributor

xylar commented Feb 7, 2022

@forsyth2, do you have a way of dumping out all of your environment variables (env) right before the place you're seeing this error? It seems as if your conda environment isn't in the LD_LIBRARY_PATH. This could happen because of something in your .bashrc.
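One way to act on this suggestion is a small diagnostic run just before the failing step. The sketch below is illustrative only: `dump_environment` and `find_library` are hypothetical helper names, and the search covers only `LD_LIBRARY_PATH` entries plus `$CONDA_PREFIX/lib`. The dynamic loader also consults RPATH/RUNPATH and the system cache, so an empty result here does not prove the library is unloadable.

```python
import os
from pathlib import Path


def dump_environment(env=None):
    """Return all environment variables as sorted KEY=VALUE lines."""
    env = os.environ if env is None else env
    return "\n".join(f"{key}={value}" for key, value in sorted(env.items()))


def find_library(name, env=None):
    """List paths where `name` exists in LD_LIBRARY_PATH or $CONDA_PREFIX/lib."""
    env = os.environ if env is None else env
    dirs = [d for d in env.get("LD_LIBRARY_PATH", "").split(":") if d]
    prefix = env.get("CONDA_PREFIX")
    if prefix:
        dirs.append(str(Path(prefix) / "lib"))
    return [str(Path(d) / name) for d in dirs if (Path(d) / name).exists()]


if __name__ == "__main__":
    print(dump_environment())
    print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "<unset>"))
    print("libnetcdf.so.11 found at:", find_library("libnetcdf.so.11"))
```

Comparing this output between a failing run and a successful one would show whether something (for example a `.bashrc`) changes the library search path.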

@forsyth2 forsyth2 added Non-reproducible bug Bug that can't be reproduced consistently and removed semver: bug Bug fix (will increment patch version) labels Jun 20, 2022
@chengzhuzhang
Collaborator

Maybe we should consider closing this issue if it can't be reproduced later?

@forsyth2
Collaborator Author

Ok. I'll run a TC analysis test in parallel to be sure, and if there are no errors, I guess we can close this issue.

@chengzhuzhang
Collaborator

chengzhuzhang commented Oct 15, 2024

Thanks! If I understand correctly, there was a workaround to force tc_analysis tasks and e3sm_diags tasks to run sequentially due to socket.gaierror: [Errno -2] Name or service not known [...] During handling of the above exception, another exception occurred: [...] urllib.error.URLError: <urlopen error [Errno -2] Name or service not known>, which also seems to be an intermittent error. The workaround is causing the issue number 2 here. If socket.gaierror can't be reproduced, then perhaps we can remove the workaround?

@forsyth2
Collaborator Author

Yeah, that seems reasonable. I'm also wondering if your changes in E3SM-Project/e3sm_diags#824 (notably the file name changes) may help as well (i.e., by not directing parallel processes to change the same data).


3 participants