ERS.f19_g16.I1850CLM45.edison_intel.clm-betr fails on edison with threads and intel15 (the default) #1555

Closed
ndkeen opened this issue May 24, 2017 · 9 comments

@ndkeen
Contributor

ndkeen commented May 24, 2017

ERS.f19_g16.I1850CLM45.edison_intel.clm-betr
This test looks new and may have always failed. I assumed it was due to an easy-to-adjust runtime setting. However, I've been trying a few things and thought I should share my experiences.

The error sorta sounds like it might be running out of memory, so I tried increasing and decreasing OMP_STACKSIZE from the current setting of 64M. I tried 32, 64, 128, 256, 512, and 1024M, and all fail.

This test uses 96 MPI tasks with 4 threads each, so on edison it needs 16 nodes. If I run the test forcing 1 thread, it passes. If I try 2 threads, it also fails.
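The node count follows from simple arithmetic, sketched here in Python. The figure of 24 cores per edison node is an assumption that is not stated in this thread, and `nodes_needed` is a hypothetical helper, not part of CIME.

```python
import math

# ASSUMPTION: 24 cores per edison node (not stated in this thread).
EDISON_CORES_PER_NODE = 24


def nodes_needed(mpi_tasks, threads_per_task, cores_per_node):
    """Whole nodes required when every thread gets its own core."""
    return math.ceil(mpi_tasks * threads_per_task / cores_per_node)


# 96 tasks x 4 threads = 384 cores, which fills 16 nodes of 24 cores.
print(nodes_needed(96, 4, EDISON_CORES_PER_NODE))  # 16
# Forcing 1 thread shrinks the same job to 4 nodes.
print(nodes_needed(96, 1, EDISON_CORES_PER_NODE))  # 4
```

Under that assumption, the 1-thread run that passes uses a quarter of the nodes of the 4-thread run that fails.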

The test passes with --compiler=gnu

I also tried the --compiler=intel17 option and I haven't seen it behave differently. There are several other tests with what look like similar grids (all using 16 nodes on edison) that pass.
Whoops -- looks like I thought I was testing with intel17, but it's not actually using intel17 -- still 15.

Looks like specifying --compiler=intel17 isn't working with stand-alone create_test (I tested this with acme_developer)

edison11% create_test ERS.f19_g16.I1850CLM45.edison_intel.clm-betr --force-threads=1 --compiler=intel17 -ttest
No handlers could be found for logger "CIME.utils"
Using project from .cesm_proj: acme
Creating test directory /global/cscratch1/sd/ndk/acme_scratch/edison/n24may19/ERS_PMx1.f19_g16.I1850CLM45.edison_intel.clm-betr.test

If it worked, the edison_intel would instead be edison_intel17

The problem is that I was specifying the compiler by using "_intel" in the test name, which will override my --compiler setting. The better way to try the test with intel17 is to create_test ERS.f19_g16.I1850CLM45.edison_intel17.clm-betr. Trying that now -- and the test passes with intel17.
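The override described above can be illustrated with a rough Python sketch of how a test name of this shape splits into fields. `split_test_name` and `effective_compiler` are hypothetical helpers written for illustration only; they are not the CIME implementation, and the field layout is inferred solely from the test names that appear in this thread.

```python
def split_test_name(name):
    """Split TESTTYPE.GRID.COMPSET.MACHINE_COMPILER[.TESTMODS] into fields.

    Hypothetical helper: the layout is inferred from names in this thread.
    """
    parts = name.split(".")
    if len(parts) < 4:
        raise ValueError(f"unexpected test name: {name!r}")
    # "edison_intel17" -> machine "edison", compiler "intel17"
    machine, _, compiler = parts[3].partition("_")
    return {
        "testtype": parts[0],
        "grid": parts[1],
        "compset": parts[2],
        "machine": machine,
        "compiler": compiler or None,
        "testmods": parts[4] if len(parts) > 4 else None,
    }


def effective_compiler(name, cli_compiler=None):
    """A compiler embedded in the test name overrides the command-line value."""
    return split_test_name(name)["compiler"] or cli_compiler


# The name says "intel", so a command-line "intel17" is ignored.
print(effective_compiler("ERS.f19_g16.I1850CLM45.edison_intel.clm-betr", "intel17"))
# Putting the compiler in the name selects it directly.
print(effective_compiler("ERS.f19_g16.I1850CLM45.edison_intel17.clm-betr"))
```

The first call prints `intel` and the second prints `intel17`, matching the behavior seen with create_test above.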

@ndkeen ndkeen added the Edison label May 24, 2017
@rgknox
Contributor

rgknox commented May 24, 2017

@ndkeen I am trying to get fates to run on multiple threads as well, so I have been following along. If I figure anything out on my side, I will share my experience.

For now, can we verify that ERS.f19_g16.I1850CLM45.edison_intel.clm-betr did in fact pass at some point? If so, we should look at what changed when it stopped passing.

FYI: My problem has something to do with counting the number of total columns with natural vegetation during initialization, using that to allocate IO space, and then the ordering of the column indices for natural vegetation.

@ndkeen
Contributor Author

ndkeen commented May 24, 2017

I looked through past runs of acme_developer and some had passed: the ones where I used intel17. As noted above, I thought I had tried intel17 on this test, but the run did not actually use it. I've got a few jobs in the queue now that will verify whether using intel17 "fixes" the problem.

@rgknox
Contributor

rgknox commented May 24, 2017

ok, got it. Is it possible that the tests that passed didn't use multi-threading by default on those grids (and that this has since changed)?

@ndkeen
Contributor Author

ndkeen commented May 24, 2017

Well it looks like a very recent test.

@ndkeen ndkeen changed the title from "ERS.f19_g16.I1850CLM45.edison_intel.clm-betr fails on edison (one of the tests) with threads" to "ERS.f19_g16.I1850CLM45.edison_intel.clm-betr fails on edison with threads and intel15 (the default)" May 24, 2017
@ndkeen
Contributor Author

ndkeen commented May 24, 2017

Ok, it's passing when I use intel17 compiler.

@rgknox
Contributor

rgknox commented May 24, 2017

did you end up specifying it in the test name, or with a flag?

ERS.f19_g16.I1850CLM45.edison_intel17.clm-betr

or

--compiler=intel17

sounds like the first one right?

@ndkeen
Contributor Author

ndkeen commented May 24, 2017

Yes, the first one. Because this test has a modifier after the compiler name, I guess the --compiler= trick won't work.

jgfouca pushed a commit that referenced this issue Jun 2, 2017
Update testreporter and change hobart queue to medium.
Update testreporter.py to handle compare failures that were being missed.
Remove tagname from the testdb comments that were added to the GENERATE and
BASELINE lines in TestStatus.
Change the default queue on hobart from short to medium to handle tests that were
running a little long.

Test suite: scripts_regression_tests.pr, populated testdb for alpha06m
Test baseline:
Test namelist changes:
Test status: bit for bit,

Fixes #1555

User interface changes?:

Code review:jedwards
@rljacob rljacob assigned bishtgautam and rgknox and unassigned bishtgautam Jun 7, 2017
@bishtgautam
Contributor

This test failure is a known issue with intel15+edison, but the test works fine with intel17.

@ndkeen
Contributor Author

ndkeen commented Jul 27, 2017

After the edison upgrade in July 2017, I no longer see Intel v15 installed. So maybe we can close this as "won't fix"?

@ndkeen ndkeen closed this as completed Aug 16, 2017