-
Notifications
You must be signed in to change notification settings - Fork 577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SEACAS tests failing in ATDM CUDA builds starting 4/26/2018 #2650
Comments
Just curious, but did you get a CDash email about these failures like the one shown below? It looks like Trilinos is currently set up to send email to the address From: CDash [mailto:trilinos-regression@sandia.gov] A submission to CDash for the project Trilinos has failing tests. Details on the submission can be found at Project: Trilinos Tests failing -CDash on testing.sandia.gov |
These tests all seem to be terminating early with the error:
|
Broke EMPIRE on mutrino HSW and KNL. For example for mutrino HSW when reading an exodus meshed decomposed into 8 submains, EMPIRE is fine. But when try to read the same exodus mesh decomposed into 16 domains, get the following errors: Exodus Library Warning/Error: [ex_check_valid_file_id] |
Is there some way to define a native SEACAS test that can show these failures and then fix the failing test? Can the failure being described be demonstrated with a smaller number of MPI ranks? |
Can we revert it until this is all worked out? I've confirmed on my standard RHEL:7 desktop this breaks reading exodus files with more than 9 mpi ranks. No idea on why |
@bartlettroscoe I'm betting it happens in panzer tests as well, I'll run mini-EM and verify |
The legendary @pwxy figured out that if you remove the 0s in the decomposed mesh mesh.16.9 and not mesh.16.09 it now works.. What this on purpose? Does decomp make the right meshes? |
Actually it was the AMAZING GENIUS @bathmatt who figured this out! |
mini-EM works with the mesh decomped. No idea now what is goig on. We use the same mesh reader. |
We should probably revert until can figure out what is happening. All this
works fine in seacas standalone and in sierra so no clue yet why trilinos
having issues. I will be out until next Wednesday so best to revert and
try again later.
Should be no reason why. 0 in proc causes failure. It isn’t an octal issue
I don’t think. Also works in other builds.
I will revert.
.. greg
…On Fri, Apr 27, 2018 at 8:38 AM bathmatt ***@***.***> wrote:
mini-EM works with the mesh decomped. No idea now what is goig on. We use
the same mesh reader.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#2650 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AA2xDpMsKEDGrU1cc9ykf1Yw_rNbz0eoks5tsy1NgaJpZM4TqRCL>
.
|
What library versions of netcdf and hdf5 are being used for these builds?
.. greg
On Fri, Apr 27, 2018 at 8:45 AM Gregory Sjaardema <gsjaardema@gmail.com>
wrote:
… We should probably revert until can figure out what is happening. All
this works fine in seacas standalone and in sierra so no clue yet why
trilinos having issues. I will be out until next Wednesday so best to
revert and try again later.
Should be no reason why. 0 in proc causes failure. It isn’t an octal
issue I don’t think. Also works in other builds.
I will revert.
.. greg
On Fri, Apr 27, 2018 at 8:38 AM bathmatt ***@***.***> wrote:
> mini-EM works with the mesh decomped. No idea now what is goig on. We use
> the same mesh reader.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#2650 (comment)>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/AA2xDpMsKEDGrU1cc9ykf1Yw_rNbz0eoks5tsy1NgaJpZM4TqRCL>
> .
>
|
sems versions, the empire 0# thing is internal to some error checking in EMPIRE, not sure why it this error changes stuff, did something change in |
@bartlettroscoe CUDA tests are failing during application shutdown. My guess is that the SEACAS standalone has older version of Kokkos and the newer Trilinos/Kokkos has different shutdown behavior or requirements... Will try to verify. |
For mutrino: |
@bathmatt There were changes to ex_open_int, but primarily they should have been limited to error checking. There are some issues in some NetCDF versions based on some defines that should be there but are missing which might mess things up. I will try with SEMS and see. I will hold off on reverting unless others request... |
@gsjaardema, if you look at the configure output on CDash at, for example: it shows:
I think thee were installed by the test bed team. If those need upgrade, then we need to contact them. |
There are some potential issues with hdf5-1.10.1 especially when used with an older netcdf in that it can potentially create files that are not readable with older versions of hdf5-1.8.X. The HDF5 group fixed this in hdf5-1.10.2 with special build options I probably need to get more involved in the SEMS discussions to avoid some of this... |
@bartlettroscoe The configure output you are showing seems to also indicate that it isn't using the FindNetcdf.cmake that is in TriBITS? It should be setting some other TPL_Netcdf_* symbols that don't seem to be there. The output I usually see is something like:
The main symbols that I need are |
@bathmatt, no all of the panzer tests and examples fully passed on all of the builds we currently have running as shown in the CDash query: (Ignore the one failure on 'ride' for the build Did this updated Trilinos fail any EMPIRE automated tests? If not, then someone needs to add an automated test to either SEACAS (best), Panzer (okay) or EMPIRE (if nothing else) to cover this use case. @micahahoward, you might want to be aware of this in case this impacts SPARC on your next update of Trilinos. All, Unless some changes in SEACAS are urgent for some Trilinos customer, we can just back out the merge commit from PR #2625 so that people can fix this offline in a non stressful way. |
@gsjaardema, this is using the EMPIRE configure of Trilinos copied from the scripts in the EM-Plasma/BuildScripts/ repo. If we can update the ATDM configuration to better use the FindNetcdf.cmake module (hopefully in an updated TplFindNetcdf.cmake module), then we can use it. But that would require careful testing on every platform before we could push that to the 'develop' branch. Or we would have to make the change, demote all of the ATDM builds going to the "ATDM" CDash Track/Group back down to the "Specialized" CDash Track/Group, and then cross our fingers. This is what I did with the last major upgrade of the ATDM Trilinos configuration changes (when we last sycned with the configuration in the scripts in the EM-Plasma/BuildScripts/ repo which was a while ago now). |
@bartlettroscoe I will add a SEACAS test covering the use case, but I'm not sure what use case is failing currently (other than the CUDA-related ones). Since I won't be able to do much until Wednesday, may be best to back out the merge commit from #2625. It was not urgent for any customers. |
Okay, so unless there is an objection, I am going to back this merge commit out. |
The EMPIRE issue has been resolved with changes in it. You probably strengthen checks in SEACAS that were now triggering. But that issues on EMPIRE are resolved. Now the cuda stuff is a different matter. |
@bartlettroscoe Not sure I understand issue with FindNetcdf.cmake in TriBITs? I thought that all Trilinos builds used the TriBITs code and that we had fully vetted the FindNetcdf issues several months ago? |
@bathmatt I backed out your change from "0" to "EX_API_VERS_NODOT" in the ex_open_int call yesterday when I was debugging, and it didn't help the problem with empire failing to read the exodus files |
@pwxy, @bathmatt: You should not be calling This has been in place for several years, so wasn't particular to this commit. Please go back to |
If you ever think that there was a non-backward-compatible change to exodus, please let me know before changing any code. There are a few deprecated functions, but they are still usable. You should never need to change your application code for a new Exodus version unless I explicitly mention it in an email or release notes or other notification. |
@gsjaardema, There was a non backward compatible change only in the extent that I was doing something wrong and getting away with it and you added error checking that caught it, shame on you, shame on you :) I'd have seen it if I ran periodic meshes on more than 9 ranks with bad values. |
@bathmatt OK, I was just a little worried about references to the |
@gsjaardema, as shown at:
and
the EMPIRE configuration of Trilinos bypasses the FindNetcdf.cmake find module and just directly sets the include dirs and libraries. This mode is allowed to support direct setting and backward compatibility as per: It is possible to update this Trilinos configuration of Trilinos to allow the use As we discussed before, we don't have any specific documentation in the Trilinos build reference for how to configure with this specialized Netcdf setup. Therefore, we can't expect people to know how to use this. |
@gsjaardema I think that user was me. It was the exact ex_open_int() in empire I mentioned above. I was trying to track down the issue with failing to read exodus files with more than 9 MPI, so I was looking at the exodus calls in empire. |
@bartlettroscoe RE: FindNetcdf.cmake. OK, I understand. We will probably need to modify the environment.sh and ATDMDevEnvSettings.cmake to add manual definitions of some of the symbols set in FindNetcdf.cmake in order to make sure the builds are consistent. Alternatively, I will see if I can determine the settings down in the SEACAS CMake-related code at configure time which may be more robust than relying on manual settings... |
@pwxy I was ptlin, but may have been related to the same issue since the symptoms seemed similar... |
@gsjaardema, let me know if you have any difficulty reproducing any of these ATDM builds. I tried to make it as easy as I could think to make it. Just source a single script with the build name that you want and run raw cmake passingin a single |
@gsjaardema Well, that ptlin guy is a moron! |
…b_snapshot" This reverts commit 1b19c57, reversing changes made to aa0c96b. There are some issues with this update that is documented in trilinos#2650. This reverts the updates in PR trilinos#2650. Reverting these changse will allow the issues to be fixed offline in a non urgent way.
…b_snapshot" This reverts commit 1b19c57, reversing changes made to aa0c96b. There are some issues with this update that are documented in trilinos#2650. This reverts the updates pulled in from PR trilinos#2625. Reverting these changes will allow the issues to be fixed offline in a non urgent way.
@gsjaardema, I created the PR #2653 to revert this merge commit from #2625. Can you please approve it? |
@gsjaardema approved the PR #2653 that reverted this SEACAS update and it passed testing and I merged it just now. Therefore, we should see these CUDA failures go away and if EMPIRE updates its Trilinos I will leave this issue open to track efforts to address these issues offline. Note that to help address this, you can build Trilinos with the version of SEACAS from the independent SEACAS git repo. You just clone (or symlink) @gsjaardema,'s SEACAS git repo under the main Trilinos git repo like:
Then when you configure Trilinos, add the cache var:
NOTE: This is exactly how SPARC builds Trilinos with SEACAS currently. That might make it easier to iteratively debug and fix the issues. |
Thanks, the EMPIRE issue is resolved minus the cuda issue. |
The merge was backed out on 4/27/2018. Should we keep this issue open still or should we close it. @gsjaardema, steps to reproduce with the CUDA build are given at the top of this issue. Note that you can work with your native SEACAS repo by cloning or symlinking the
If that does not work, please let me know. |
This was resolved by backing out the merge commit of the latest SEACAS snapshot over 2 weeks ago. If a failure occurs on the next SEACAS snapshot, then we will open an new Issue for that. |
This should be able to be closed. The seacas source code in Trilinos is up-to-date with both SEACAS github and SEACAS/Sierra and all tests are passing and there have been no reports of issues from other projects. |
@gsjaardema, I closed this issue back on 5/21/2018 as noted above. I figured that if there were any new issues on a new snapshot of SEACAS, then we would open new issues. Just to verify, looking at the SEACASIoss_exodus32_XXX tests running in the CUDA builds on hansen builds yesterday, we can see that the tests:
reported failing above are passing in all of the CUDA builds. NOTE: We currently don't have working CUDA builds on 'white' and 'ride' due to a system upgrade (see |
CC: @trilinos/seacas, @gsjaardema, @fryeguy52
Next Action Status
Updated SEACAS is also causing mesh reading problems on non-CUDA builds for larger numbers of MPI ranks. PR #2653 was merged on 4/27/2018. which reverts PR #2625 updating SEACAS. New issue will be opened if next SEACAS snapshot cause an error.
Description
As shown in the query:
The SEACAS tests:
SEACASIoss_exodus32_to_exodus32
SEACASIoss_exodus32_to_exodus32_pnetcd
SEACASIoss_exodus32_to_exodus64
are failling in all of the current ATDM Trilinos CUDA builds:
Trilinos-atdm-hansen-shiller-cuda-debug
Trilinos-atdm-hansen-shiller-cuda-opt
Trilinos-atdm-white-ride-cuda-debug
Trilinos-atdm-white-ride-cuda-opt
This was likely due to the update of SEACAS into Trilinos in the commit 89d48ad merged in the PR #2625 .
Steps to Reproduce
One should be able to reproduce these failing tests on the machines
white
(SON),ride
(SRN),hansen
(SON), orshiller
(SRN) as described in:For example, on
white
one should be able to reproduce these failing tests with:The text was updated successfully, but these errors were encountered: