SEACAS tests failing in ATDM CUDA builds starting 4/26/2018 #2650

bartlettroscoe · 2018-04-27T12:13:04Z

CC: @trilinos/seacas, @gsjaardema, @fryeguy52

Next Action Status

Updated SEACAS is also causing mesh reading problems on non-CUDA builds for larger numbers of MPI ranks. PR #2653, which reverts PR #2625 (the SEACAS update), was merged on 4/27/2018. A new issue will be opened if the next SEACAS snapshot causes an error.

Description

As shown in the query:

https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-04-26&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=4&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=status&compare2=62&value2=passed&field3=status&compare3=62&value3=notrun&field4=testname&compare4=65&value4=SEACAS

The SEACAS tests:

SEACASIoss_exodus32_to_exodus32
SEACASIoss_exodus32_to_exodus32_pnetcdf
SEACASIoss_exodus32_to_exodus64

are failing in all of the current ATDM Trilinos CUDA builds:

Trilinos-atdm-hansen-shiller-cuda-debug
Trilinos-atdm-hansen-shiller-cuda-opt
Trilinos-atdm-white-ride-cuda-debug
Trilinos-atdm-white-ride-cuda-opt

This was likely due to the update of SEACAS into Trilinos in commit 89d48ad, merged in PR #2625.

Steps to Reproduce

One should be able to reproduce these failing tests on the machines white (SON), ride (SRN), hansen (SON), or shiller (SRN) as described in:

https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

For example, on white one should be able to reproduce these failing tests with:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_SEACAS=ON \
  $TRILINOS_DIR

$ make NP=16

$ bsub -x -Is -q rhel7F -n 16 ctest -j16

bartlettroscoe · 2018-04-27T12:40:29Z

@gsjaardema,

Just curious, but did you get a CDash email about these failures like the one shown below? It looks like Trilinos is currently set up to send email to the address seacas-regression at software.sandia.gov. It looks like that mailman list exists. Is it set up to send you emails?

From: CDash [mailto:trilinos-regression@sandia.gov]
Sent: Thursday, April 26, 2018 11:23 AM
To: Bartlett, Roscoe A
Subject: FAILED (t=3): Trilinos/SEACAS - Trilinos-atdm-hansen-shiller-cuda-debug - ATDM

A submission to CDash for the project Trilinos has failing tests.
You have been identified as one of the authors who have checked in changes
that are part of this submission or you are listed in the default contact list.

Details on the submission can be found at
https://testing.sandia.gov/cdash/buildSummary.php?buildid=3530258

Project: Trilinos
SubProject: SEACAS
Site: hansen
Build Name: Trilinos-atdm-hansen-shiller-cuda-debug
Build Time: 2018-04-26T15:19:09 UTC
Type: ATDM
Tests failing: 3

Tests failing
SEACASIoss_exodus32_to_exodus32_pnetcdf
(https://testing.sandia.gov/cdash/testDetails.php?test=47122041&build=3530258)
SEACASIoss_exodus32_to_exodus32
(https://testing.sandia.gov/cdash/testDetails.php?test=47122042&build=3530258)
SEACASIoss_exodus32_to_exodus64
(https://testing.sandia.gov/cdash/testDetails.php?test=47122043&build=3530258)

-CDash on testing.sandia.gov

bartlettroscoe · 2018-04-27T13:15:08Z

These tests all seem to be terminating early with the error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorCudartUnloading): driver shutting down /home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:119

rppawlo · 2018-04-27T13:56:09Z

It's more than just cuda. @pwxy @bathmatt are seeing failures on other machines as well when you go to larger MPI process counts.

pwxy · 2018-04-27T14:01:11Z

Broke EMPIRE on mutrino HSW and KNL. For example, on mutrino HSW, when reading an Exodus mesh decomposed into 8 subdomains, EMPIRE is fine. But when trying to read the same Exodus mesh decomposed into 16 subdomains, we get the following errors:

Exodus Library Warning/Error: [ex_check_valid_file_id]
ERROR: In "ex_inquire_internal", the file id -1 was not obtained via a call to "ex_open" or "ex_create".
It does not refer to a valid open exodus file.
Aborting to avoid file corruption or data loss or other potential problems.

bartlettroscoe · 2018-04-27T14:11:17Z

Is there some way to define a native SEACAS test that can show these failures and then fix the failing test? Can the failure being described be demonstrated with a smaller number of MPI ranks?

bathmatt · 2018-04-27T14:12:13Z

Can we revert it until this is all worked out? I've confirmed on my standard RHEL 7 desktop that this breaks reading Exodus files with more than 9 MPI ranks. No idea why.

bathmatt · 2018-04-27T14:26:52Z

@bartlettroscoe I'm betting it happens in panzer tests as well, I'll run mini-EM and verify

bathmatt · 2018-04-27T14:28:11Z

The legendary @pwxy figured out that if you remove the leading 0s in the decomposed mesh file names (mesh.16.9 rather than mesh.16.09), it now works. Was this on purpose? Does decomp make the right meshes?
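
For reference, the "more than 9 ranks" threshold lines up with how per-rank file names are usually padded. Below is a minimal Python sketch, assuming the common decomposition naming convention `<base>.<nproc>.<rank>` with the rank zero-padded to the number of digits in `<nproc>`. The function name `decomposed_name` is hypothetical and is not a SEACAS call.

```python
def decomposed_name(base: str, nproc: int, rank: int, pad: bool = True) -> str:
    """Build a per-rank mesh file name of the form <base>.<nproc>.<rank>.

    Assumption: the rank is zero-padded to the digit count of nproc.
    This only illustrates the naming convention; it is not a SEACAS API.
    """
    if not 0 <= rank < nproc:
        raise ValueError(f"rank {rank} out of range for {nproc} ranks")
    width = len(str(nproc)) if pad else 1
    return f"{base}.{nproc}.{rank:0{width}d}"


if __name__ == "__main__":
    # With 8 ranks the width is 1, so padded and unpadded names agree.
    print(decomposed_name("mesh", 8, 7), decomposed_name("mesh", 8, 7, pad=False))
    # With 16 ranks the width is 2, so ranks 0-9 differ between the two schemes.
    print(decomposed_name("mesh", 16, 9), decomposed_name("mesh", 16, 9, pad=False))
```

Under this assumption, code that builds names one way while the files on disk use the other would only start failing once the rank count reaches 10.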

pwxy · 2018-04-27T14:30:11Z

Actually it was the AMAZING GENIUS @bathmatt who figured this out!

bathmatt · 2018-04-27T14:38:05Z

mini-EM works with the mesh decomped. No idea now what is going on. We use the same mesh reader.

gsjaardema · 2018-04-27T14:45:42Z

We should probably revert until we can figure out what is happening. All this works fine in SEACAS standalone and in Sierra, so no clue yet why Trilinos is having issues. I will be out until next Wednesday, so best to revert and try again later. There should be no reason why a 0 in the processor number causes a failure. I don't think it is an octal issue. It also works in other builds. I will revert. .. greg

gsjaardema · 2018-04-27T14:47:07Z

What library versions of netcdf and hdf5 are being used for these builds? .. greg

bathmatt · 2018-04-27T14:49:02Z

SEMS versions. The EMPIRE 0# thing is internal to some error checking in EMPIRE; not sure why this error changes things. Did something change in
int file_id = ex_open_int(file.c_str(), mode, &comp_ws, &io_ws, &version, EX_API_VERS_NODOT);
I can work around that issue, but the cuda one I'm not sure... Don't revert for the EMPIRE issue, I will fix it on our end.

gsjaardema · 2018-04-27T14:49:17Z

@bartlettroscoe CUDA tests are failing during application shutdown. My guess is that SEACAS standalone has an older version of Kokkos and the newer Trilinos/Kokkos has different shutdown behavior or requirements... Will try to verify.

pwxy · 2018-04-27T14:51:37Z

For mutrino:
hdf5 1.10.1
netcdf 4.4.1.1

gsjaardema · 2018-04-27T14:52:10Z

@bathmatt There were changes to ex_open_int, but primarily they should have been limited to error checking. There are some issues in some NetCDF versions based on some defines that should be there but are missing which might mess things up. I will try with SEMS and see.

I will hold off on reverting unless others request...

bartlettroscoe · 2018-04-27T14:54:11Z

What library versions of netcdf and hdf5 are being used for these builds?

@gsjaardema, if you look at the configure output on CDash at, for example:

https://testing-vm.sandia.gov/cdash/viewConfigure.php?buildid=3464194

it shows:

Processing enabled TPL: HDF5 (enabled explicitly, disable with -DTPL_ENABLE_HDF5=OFF)
-- HDF5_LIBRARY_NAMES='hdf5;z;hdf5_hl'
-- TPL_HDF5_LIBRARIES='-L/home/projects/x86-64-haswell-nvidia/hdf5/1.10.1/openmpi/2.1.1/gcc/4.9.3/cuda/8.0.61/lib;-lhdf5_hl;-lhdf5;-lz;-ldl'
-- TPL_HDF5_INCLUDE_DIRS='/home/projects/x86-64-haswell-nvidia/hdf5/1.10.1/openmpi/2.1.1/gcc/4.9.3/cuda/8.0.61/include'
Processing enabled TPL: Netcdf (enabled explicitly, disable with -DTPL_ENABLE_Netcdf=OFF)
-- Netcdf_LIBRARY_NAMES='netcdf'
-- TPL_Netcdf_LIBRARIES='-L/home/projects/x86-64/boost/1.55.0/lib;-L/home/projects/x86-64-haswell-nvidia/netcdf-exo/4.4.1.1/openmpi/2.1.1/gcc/4.9.3/cuda/8.0.61/lib;-L/home/projects/x86-64-haswell-nvidia/netcdf-exo/4.4.1.1/openmpi/2.1.1/gcc/4.9.3/cuda/8.0.61/lib;-L/home/projects/x86-64-haswell-nvidia/pnetcdf-exo/1.8.1/openmpi/2.1.1/gcc/4.9.3/cuda/8.0.61/lib;/home/projects/x86-64/boost/1.55.0/lib/libboost_program_options.a;/home/projects/x86-64/boost/1.55.0/lib/libboost_system.a;/home/projects/x86-64-haswell-nvidia/netcdf-exo/4.4.1.1/openmpi/2.1.1/gcc/4.9.3/cuda/8.0.61/lib/libnetcdf.a;/home/projects/x86-64-haswell-nvidia/pnetcdf-exo/1.8.1/openmpi/2.1.1/gcc/4.9.3/cuda/8.0.61/lib/libpnetcdf.a;-L/home/projects/x86-64-haswell-nvidia/hdf5/1.10.1/openmpi/2.1.1/gcc/4.9.3/cuda/8.0.61/lib;-lhdf5_hl;-lhdf5;-lz;-ldl'
-- TPL_Netcdf_INCLUDE_DIRS='/home/projects/x86-64-haswell-nvidia/netcdf-exo/4.4.1.1/openmpi/2.1.1/gcc/4.9.3/cuda/8.0.61/include'
Processing enabled TPL: BoostLib (enabled explicitly, disable with -DTPL_ENABLE_BoostLib=OFF)

I think these were installed by the test bed team. If those need an upgrade, then we need to contact them.

gsjaardema · 2018-04-27T14:56:49Z

There are some potential issues with hdf5-1.10.1, especially when used with an older netcdf, in that it can potentially create files that are not readable with older versions of hdf5-1.8.X. The HDF5 group fixed this in hdf5-1.10.2 with the special build option --with-default-api-version=v18, and we added a patch to NetCDF-4.6.2-devel to select v1.8 compatibility. Issues don't always appear, and if using consistent library versions it should be OK.

I probably need to get more involved in the SEMS discussions to avoid some of this...

gsjaardema · 2018-04-27T15:02:11Z

@bartlettroscoe The configure output you are showing seems to also indicate that it isn't using the FindNetcdf.cmake that is in TriBITS? It should be setting some other TPL_Netcdf_* symbols that don't seem to be there. The output I usually see is something like:

-- Found NetCDF: /Users/gdsjaar/src/seacas-parallel/lib/libnetcdf.dylib;/Users/gdsjaar/src/seacas-parallel/lib/libhdf5_hl.dylib;/Users/gdsjaar/src/seacas-parallel/lib/libhdf5.dylib;/usr/lib/libz.dylib;/usr/lib/libdl.dylib;/usr/lib/libm.dylib;/Users/gdsjaar/src/seacas-parallel/lib/libpnetcdf.a
-- NetCDF Version: netCDF 4.6.2-development
--      NetCDF_NEEDS_HDF5        = True
--      NetCDF_NEEDS_PNetCDF     = True
--      NetCDF_PARALLEL          = True
--      NetCDF_INCLUDE_DIRS      = /Users/gdsjaar/src/seacas-parallel/include;/Users/gdsjaar/src/seacas-parallel/include;/Users/gdsjaar/src/seacas-parallel/include
--      NetCDF_LIBRARIES         = /Users/gdsjaar/src/seacas-parallel/lib/libnetcdf.dylib;/Users/gdsjaar/src/seacas-parallel/lib/libhdf5_hl.dylib;/Users/gdsjaar/src/seacas-parallel/lib/libhdf5.dylib;/usr/lib/libz.dylib;/usr/lib/libdl.dylib;/usr/lib/libm.dylib;/Users/gdsjaar/src/seacas-parallel/lib/libpnetcdf.a
--      NetCDF_BINARIES          = ncdump;ncgen;nccopy
-- Netcdf_LIBRARY_NAMES='netcdf'
-- TPL_Netcdf_LIBRARIES='/Users/gdsjaar/src/seacas-parallel/lib/libnetcdf.dylib;/Users/gdsjaar/src/seacas-parallel/lib/libhdf5_hl.dylib;/Users/gdsjaar/src/seacas-parallel/lib/libhdf5.dylib;/usr/lib/libz.dylib;/usr/lib/libdl.dylib;/usr/lib/libm.dylib;/Users/gdsjaar/src/seacas-parallel/lib/libpnetcdf.a'
-- TPL_Netcdf_INCLUDE_DIRS='/Users/gdsjaar/src/seacas-parallel/include;/Users/gdsjaar/src/seacas-parallel/include;/Users/gdsjaar/src/seacas-parallel/include'
Processing enabled TPL: CGNS (enabled explicitly, disable with -DTPL_ENABLE_CGNS=OFF)

The main symbols that I need are NetCDF_NEEDS_HDF5, NetCDF_PARALLEL, and NetCDF_NEEDS_PNetCDF = True | False
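
One way to check whether a given configure log went through the find module is to look for those three symbols. Below is a minimal Python sketch, assuming the log lines have the `--   NAME = Value` shape shown in the sample output above. The helper names `netcdf_symbols` and `missing_symbols` are hypothetical.

```python
import re

# Symbols named in the comment above; the log format is assumed from the sample output.
REQUIRED = ("NetCDF_NEEDS_HDF5", "NetCDF_NEEDS_PNetCDF", "NetCDF_PARALLEL")
LINE_RE = re.compile(r"^--\s+(NetCDF_\w+)\s*=\s*(.*?)\s*$")


def netcdf_symbols(configure_log: str) -> dict:
    """Return the NetCDF_* settings found in CMake configure output."""
    found = {}
    for line in configure_log.splitlines():
        match = LINE_RE.match(line)
        if match:
            found[match.group(1)] = match.group(2)
    return found


def missing_symbols(configure_log: str) -> list:
    """List the required symbols that the configure output never reported."""
    found = netcdf_symbols(configure_log)
    return [name for name in REQUIRED if name not in found]


if __name__ == "__main__":
    sample = (
        "-- NetCDF Version: netCDF 4.6.2-development\n"
        "--      NetCDF_NEEDS_HDF5        = True\n"
        "--      NetCDF_NEEDS_PNetCDF     = True\n"
        "--      NetCDF_PARALLEL          = True\n"
        "-- Netcdf_LIBRARY_NAMES='netcdf'\n"
    )
    print(netcdf_symbols(sample))
    # A log that only sets TPL_Netcdf_* directly reports none of the symbols.
    print(missing_symbols("-- Netcdf_LIBRARY_NAMES='netcdf'\n"))
```

A non-empty result from `missing_symbols` would suggest the libraries and include dirs were set directly rather than discovered.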

bartlettroscoe · 2018-04-27T15:04:22Z

@bartlettroscoe I'm betting it happens in panzer tests as well, I'll run mini-EM and verify

@bathmatt, no, all of the Panzer tests and examples fully passed on all of the builds we currently have running, as shown in the CDash query:

https://testing-vm.sandia.gov/cdash/index.php?project=Trilinos&date=2018-04-26&filtercombine=and&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=63&value1=-atdm-&field2=subprojects&compare2=93&value2=Panzer

(Ignore the one failure on 'ride' for the build Trilinos-atdm-white-ride-gnu-opt-openmp. We have seen tests randomly fail on 'ride' that pass just fine on the identical machine 'white'. That is why this build on 'ride' was demoted to the "Specialized" CDash Track/Group. See #2511.)

Did this updated Trilinos fail any EMPIRE automated tests? If not, then someone needs to add an automated test to either SEACAS (best), Panzer (okay) or EMPIRE (if nothing else) to cover this use case.

@micahahoward, you might want to be aware of this in case this impacts SPARC on your next update of Trilinos.

All,

Unless some changes in SEACAS are urgent for some Trilinos customer, we can just back out the merge commit from PR #2625 so that people can fix this offline in a non stressful way.

bartlettroscoe · 2018-04-27T15:08:21Z

The configure output you are showing seems to also indicate that it isn't using the FindNetcdf.cmake that is in TriBITS?

@gsjaardema, this is using the EMPIRE configure of Trilinos copied from the scripts in the EM-Plasma/BuildScripts/ repo. If we can update the ATDM configuration to better use the FindNetcdf.cmake module (hopefully in an updated TplFindNetcdf.cmake module), then we can use it. But that would require careful testing on every platform before we could push that to the 'develop' branch. Or we would have to make the change, demote all of the ATDM builds going to the "ATDM" CDash Track/Group back down to the "Specialized" CDash Track/Group, and then cross our fingers. This is what I did with the last major upgrade of the ATDM Trilinos configuration (when we last synced with the configuration in the scripts in the EM-Plasma/BuildScripts/ repo, which was a while ago now).

gsjaardema · 2018-04-27T15:08:22Z

@bartlettroscoe I will add a SEACAS test covering the use case, but I'm not sure what use case is failing currently (other than the CUDA-related ones).

Since I won't be able to do much until Wednesday, may be best to back out the merge commit from #2625. It was not urgent for any customers.

bartlettroscoe · 2018-04-27T15:10:31Z

@gsjaardema,

Since I won't be able to do much until Wednesday, may be best to back out the merge commit from #2625. It was not urgent for any customers.

Okay, so unless there is an objection, I am going to back this merge commit out.

bathmatt · 2018-04-27T15:11:13Z

The EMPIRE issue has been resolved with changes in EMPIRE. You probably strengthened checks in SEACAS that are now triggering. But the issues on the EMPIRE side are resolved. The CUDA stuff is a different matter.

gsjaardema · 2018-04-27T15:12:07Z

@bartlettroscoe Not sure I understand issue with FindNetcdf.cmake in TriBITs? I thought that all Trilinos builds used the TriBITs code and that we had fully vetted the FindNetcdf issues several months ago?

pwxy · 2018-04-27T15:15:37Z

@bathmatt I backed out your change from "0" to "EX_API_VERS_NODOT" in the ex_open_int call yesterday when I was debugging, and it didn't help the problem with empire failing to read the exodus files
int file_id = ex_open_int(file.c_str(), mode, &comp_ws, &io_ws, &version, EX_API_VERS_NODOT);

gsjaardema · 2018-04-27T15:20:30Z

@pwxy, @bathmatt: You should not be calling ex_open_int in your application. You should call ex_open as you always have in the past. The ex_open_int is an internal only function that is called by the ex_open wrapper which automatically adds the EX_API_VERS_NODOT to verify that the include file matches the include file used when the library was compiled.

This has been in place for several years, so wasn't particular to this commit.

Please go back to ex_open

gsjaardema · 2018-04-27T15:23:10Z

If you ever think that there was a non-backward-compatible change to exodus, please let me know before changing any code. There are a few deprecated functions, but they are still usable. You should never need to change your application code for a new Exodus version unless I explicitly mention it in an email or release notes or other notification.

bathmatt · 2018-04-27T15:26:12Z

@gsjaardema, there was a non-backward-compatible change only to the extent that I was doing something wrong and getting away with it, and you added error checking that caught it. Shame on you, shame on you :)

I'd have seen it if I ran periodic meshes on more than 9 ranks with bad values.

gsjaardema · 2018-04-27T15:27:59Z

@bathmatt OK, I was just a little worried about references to the ex_open_int function. I need to see if I can hide it better. I had a user yesterday also trying to use it...

bartlettroscoe · 2018-04-27T15:31:28Z

Not sure I understand issue with FindNetcdf.cmake in TriBITs? I thought that all Trilinos builds used the TriBITs code and that we had fully vetted the FindNetcdf issues several months ago?

@gsjaardema, as shown at:

Trilinos/cmake/std/atdm/ride/environment.sh (line 63 in cd6dc17):

export ATDM_CONFIG_NETCDF_LIBS="-L${BOOST_ROOT}/lib;-L${NETCDF_ROOT}/lib;-L${NETCDF_ROOT}/lib;-L${PNETCDF_ROOT}/lib;-L${HDF5_ROOT}/lib;${BOOST_ROOT}/lib/libboost_program_options.a;${BOOST_ROOT}/lib/libboost_system.a;${NETCDF_ROOT}/lib/libnetcdf.a;${PNETCDF_ROOT}/lib/libpnetcdf.a;${HDF5_ROOT}/lib/libhdf5_hl.a;${HDF5_ROOT}/lib/libhdf5.a;-lz;-ldl"

and

Trilinos/cmake/std/atdm/ATDMDevEnvSettings.cmake (line 205 in cd6dc17):

# Netcdf

the EMPIRE configuration of Trilinos bypasses the FindNetcdf.cmake find module and just directly sets the include dirs and libraries. This mode is allowed to support direct setting and backward compatibility as per:

https://tribits.org/doc/TribitsDevelopersGuide.html#how-to-use-find-package-for-a-tribits-tpl

It is possible to update this configuration of Trilinos to allow the use of find_package(NetCDF), but doing that safely will take a lot of testing, including against builds of EMPIRE and manual testing by EMPIRE developers and users (the native Panzer and EMPIRE test suites don't test all functionality from SEACAS that is used by EMPIRE developers and users, see TRIL-171).

As we discussed before, we don't have any specific documentation in the Trilinos build reference for how to configure with this specialized Netcdf setup. Therefore, we can't expect people to know how to use this.

pwxy · 2018-04-27T15:33:13Z

@gsjaardema I think that user was me. It was the exact ex_open_int() call in EMPIRE I mentioned above. I was trying to track down the issue with failing to read Exodus files with more than 9 MPI ranks, so I was looking at the Exodus calls in EMPIRE.

gsjaardema · 2018-04-27T15:35:19Z

@bartlettroscoe RE: FindNetcdf.cmake. OK, I understand. We will probably need to modify the environment.sh and ATDMDevEnvSettings.cmake to add manual definitions of some of the symbols set in FindNetcdf.cmake in order to make sure the builds are consistent.

Alternatively, I will see if I can determine the settings down in the SEACAS CMake-related code at configure time which may be more robust than relying on manual settings...

gsjaardema · 2018-04-27T15:36:53Z

@pwxy It was ptlin, but it may have been related to the same issue since the symptoms seemed similar...

bartlettroscoe · 2018-04-27T15:37:55Z

Alternatively, I will see if I can determine the settings down in the SEACAS CMake-related code at configure time which may be more robust than relying on manual settings...

@gsjaardema, let me know if you have any difficulty reproducing any of these ATDM builds. I tried to make it as easy as I could. Just source a single script with the build name that you want, run raw cmake passing in a single *.cmake file, and enable any package you want.

pwxy · 2018-04-27T15:38:25Z

@gsjaardema Well, that ptlin guy is a moron!

…b_snapshot" This reverts commit 1b19c57, reversing changes made to aa0c96b. There are some issues with this update that are documented in trilinos#2650. This reverts the updates pulled in from PR trilinos#2625. Reverting these changes will allow the issues to be fixed offline in a non urgent way.

bartlettroscoe · 2018-04-27T16:20:26Z

@gsjaardema, I created the PR #2653 to revert this merge commit from #2625. Can you please approve it?

…hot-pr-2625 Revert "Merge pull request #2625 from gsjaardema/seacas_github_snapshot" This is temp fix for some issues with this update that are documented in #2650. The issues can now be addressed offline.

bartlettroscoe · 2018-04-27T21:17:37Z

@gsjaardema approved the PR #2653 that reverted this SEACAS update and it passed testing and I merged it just now. Therefore, we should see these CUDA failures go away and if EMPIRE updates its Trilinos develop branch, the issues should be gone now.

I will leave this issue open to track efforts to address these issues offline.

Note that to help address this, you can build Trilinos with the version of SEACAS from the independent SEACAS git repo. You just clone (or symlink) @gsjaardema's SEACAS git repo under the main Trilinos git repo like:

$ cd Trilinos/
$ git clone git@github.com:gsjaardema/seacas.git

Then when you configure Trilinos, add the cache var:

-D SEACAS_SOURCE_DIR_OVERRIDE:STRING=seacas

NOTE: This is exactly how SPARC builds Trilinos with SEACAS currently.

That might make it easier to iteratively debug and fix the issues.

bathmatt · 2018-04-29T14:44:16Z

Thanks, the EMPIRE issue is resolved minus the cuda issue.

bartlettroscoe · 2018-05-04T03:38:20Z

The merge was backed out on 4/27/2018. Should we keep this issue open, or should we close it?

@gsjaardema, steps to reproduce with the CUDA build are given at the top of this issue. Note that you can work with your native SEACAS repo by cloning or symlinking the seacas repo under Trilinos/ and then configure, build, and test with:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DSEACAS_SOURCE_DIR_OVERRIDE:STRING=seacas \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_SEACAS=ON \
  $TRILINOS_DIR

$ make NP=16

$ bsub -x -Is -q rhel7F -n 16 ctest -j16

If that does not work, please let me know.

bartlettroscoe · 2018-05-22T01:56:49Z

This was resolved by backing out the merge commit of the latest SEACAS snapshot over 2 weeks ago. If a failure occurs on the next SEACAS snapshot, then we will open a new issue for that.

gsjaardema · 2018-08-01T14:10:04Z

This should be able to be closed. The seacas source code in Trilinos is up-to-date with both SEACAS github and SEACAS/Sierra and all tests are passing and there have been no reports of issues from other projects.

bartlettroscoe · 2018-08-01T14:23:48Z

@gsjaardema, I closed this issue back on 5/21/2018 as noted above. I figured that if there were any new issues on a new snapshot of SEACAS, then we would open new issues.

Just to verify, looking at the SEACASIoss_exodus32_XXX tests that ran in the CUDA builds on hansen yesterday, we can see that the tests:

SEACASIoss_exodus32_to_exodus32
SEACASIoss_exodus32_to_exodus32_pnetcdf
SEACASIoss_exodus32_to_exodus64

reported failing above are passing in all of the CUDA builds.

NOTE: We currently don't have working CUDA builds on 'white' and 'ride' due to a system upgrade (see TRIL-215).

bartlettroscoe added the type: bug, pkg: seacas, and client: ATDM labels Apr 27, 2018

bartlettroscoe added this to the Keep promoted "ATDM" builds of Trilinos clean milestone Apr 27, 2018

bartlettroscoe mentioned this issue Apr 27, 2018

Set up a CUDA build for an auto PR build #2464

Closed

bartlettroscoe mentioned this issue Apr 27, 2018

Revert "Merge pull request #2625 from gsjaardema/seacas_github_snapshot" #2653

Merged

bartlettroscoe mentioned this issue May 4, 2018

Framework: Allow commits to non-code directories w/o autotesting #2594

Closed

bartlettroscoe closed this as completed May 22, 2018

prwolfe mentioned this issue Jun 13, 2018

Import stk development from Sierra #2930

Merged

2 tasks

bartlettroscoe added the PA: Data Services label Nov 30, 2018

trilinos-autotester mentioned this issue Jun 9, 2021

Tpetra: Skip unpackAndCombine #9133

Merged

trilinos-autotester mentioned this issue Nov 4, 2021

Trilinos Master Merge PR Generator: Auto PR created to promote from master_merge_20211104_000553 branch to master #9899

Closed
