LAMMPS Seg faulting after installing it from the MACE repo #819

Open
ShubhangG opened this issue Feb 7, 2025 · 0 comments

Describe the bug
Hello, after following the MACE with LAMMPS installation guide step by step (https://mace-docs.readthedocs.io/en/latest/guide/lammps.html),
I tried running LAMMPS on my current cluster, but it fails with a segmentation fault:

[ccc0420:2262807:0:2262807] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x440000e0)
==== backtrace (tid:2262807) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x2e4) [0x7f0e9cbc4e44]
 1  /lib64/libucs.so.0(+0x2a4cd) [0x7f0e9cbc64cd]
 2  /lib64/libucs.so.0(+0x2a6aa) [0x7f0e9cbc66aa]
 3  /lib64/libc.so.6(+0x3e6f0) [0x7f0e9ce046f0]
 4  /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40(PMPI_Comm_rank+0x33) [0x7f0eb797efa3]
 5  /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/build/bin/lmp() [0x59a51d]
 6  /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/build/bin/lmp() [0x4953f0]
 7  /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/build/bin/lmp() [0x44312f]
 8  /lib64/libc.so.6(+0x29590) [0x7f0e9cdef590]
 9  /lib64/libc.so.6(__libc_start_main+0x80) [0x7f0e9cdef640]
10  /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/build/bin/lmp() [0x444935]
=================================

The gdb output with backtrace is:

Thread 1 "lmp" received signal SIGSEGV, Segmentation fault.
0x00007ffff7c94fa3 in PMPI_Comm_rank () from /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40
(gdb) backtrace
#0  0x00007ffff7c94fa3 in PMPI_Comm_rank () from /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40
#1  0x00000000005f4301 in LAMMPS_NS::Universe::Universe (this=0x33c6930, lmp=0x346e160, communicator=1140850688)
    at /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/src/universe.cpp:33
#2  0x0000000000436e7d in LAMMPS_NS::LAMMPS::LAMMPS (this=0x346e160, narg=1, arg=0x7fffffffad18, communicator=1140850688)
    at /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/src/lammps.cpp:140
#3  0x0000000000412a16 in main (argc=1, argv=0x7fffffffad18) at /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/src/main.cpp:77
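
A note on the backtrace above, offered as a guess rather than a diagnosis: `communicator=1140850688` is `0x44000000` in hex, which is the integer handle that MPICH-derived MPI headers (Intel MPI included) use for `MPI_COMM_WORLD`, whereas Open MPI represents communicators as pointers. The faulting address `0x440000e0` is that value plus a small offset, which would be consistent with Open MPI's `PMPI_Comm_rank` treating an Intel MPI handle as a pointer. The Python sketch below only does the arithmetic; the constant and function names are labels invented for this sketch.

```python
# Values copied from the gdb backtrace and the crash message above.
communicator = 1140850688      # `communicator=` argument in frames #1 and #2
fault_address = 0x440000E0     # "address not mapped to object at address ..."

# Assumption: 0x44000000 is the MPI_COMM_WORLD handle in MPICH-derived
# headers (Intel MPI included). The name is only a label for this sketch.
MPICH_STYLE_COMM_WORLD = 0x44000000


def describe(comm: int, fault: int) -> str:
    """Summarise how the communicator value relates to the faulting address."""
    offset = fault - comm
    matches = comm == MPICH_STYLE_COMM_WORLD
    return f"communicator={comm:#x} mpich_style={matches} fault_offset={offset:#x}"


print(describe(communicator, fault_address))
```

If this reading is right, the binary was compiled against one MPI's `mpi.h` and linked or run against the other MPI's library, which would also explain why the crash happens in the very first MPI call after `MPI_Init`.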

The valgrind output is:

==1542529== Memcheck, a memory error detector
==1542529== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==1542529== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==1542529== Command: /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/build/bin/lmp
==1542529== 
==1542529== Warning: set address range perms: large range [0x4dbc000, 0x1f11c000) (defined)
hwloc x86 backend cannot work under Valgrind, disabling.
May be reenabled by dumping CPUIDs with hwloc-gather-cpuid
and reloading them under Valgrind with HWLOC_CPUID_PATH.
hwloc x86 backend cannot work under Valgrind, disabling.
May be reenabled by dumping CPUIDs with hwloc-gather-cpuid
and reloading them under Valgrind with HWLOC_CPUID_PATH.
==1542529== Invalid read of size 1
==1542529==    at 0x4955FA3: PMPI_Comm_rank (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x59A51C: LAMMPS_NS::Universe::Universe(LAMMPS_NS::LAMMPS*, int) (universe.cpp:33)
==1542529==    by 0x4953EF: LAMMPS_NS::LAMMPS::LAMMPS(int, char**, int) (lammps.cpp:140)
==1542529==    by 0x44312E: main (main.cpp:77)
==1542529==  Address 0x440000e0 is not stack'd, malloc'd or (recently) free'd
==1542529== 
[cc-login3:1542529:0:1542529] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x440000e0)
==== backtrace (tid:1542529) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x2e4) [0x1f936e44]
 1  /lib64/libucs.so.0(+0x2a4cd) [0x1f9384cd]
 2  /lib64/libucs.so.0(+0x2a6aa) [0x1f9386aa]
 3  /lib64/libc.so.6(+0x3e6f0) [0x1f5776f0]
 4  /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40(PMPI_Comm_rank+0x33) [0x4955fa3]
 5  /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/build/bin/lmp() [0x59a51d]
 6  /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/build/bin/lmp() [0x4953f0]
 7  /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/build/bin/lmp() [0x44312f]
 8  /lib64/libc.so.6(+0x29590) [0x1f562590]
 9  /lib64/libc.so.6(__libc_start_main+0x80) [0x1f562640]
10  /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/build/bin/lmp() [0x444935]
=================================
==1542529== 
==1542529== Process terminating with default action of signal 11 (SIGSEGV)
==1542529==    at 0x1F5C494C: __pthread_kill_implementation (in /usr/lib64/libc.so.6)
==1542529==    by 0x1F577645: raise (in /usr/lib64/libc.so.6)
==1542529==    by 0x1F5776EF: ??? (in /usr/lib64/libc.so.6)
==1542529==    by 0x4955FA2: PMPI_Comm_rank (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529== 
==1542529== HEAP SUMMARY:
==1542529==     in use at exit: 38,684,005 bytes in 310,595 blocks
==1542529==   total heap usage: 1,067,851 allocs, 757,256 frees, 105,080,647 bytes allocated
==1542529== 
==1542529== 5 bytes in 1 blocks are definitely lost in loss record 1,855 of 226,598
==1542529==    at 0x484480F: malloc (vg_replace_malloc.c:442)
==1542529==    by 0x1F5D512E: strdup (in /usr/lib64/libc.so.6)
==1542529==    by 0x1F7E2C0C: opal_common_ucx_mca_var_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x4AEF231: mca_pml_ucx_component_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x1F77B9A1: mca_base_framework_components_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F77C11B: mca_base_framework_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F77C1CF: mca_base_framework_open (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x4943A02: ompi_mpi_instance_init_common (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x4944733: ompi_mpi_instance_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x4937017: ompi_mpi_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x496B74D: PMPI_Init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x44310C: main (main.cpp:48)
==1542529== 
==1542529== 5 bytes in 1 blocks are definitely lost in loss record 1,856 of 226,598
==1542529==    at 0x484480F: malloc (vg_replace_malloc.c:442)
==1542529==    by 0x1F5D512E: strdup (in /usr/lib64/libc.so.6)
==1542529==    by 0x1F780E38: register_variable (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F78218C: mca_base_var_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F7E2B02: opal_common_ucx_mca_var_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x4AEF231: mca_pml_ucx_component_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x1F77B9A1: mca_base_framework_components_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F77C11B: mca_base_framework_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F77C1CF: mca_base_framework_open (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x4943A02: ompi_mpi_instance_init_common (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x4944733: ompi_mpi_instance_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x4937017: ompi_mpi_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529== 
==1542529== 10 bytes in 1 blocks are definitely lost in loss record 2,966 of 226,598
==1542529==    at 0x484480F: malloc (vg_replace_malloc.c:442)
==1542529==    by 0x1F5D512E: strdup (in /usr/lib64/libc.so.6)
==1542529==    by 0x1FA78C14: pmix_rte_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libpmix.so.2.9.4)
==1542529==    by 0x1FA20518: PMIx_Init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libpmix.so.2.9.4)
==1542529==    by 0x493AA23: ompi_rte_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x49439C9: ompi_mpi_instance_init_common (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x4944733: ompi_mpi_instance_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x4937017: ompi_mpi_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x496B74D: PMPI_Init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x44310C: main (main.cpp:48)
==1542529== 
==1542529== 60 bytes in 1 blocks are definitely lost in loss record 114,568 of 226,598
==1542529==    at 0x484480F: malloc (vg_replace_malloc.c:442)
==1542529==    by 0x1F5D512E: strdup (in /usr/lib64/libc.so.6)
==1542529==    by 0x1F7E2BDC: opal_common_ucx_mca_var_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x4AEF231: mca_pml_ucx_component_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x1F77B9A1: mca_base_framework_components_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F77C11B: mca_base_framework_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F77C1CF: mca_base_framework_open (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x4943A02: ompi_mpi_instance_init_common (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x4944733: ompi_mpi_instance_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x4937017: ompi_mpi_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x496B74D: PMPI_Init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x44310C: main (main.cpp:48)
==1542529== 
==1542529== 60 bytes in 1 blocks are definitely lost in loss record 114,569 of 226,598
==1542529==    at 0x484480F: malloc (vg_replace_malloc.c:442)
==1542529==    by 0x1F5D512E: strdup (in /usr/lib64/libc.so.6)
==1542529==    by 0x1F780E38: register_variable (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F78218C: mca_base_var_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F7E2ABC: opal_common_ucx_mca_var_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x4AEF231: mca_pml_ucx_component_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x1F77B9A1: mca_base_framework_components_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F77C11B: mca_base_framework_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F77C1CF: mca_base_framework_open (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x4943A02: ompi_mpi_instance_init_common (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x4944733: ompi_mpi_instance_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x4937017: ompi_mpi_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529== 
==1542529== 75 bytes in 1 blocks are definitely lost in loss record 151,679 of 226,598
==1542529==    at 0x484C184: realloc (vg_replace_malloc.c:1690)
==1542529==    by 0x1F5B795F: __vasprintf_internal (in /usr/lib64/libc.so.6)
==1542529==    by 0x1F794E08: opal_vasprintf (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F794EA6: opal_asprintf (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x4AB7BDB: component_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x1F77B9A1: mca_base_framework_components_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F77C11B: mca_base_framework_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F77C1CF: mca_base_framework_open (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x4943A02: ompi_mpi_instance_init_common (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x4944733: ompi_mpi_instance_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x4937017: ompi_mpi_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x496B74D: PMPI_Init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529== 
==1542529== 156 bytes in 1 blocks are definitely lost in loss record 198,172 of 226,598
==1542529==    at 0x484C184: realloc (vg_replace_malloc.c:1690)
==1542529==    by 0x1F5B795F: __vasprintf_internal (in /usr/lib64/libc.so.6)
==1542529==    by 0x1F794E08: opal_vasprintf (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F794EA6: opal_asprintf (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x4AB7B8B: component_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x1F77B9A1: mca_base_framework_components_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F77C11B: mca_base_framework_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F77C1CF: mca_base_framework_open (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x4943A02: ompi_mpi_instance_init_common (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x4944733: ompi_mpi_instance_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x4937017: ompi_mpi_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x496B74D: PMPI_Init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529== 
==1542529== 159 bytes in 1 blocks are definitely lost in loss record 198,200 of 226,598
==1542529==    at 0x484C184: realloc (vg_replace_malloc.c:1690)
==1542529==    by 0x1F5B795F: __vasprintf_internal (in /usr/lib64/libc.so.6)
==1542529==    by 0x1F794E08: opal_vasprintf (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F794EA6: opal_asprintf (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x4AB7C28: component_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x1F77B9A1: mca_base_framework_components_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F77C11B: mca_base_framework_register (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x1F77C1CF: mca_base_framework_open (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libopen-pal.so.80.0.1)
==1542529==    by 0x4943A02: ompi_mpi_instance_init_common (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x4944733: ompi_mpi_instance_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x4937017: ompi_mpi_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x496B74D: PMPI_Init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529== 
==1542529== 464 bytes in 1 blocks are possibly lost in loss record 217,763 of 226,598
==1542529==    at 0x484BF70: calloc (vg_replace_malloc.c:1595)
==1542529==    by 0x4011652: UnknownInlinedFun (rtld-malloc.h:44)
==1542529==    by 0x4011652: allocate_dtv (dl-tls.c:401)
==1542529==    by 0x4012111: _dl_allocate_tls (dl-tls.c:679)
==1542529==    by 0x1F5C38C4: pthread_create@@GLIBC_2.34 (in /usr/lib64/libc.so.6)
==1542529==    by 0x1F946603: ucs_pthread_create (in /usr/lib64/libucs.so.0.0.0)
==1542529==    by 0x1F92CAF8: ??? (in /usr/lib64/libucs.so.0.0.0)
==1542529==    by 0x1F92CB49: ??? (in /usr/lib64/libucs.so.0.0.0)
==1542529==    by 0x1F92ADF9: ucs_async_set_event_handler (in /usr/lib64/libucs.so.0.0.0)
==1542529==    by 0x1F93D0FE: ??? (in /usr/lib64/libucs.so.0.0.0)
==1542529==    by 0x1F93D287: ucs_rcache_create (in /usr/lib64/libucs.so.0.0.0)
==1542529==    by 0x1F882BB2: ??? (in /usr/lib64/libucp.so.0.0.0)
==1542529==    by 0x1F882C00: ucp_mem_rcache_init (in /usr/lib64/libucp.so.0.0.0)
==1542529== 
==1542529== 464 bytes in 1 blocks are possibly lost in loss record 217,764 of 226,598
==1542529==    at 0x484BF70: calloc (vg_replace_malloc.c:1595)
==1542529==    by 0x4011652: UnknownInlinedFun (rtld-malloc.h:44)
==1542529==    by 0x4011652: allocate_dtv (dl-tls.c:401)
==1542529==    by 0x4012111: _dl_allocate_tls (dl-tls.c:679)
==1542529==    by 0x1F5C38C4: pthread_create@@GLIBC_2.34 (in /usr/lib64/libc.so.6)
==1542529==    by 0x1FA0CF38: pmix_thread_start (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libpmix.so.2.9.4)
==1542529==    by 0x1FA79B3F: pmix_progress_thread_start (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libpmix.so.2.9.4)
==1542529==    by 0x1FA78BB6: pmix_rte_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libpmix.so.2.9.4)
==1542529==    by 0x1FA20518: PMIx_Init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libpmix.so.2.9.4)
==1542529==    by 0x493AA23: ompi_rte_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x49439C9: ompi_mpi_instance_init_common (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x4944733: ompi_mpi_instance_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529==    by 0x4937017: ompi_mpi_init (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==1542529== 
==1542529== LEAK SUMMARY:
==1542529==    definitely lost: 530 bytes in 8 blocks
==1542529==    indirectly lost: 0 bytes in 0 blocks
==1542529==      possibly lost: 928 bytes in 2 blocks
==1542529==    still reachable: 38,682,547 bytes in 310,585 blocks
==1542529==                       of which reachable via heuristic:
==1542529==                         stdstring          : 6,226,941 bytes in 150,239 blocks
==1542529==         suppressed: 0 bytes in 0 blocks
==1542529== Reachable blocks (those to which a pointer was found) are not shown.
==1542529== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==1542529== 
==1542529== For lists of detected and suppressed errors, rerun with: -s
==1542529== ERROR SUMMARY: 11 errors from 11 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

I am on the University of Illinois campus cluster, with the following modules loaded:

Currently Loaded Modules:
  1) lmod                6) intel/compiler-rt/2025.0.4
  2) os_paths            7) intel/mkl/2025.0
  3) StdEnv              8) gcc/13.3.0
  4) intel/mpi/2021.14   9) openmpi/5.0.1-gcc-13.3.0
  5) intel/tbb/2022.0   10) anaconda3/2024.10
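
Both `intel/mpi/2021.14` and `openmpi/5.0.1-gcc-13.3.0` appear in this list, so the build may have picked up headers from one MPI and libraries from the other. Below is a minimal Python sketch for spotting that situation from the text of the module listing. The `MPI_PREFIXES` tuple and the function name are illustrative assumptions, not part of Lmod.

```python
import re

# Module listing as printed above.
MODULE_LIST = """\
  1) lmod                6) intel/compiler-rt/2025.0.4
  2) os_paths            7) intel/mkl/2025.0
  3) StdEnv              8) gcc/13.3.0
  4) intel/mpi/2021.14   9) openmpi/5.0.1-gcc-13.3.0
  5) intel/tbb/2022.0   10) anaconda3/2024.10
"""

# Hypothetical prefixes for this sketch; extend for other MPI stacks.
MPI_PREFIXES = ("intel/mpi", "openmpi", "mpich", "mvapich")


def loaded_mpi_modules(listing: str) -> list[str]:
    """Return the loaded modules whose names start with a known MPI prefix."""
    # Each entry looks like "4) intel/mpi/2021.14"; capture the name after "N)".
    names = re.findall(r"\d+\)\s+(\S+)", listing)
    return [name for name in names if name.startswith(MPI_PREFIXES)]


mpi_modules = loaded_mpi_modules(MODULE_LIST)
if len(mpi_modules) > 1:
    print("More than one MPI module loaded:", ", ".join(mpi_modules))
```

If more than one MPI module is reported, a reasonable next step might be to unload all but one of them and rebuild LAMMPS from a clean build directory so that the compiler wrapper, headers, and runtime library all come from the same MPI.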

LAMMPS ran fine on another supercomputer we use, Delta, but it fails on this campus cluster and I am not sure why. I have opened a ticket with the campus HPC support and am also opening this issue in case you have any insight.
