
Memory corruption error #840

Closed
anshumang opened this issue Apr 29, 2015 · 12 comments

@anshumang

Has anyone seen this before?

Starting program: /home/agoswami/computationalRadiationPhysics/release-branch/build-temp/build_picongpu/picongpu -g 32 32 32 -d 1 1 1 -s 10
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffecc1f700 (LWP 14837)]
[New Thread 0x7fffebbfe700 (LWP 14838)]
[New Thread 0x7fffea471700 (LWP 14842)]
PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff78fb7c4 in opal_memory_ptmalloc2_int_malloc () from /usr/lib/libmpi.so.1
(gdb) bt
#0  0x00007ffff78fb7c4 in opal_memory_ptmalloc2_int_malloc () from /usr/lib/libmpi.so.1
#1  0x00007ffff78fdaf5 in opal_memory_ptmalloc2_int_memalign () from /usr/lib/libmpi.so.1
#2  0x00007ffff78fdf3c in opal_memory_ptmalloc2_memalign () from /usr/lib/libmpi.so.1
#3  0x00007ffff6497f2d in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff6498029 in operator new[](unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x000000000076a01f in PMacc::nvidia::memory::MemoryInfo::isSharedMemoryPool (this=0xdfdd98 <PMacc::nvidia::memory::MemoryInfo::getInstance()::instance>)
    at /home/agoswami/computationalRadiationPhysics/release-branch/picongpu/src/picongpu/../libPMacc/include/nvidia/memory/MemoryInfo.hpp:88
#6  0x0000000000788629 in picongpu::MySimulation::init (this=0xf1d3a0) at /home/agoswami/computationalRadiationPhysics/release-branch/picongpu/src/picongpu/include/simulationControl/MySimulation.hpp:276
#7  0x00000000007c32e2 in PMacc::SimulationHelper<3u>::startSimulation (this=0xf1d3a0)
    at /home/agoswami/computationalRadiationPhysics/release-branch/picongpu/src/picongpu/../libPMacc/include/simulationControl/SimulationHelper.hpp:180
#8  0x00000000007a814f in picongpu::SimulationStarter<picongpu::InitialiserController, picongpu::PluginController, picongpu::MySimulation>::start (this=0x7fffffffe320)
    at /home/agoswami/computationalRadiationPhysics/release-branch/picongpu/src/picongpu/include/simulationControl/SimulationStarter.hpp:86
#9  0x000000000075b0e4 in main (argc=11, argv=0x7fffffffe458) at /home/agoswami/computationalRadiationPhysics/release-branch/picongpu/src/picongpu/main.cu:56
@ax3l ax3l added the question label Apr 29, 2015
@ax3l
Member

ax3l commented Apr 29, 2015

Interesting, it seems to hang in the isSharedMemoryPool check, which allocates and frees some memory to find out if you are on an SoC-like device such as the Jetson TK1.

But since it fails on new, it sounds like heap corruption to me...

What host system (OS, compiler & RAM) and GPU are you using, and how much memory do both have?

@ax3l ax3l added this to the Open Beta milestone Apr 29, 2015
@ax3l
Member

ax3l commented Apr 29, 2015

Also, can you try to run valgrind on that?
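A minimal sketch of such a valgrind run, reusing the binary path and flags from the gdb session above (adjust paths to your build; memcheck often reports the first invalid write long before the allocator actually crashes):

```shell
# Run the simulation under valgrind's memcheck tool to locate the
# first invalid read/write that corrupts the malloc arena.
# Binary name and flags are taken from the session above; adjust as needed.
valgrind --tool=memcheck --track-origins=yes \
    ./picongpu -g 32 32 32 -d 1 1 1 -s 10
```

Note that valgrind slows the run down considerably, so a short run (`-s 10` as above) is usually enough.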

@anshumang
Author

Host =>
OS : Ubuntu 14.04.1 LTS
Compiler : g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2
RAM :
MemTotal: 12292376 kB
MemFree: 6172092 kB

GPU (Using the K40c) =>
Wed Apr 29 12:24:08 2015
+------------------------------------------------------+
| NVIDIA-SMI 346.29 Driver Version: 346.29 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla C2050 Off | 0000:08:00.0 Off | Off |
| 30% 44C P0 N/A / N/A | 6MiB / 3071MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K40c Off | 0000:82:00.0 Off | 0 |
| 23% 29C P0 64W / 235W | 23MiB / 11519MiB | 99% Default |
+-------------------------------+----------------------+----------------------+

@ax3l
Member

ax3l commented Apr 29, 2015

Can you post the output of cmake -L . in the build dir, too?

@anshumang
Author

agoswami@shiva:~/computationalRadiationPhysics/release-branch/build-temp$ cmake -L
CMake Error: The source directory "/home/agoswami/computationalRadiationPhysics/release-branch/build-temp" does not appear to contain CMakeLists.txt.
Specify --help for usage, or press the help button on the CMake GUI.
-- Cache values
ADIOS_CONFIG:FILEPATH=ADIOS_CONFIG-NOTFOUND
CMAKE_BUILD_TYPE:STRING=
CMAKE_INSTALL_PREFIX:PATH=/home/agoswami/computationalRadiationPhysics/release-branch/param-sets-temp/KH
CUDA_ARCH:STRING=sm_20
CUDA_BUILD_CUBIN:BOOL=OFF
CUDA_BUILD_EMULATION:BOOL=OFF
CUDA_FTZ:STRING=--ftz=false
CUDA_HOST_COMPILER:FILEPATH=/usr/bin/cc
CUDA_KEEP_FILES:BOOL=OFF
CUDA_MATH:STRING=--use_fast_math
CUDA_MEMTEST_DIR:PATH=/home/agoswami/computationalRadiationPhysics/release-branch/picongpu/src/cuda_memtest
CUDA_MEMTEST_RELEASE:BOOL=ON
CUDA_SDK_ROOT_DIR:PATH=CUDA_SDK_ROOT_DIR-NOTFOUND
CUDA_SEPARABLE_COMPILATION:BOOL=OFF
CUDA_SHOW_CODELINES:BOOL=OFF
CUDA_SHOW_REGISTER:BOOL=OFF
CUDA_TOOLKIT_ROOT_DIR:PATH=/usr/local/cuda
CUDA_VERBOSE_BUILD:BOOL=OFF
MPI_EXTRA_LIBRARY:STRING=/usr/lib/libmpi.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libhwloc.so
MPI_INFO_DIR:PATH=/home/agoswami/computationalRadiationPhysics/release-branch/picongpu/src/mpiInfo
MPI_LIBRARY:FILEPATH=/usr/lib/libmpi_cxx.so
PIC_COPY_ON_INSTALL:STRING=include/simulation_defines;submit
PIC_ENABLE_INSITU_VOLVIS:BOOL=OFF
PIC_EXTENSION_PATH:PATH=/home/agoswami/computationalRadiationPhysics/release-branch/param-sets-temp/KH
PIC_RELEASE:BOOL=OFF
PIC_VERBOSE:STRING=1
PMACC_BLOCKING_KERNEL:BOOL=OFF
PMACC_VERBOSE:STRING=0
PNGwriter_ROOT_DIR:PATH=PNGwriter_ROOT_DIR-NOTFOUND
SCOREP_ENABLE:BOOL=OFF
Splash_ROOT_DIR:PATH=Splash_ROOT_DIR-NOTFOUND
VAMPIR_ENABLE:BOOL=OFF
VT_INST_FILE_FILTER:STRING=stl,usr/include,libgpugrid,vector_types.h,Vector.hpp,DeviceBuffer.hpp,DeviceBufferIntern.hpp,Buffer.hpp,StrideMapping.hpp,StrideMappingMethods.hpp,MappingDescription.hpp,AreaMapping.hpp,AreaMappingMethods.hpp,ExchangeMapping.hpp,ExchangeMappingMethods.hpp,DataSpace.hpp,Manager.hpp,Manager.tpp,Transaction.hpp,Transaction.tpp,TransactionManager.hpp,TransactionManager.tpp,Vector.tpp,Mask.hpp,ITask.hpp,EventTask.hpp,EventTask.tpp,StandartAccessor.hpp,StandartNavigator.hpp,HostBuffer.hpp,HostBufferIntern.hpp
VT_INST_FUNC_FILTER:STRING=vector,Vector,dim3,GPUGrid,execute,allocator,Task,Manager,Transaction,Mask,operator,DataSpace,PitchedBox,Event,new,getGridDim,GetCurrentDataSpaces,MappingDescription,getOffset,getParticlesBuffer,getDataSpace,getInstance

@ax3l
Member

ax3l commented Apr 29, 2015

Are you running additional tasks on this machine?

The host only has 12 GB of memory and only 6 GB are free... the K40c alone has 12 GB of memory, but we usually assume the host has at least the same amount of RAM available that we can use for double-buffering.
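A quick way to compare the two numbers before a run (the nvidia-smi query flags are an assumption for this driver generation; check `nvidia-smi --help-query-gpu` on your system):

```shell
# Compare free host RAM against total GPU memory before launching.
# Free host RAM should be at least the GPU memory PIConGPU will use.
grep MemFree /proc/meminfo
command -v nvidia-smi >/dev/null && \
    nvidia-smi --query-gpu=memory.total --format=csv || true
```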

@ax3l
Member

ax3l commented Apr 29, 2015

Probably the easiest solution to your problem: double your host memory (or allocate something on the GPU so PIConGPU can only use half of the 12 GB, matching the free 6 GB on the host - which would be a pity!)

@anshumang
Author

Oh, that is the problem then... this is a shared machine and another student is running some large tasks... I could probably use the C2050 with 3 GB memory, no?

@ax3l
Member

ax3l commented Apr 29, 2015

Sure! Just change the environment variable CUDA_VISIBLE_DEVICES.
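For example (index 0 for the C2050 is an assumption based on the nvidia-smi listing above; CUDA's device enumeration can differ from nvidia-smi's ordering, so verify on your system):

```shell
# Make only one GPU (here: assumed index 0, the Tesla C2050) visible
# to CUDA applications started from this shell.
export CUDA_VISIBLE_DEVICES=0
echo "CUDA will see device(s): $CUDA_VISIBLE_DEVICES"
# then launch as before:
#   ./picongpu -g 32 32 32 -d 1 1 1 -s 10
```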

@anshumang
Author

thanks for your help in finding the problem 👍

@ax3l
Member

ax3l commented Apr 29, 2015

You are welcome :)

When designing systems from now on, try to add at least the same amount of memory in the host as you have on the device. Host memory is comparatively cheap, so adding twice the amount that a node has in its devices is a good idea and allows for neat tricks like time-averaging of large data sets (not yet implemented).

@ax3l ax3l closed this as completed Apr 29, 2015