Segmentation fault after test pytorch #8841

Closed
4 tasks done
Secbone opened this issue Jul 1, 2021 · 8 comments
Labels
status: needs information (reporter needs to provide more information; can be closed after 2 or more weeks of inactivity)
type: question (general question, might be closed after 2 weeks of inactivity)

Comments

@Secbone

Secbone commented Jul 1, 2021

Python 3.9, CentOS. I use pytest with PyTorch. After the test cases run successfully, I get a segmentation fault. Here is the failing run in GitHub Actions:

test case

  • a detailed description of the bug or problem you are having
  • output of pip list from the virtual environment you are using
  • pytest and operating system versions
  • minimal example if possible
@The-Compiler
Member

This is most certainly not an issue with pytest, but with your code under test (or rather one of your libraries, in native code) doing something wrong.

Usually pytest's faulthandler should give you a stacktrace, but perhaps it was already disabled here at the point the crash happened. Can you try invoking python3 -X faulthandler -m pytest -p no:faulthandler -x toad instead and see if that gives you more information?
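For context, here is a minimal sketch of what `-X faulthandler` is meant to provide: when a fatal signal such as SIGSEGV arrives, Python's `faulthandler` module writes the Python traceback to stderr before the process dies. The helper name `run_crashing_child` is hypothetical, and the child process sends SIGSEGV to itself purely for illustration (this assumes a POSIX system). A real crash inside native library code, or one that happens during interpreter shutdown, may produce less output than this.

```python
import subprocess
import sys

# Child program: sends SIGSEGV to itself so that faulthandler has something
# to report. This only simulates a crash; it is not the PyTorch failure.
CHILD_CODE = """
import os
import signal

def crash():
    os.kill(os.getpid(), signal.SIGSEGV)

crash()
"""


def run_crashing_child():
    """Run the child with -X faulthandler and return (returncode, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", CHILD_CODE],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    returncode, stderr = run_crashing_child()
    print("exit status:", returncode)
    print(stderr)
```

If the handler is active, stderr should start with a "Fatal Python error" line followed by the stack of the crashing thread. When that output is missing, as in this thread, a native debugger is usually the next step.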

@The-Compiler The-Compiler added status: needs information reporter needs to provide more information; can be closed after 2 or more weeks of inactivity type: question general question, might be closed after 2 weeks of inactivity labels Jul 1, 2021
@Secbone
Author

Secbone commented Jul 5, 2021

@The-Compiler I have replaced the command with python3 -X faulthandler -m pytest -p no:faulthandler -x toad/nn/module_test.py, but there are still no more details. Please have a look.

test case

@The-Compiler
Member

Hm, that's unfortunate. I'm assuming adding -s (to disable pytest's output capturing) won't change anything?

@Secbone
Author

Secbone commented Jul 5, 2021

@The-Compiler It still gives no more detail; the case is here 😞

@The-Compiler
Member

Perhaps you can run this snippet after pytest to give you a gdb stacktrace after the crash:

#!/bin/bash
find . \( -name "*.core" -o -name core \) -exec gdb --batch --quiet -ex "thread apply all bt" "$(readlink -f "$(which python3)")" {} \;

or run Python under gdb directly:

gdb --batch --quiet -ex r -ex "thread apply all bt" --args "$(readlink -f "$(which python3)")" -m pytest -p no:faulthandler -x toad/nn/module_test.py

Not sure if those will work 1:1; I didn't really test them.
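As a rough sketch, the second gdb invocation can also be assembled as an argument list in Python, which sidesteps shell quoting mistakes. The function name `build_gdb_command` and its parameters are hypothetical; the gdb options are the ones used in the command above, and gdb is assumed to be installed if the command is actually run.

```python
import os
import shlex
import shutil
import sys


def build_gdb_command(pytest_args, python_exe=None):
    """Return an argv list that runs pytest under gdb and dumps all backtraces.

    Hypothetical helper for illustration; it only builds the command.
    """
    # Resolve symlinks so gdb loads the real interpreter binary, which is
    # what `readlink -f "$(which python3)"` does in the shell version.
    exe = python_exe or shutil.which("python3") or sys.executable
    exe = os.path.realpath(exe)
    return [
        "gdb", "--batch", "--quiet",
        "-ex", "r",                    # run the program
        "-ex", "thread apply all bt",  # once it stops, print every thread's stack
        "--args", exe,
        "-m", "pytest", "-p", "no:faulthandler", *pytest_args,
    ]


if __name__ == "__main__":
    cmd = build_gdb_command(["-x", "toad/nn/module_test.py"])
    print(shlex.join(cmd))
    # To execute it: import subprocess; subprocess.run(cmd, check=False)
```

Because the command is a list rather than a single string, each argument reaches gdb unchanged and no shell quoting is involved.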

@Secbone
Author

Secbone commented Jul 5, 2021

@The-Compiler here is the gdb information

@The-Compiler
Member

This shows that the segfault happens at

#0  0x00007ffff7e42ef3 in PyThreadState_Clear (tstate=0x7fffac000b60) at Python/pystate.c:785
#1  0x00007fffcf33022c in pybind11::gil_scoped_acquire::dec_ref() () from /opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/torch/lib/libtorch_python.so
#2  0x00007fffcf330269 in pybind11::gil_scoped_acquire::~gil_scoped_acquire() () from /opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/torch/lib/libtorch_python.so
#3  0x00007fffcf647049 in torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) () from /opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/torch/lib/libtorch_python.so
#4  0x00007fffcec916df in execute_native_thread_routine () from /opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/torch/lib/libtorch.so
#5  0x00007ffff79e3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007ffff7b1f293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Which looks like one of those PyTorch issues:

The first one seems like the closest one - it looks like it's been fixed in PyTorch 1.8, so you might want to try upgrading (according to your output you're still running 1.7.1). In any case, I'm closing this as it doesn't seem to be specific to pytest in any way.
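A quick way to check whether an environment is still on an affected release is to compare the installed version against 1.8, the version mentioned above. This is only a sketch: the helper names `parse_release`, `is_at_least`, and `installed_torch_version` are hypothetical, and the parsing keeps just the leading numeric part of the version string (so a local suffix such as `+cpu` is ignored).

```python
import re
from importlib import metadata


def parse_release(version):
    """Extract the leading numeric components, e.g. '1.7.1+cpu' -> (1, 7, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_at_least(version, minimum):
    """Return True if `version` is the same as or newer than `minimum`."""
    return parse_release(version) >= minimum


def installed_torch_version():
    """Return the installed torch version string, or None if it is absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_torch_version()
    if version is None:
        print("torch is not installed in this environment")
    elif is_at_least(version, (1, 8)):
        print(f"torch {version} is at least 1.8")
    else:
        print(f"torch {version} is older than 1.8; consider upgrading")
```

Reading the version from package metadata avoids importing torch itself, so the check cannot trigger the crash it is trying to diagnose.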

@Secbone
Author

Secbone commented Jul 6, 2021

@The-Compiler Thanks a lot! 👍
