
terminate called after throwing an instance of 'c10::Error' #7561

Closed
ghost opened this issue May 5, 2023 · 9 comments · Fixed by #7573

Comments

ghost commented May 5, 2023

🐛 Describe the bug

I've compiled torch and vision from their main branches. When running Automatic1111's webui for Stable Diffusion, I get the following error message:

terminate called after throwing an instance of 'c10::Error'
  what():  Tried to register an operator (image::decode_png(Tensor _0, int _1, bool _2) -> Tensor _0) with the same name and overload name multiple times. Each overload's schema should only be registered with a single call to def(). Duplicate registration: registered by RegisterOperators. Original registration: registered by RegisterOperators
Exception raised from registerDef at /home/user/SD/pytorch/aten/src/ATen/core/dispatch/Dispatcher.cpp:207 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xae (0x7fac4bcc06ee in /home/user/SD/stable-diffusion-webui/venv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7fac4bc7951d in /home/user/SD/stable-diffusion-webui/venv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::Dispatcher::registerDef(c10::FunctionSchema, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<at::Tag, std::allocator<at::Tag> >) + 0x923 (0x7fac40b698c3 in /home/user/SD/stable-diffusion-webui/venv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10::RegisterOperators::registerOp_(c10::RegisterOperators::Options&&) + 0x47b (0x7fac40bba4eb in /home/user/SD/stable-diffusion-webui/venv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10::RegisterOperators::checkSchemaAndRegisterOp_(c10::RegisterOperators::Options&&) + 0x41c (0x7fac40bbbf8c in /home/user/SD/stable-diffusion-webui/venv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: std::enable_if<c10::guts::is_function_type<at::Tensor (at::Tensor const&, long, bool)>::value&&(!std::is_same<at::Tensor (at::Tensor const&, long, bool), void (c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*)>::value), c10::RegisterOperators&&>::type c10::RegisterOperators::op<at::Tensor (at::Tensor const&, long, bool)>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, at::Tensor (*)(at::Tensor const&, long, bool), c10::RegisterOperators::Options&&) && + 0xd0 (0x7fab92ff0ac0 in /home/user/SD/stable-diffusion-webui/venv/lib/python3.8/site-packages/torchvision-0.16.0a0+370134-py3.8-linux-x86_64.egg/torchvision/image.so)
frame #6: <unknown function> + 0xb030 (0x7fab92fec030 in /home/user/SD/stable-diffusion-webui/venv/lib/python3.8/site-packages/torchvision-0.16.0a0+370134-py3.8-linux-x86_64.egg/torchvision/image.so)
frame #7: <unknown function> + 0x647e (0x7fac4e2aa47e in /lib64/ld-linux-x86-64.so.2)
frame #8: <unknown function> + 0x6568 (0x7fac4e2aa568 in /lib64/ld-linux-x86-64.so.2)
frame #9: _dl_catch_exception + 0xe5 (0x7fac4df74c85 in /lib/x86_64-linux-gnu/libc.so.6)
frame #10: <unknown function> + 0xdff6 (0x7fac4e2b1ff6 in /lib64/ld-linux-x86-64.so.2)
frame #11: _dl_catch_exception + 0x88 (0x7fac4df74c28 in /lib/x86_64-linux-gnu/libc.so.6)
frame #12: <unknown function> + 0xe34e (0x7fac4e2b234e in /lib64/ld-linux-x86-64.so.2)
frame #13: <unknown function> + 0x906bc (0x7fac4de906bc in /lib/x86_64-linux-gnu/libc.so.6)
frame #14: _dl_catch_exception + 0x88 (0x7fac4df74c28 in /lib/x86_64-linux-gnu/libc.so.6)
frame #15: _dl_catch_error + 0x33 (0x7fac4df74cf3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #16: <unknown function> + 0x901ae (0x7fac4de901ae in /lib/x86_64-linux-gnu/libc.so.6)
frame #17: dlopen + 0x48 (0x7fac4de90748 in /lib/x86_64-linux-gnu/libc.so.6)
frame #18: <unknown function> + 0x14d41 (0x7fac4bdabd41 in /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so)
frame #19: PyCFunction_Call + 0xe6 (0x522446 in python3.8)
frame #20: _PyObject_MakeTpCall + 0x313 (0x50ca93 in python3.8)
frame #21: _PyEval_EvalFrameDefault + 0x4a07 (0x5083a7 in python3.8)
frame #22: _PyEval_EvalCodeWithName + 0x2fa (0x5027da in python3.8)
frame #23: _PyObject_FastCallDict + 0x1fc (0x50c27c in python3.8)
frame #24: python3.8() [0x51ea03]
frame #25: _PyObject_MakeTpCall + 0x32b (0x50caab in python3.8)
frame #26: _PyEval_EvalFrameDefault + 0x4fe1 (0x508981 in python3.8)
frame #27: python3.8() [0x521e36]
frame #28: _PyEval_EvalFrameDefault + 0x4af1 (0x508491 in python3.8)
frame #29: _PyFunction_Vectorcall + 0x10f (0x5148ff in python3.8)
frame #30: _PyEval_EvalFrameDefault + 0x3a3 (0x503d43 in python3.8)
frame #31: _PyEval_EvalCodeWithName + 0x2fa (0x5027da in python3.8)
frame #32: PyEval_EvalCode + 0x27 (0x5d69a7 in python3.8)
frame #33: python3.8() [0x5dabd1]
frame #34: python3.8() [0x515470]
frame #35: PyVectorcall_Call + 0x2c4 (0x522794 in python3.8)
frame #36: _PyEval_EvalFrameDefault + 0x5df5 (0x509795 in python3.8)
frame #37: _PyEval_EvalCodeWithName + 0x2fa (0x5027da in python3.8)
frame #38: _PyFunction_Vectorcall + 0x1ad (0x51499d in python3.8)
frame #39: _PyEval_EvalFrameDefault + 0x4af1 (0x508491 in python3.8)
frame #40: _PyFunction_Vectorcall + 0x10f (0x5148ff in python3.8)
frame #41: _PyEval_EvalFrameDefault + 0x6c6 (0x504066 in python3.8)
frame #42: _PyFunction_Vectorcall + 0x10f (0x5148ff in python3.8)
frame #43: _PyEval_EvalFrameDefault + 0x3a3 (0x503d43 in python3.8)
frame #44: _PyFunction_Vectorcall + 0x10f (0x5148ff in python3.8)
frame #45: _PyEval_EvalFrameDefault + 0x3a3 (0x503d43 in python3.8)
frame #46: _PyFunction_Vectorcall + 0x10f (0x5148ff in python3.8)
frame #47: python3.8() [0x514264]
frame #48: _PyObject_CallMethodIdObjArgs + 0xe8 (0x524248 in python3.8)
frame #49: PyImport_ImportModuleLevelObject + 0x462 (0x5236b2 in python3.8)
frame #50: _PyEval_EvalFrameDefault + 0x3c74 (0x507614 in python3.8)
frame #51: _PyEval_EvalCodeWithName + 0x2fa (0x5027da in python3.8)
frame #52: PyEval_EvalCode + 0x27 (0x5d69a7 in python3.8)
frame #53: python3.8() [0x5dabd1]
frame #54: python3.8() [0x515470]
frame #55: PyVectorcall_Call + 0x2c4 (0x522794 in python3.8)
frame #56: _PyEval_EvalFrameDefault + 0x5df5 (0x509795 in python3.8)
frame #57: _PyEval_EvalCodeWithName + 0x2fa (0x5027da in python3.8)
frame #58: _PyFunction_Vectorcall + 0x1ad (0x51499d in python3.8)
frame #59: _PyEval_EvalFrameDefault + 0x4af1 (0x508491 in python3.8)
frame #60: _PyFunction_Vectorcall + 0x10f (0x5148ff in python3.8)
frame #61: _PyEval_EvalFrameDefault + 0x6c6 (0x504066 in python3.8)
frame #62: _PyFunction_Vectorcall + 0x10f (0x5148ff in python3.8)
frame #63: _PyEval_EvalFrameDefault + 0x3a3 (0x503d43 in python3.8)

Aborted (core dumped)

I'm not sure whether the bug lies with Automatic1111 or with vision, or even whether it's a bug at all, but I'm trying here first.

I'm running Ubuntu 22.04.2 and I have a 7600X CPU and a 7900 XTX GPU, if that matters. I'm also using ROCm 5.5.

Versions

$ python3.8 collect_env.py 
Collecting environment information...
Traceback (most recent call last):
  File "collect_env.py", line 606, in <module>
    main()
  File "collect_env.py", line 589, in main
    output = get_pretty_env_info()
  File "collect_env.py", line 584, in get_pretty_env_info
    return pretty_str(get_env_info())
  File "collect_env.py", line 437, in get_env_info
    hip_runtime_version = [s.rsplit(None, 1)[-1] for s in cfg if 'HIP Runtime' in s][0]
IndexError: list index out of range
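The collect_env crash itself is a simple robustness bug: the script indexes the first element of a list comprehension that can be empty when no line of the probed output mentions "HIP Runtime". A defensive sketch of that extraction (the helper name is mine, not the actual collect_env code) would fall back to "N/A" instead of raising:

```python
def get_hip_runtime_version(cfg_lines):
    """Extract the HIP runtime version from probed config output lines.

    Returns "N/A" instead of raising IndexError when no line mentions
    "HIP Runtime" (e.g. on incomplete or unusual ROCm setups).
    """
    matches = [s.rsplit(None, 1)[-1] for s in cfg_lines if "HIP Runtime" in s]
    return matches[0] if matches else "N/A"

# With a matching line, the trailing token is returned:
print(get_hip_runtime_version(["HIP Runtime Version: 5.5.30201"]))  # 5.5.30201
# Without one, we degrade gracefully instead of crashing:
print(get_hip_runtime_version(["some unrelated line"]))  # N/A
```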

cc @jeffdaily @jithunnair-amd

pmeier (Collaborator) commented May 8, 2023

Looks like an issue with how we are registering our decoding ops:

terminate called after throwing an instance of 'c10::Error'
  what():  Tried to register an operator (image::decode_png(Tensor _0, int _1, bool _2) -> Tensor _0) with the same name and overload name multiple times. Each overload's schema should only be registered with a single call to def(). Duplicate registration: registered by RegisterOperators. Original registration: registered by RegisterOperators
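For background, the c10 dispatcher allows exactly one def() per operator schema; a second registration of image::decode_png (here, from two object files whose static registrars both run at load time) aborts with the error above. A toy Python model of that invariant (purely illustrative; the real check lives in Dispatcher::registerDef):

```python
class ToyDispatcher:
    """Minimal model of c10's one-def()-per-schema rule (illustrative only)."""

    def __init__(self):
        self._schemas = {}

    def register_def(self, name, schema):
        # c10 keys on (name, overload name); a duplicate def() is fatal.
        if name in self._schemas:
            raise RuntimeError(
                f"Tried to register an operator ({name}{schema}) with the "
                f"same name and overload name multiple times."
            )
        self._schemas[name] = schema

d = ToyDispatcher()
d.register_def("image::decode_png", "(Tensor, int, bool) -> Tensor")  # first def(): fine
try:
    # A second shared object registering the same op triggers the error above.
    d.register_def("image::decode_png", "(Tensor, int, bool) -> Tensor")
except RuntimeError as e:
    print("duplicate registration:", e)
```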

pmeier (Collaborator) commented May 8, 2023

Not being able to run the environment collection script in your setup is concerning. I don't have a ROCm box at the moment. @malfet, could you have a look?

justinkb (Contributor) commented May 10, 2023

Seeing this too, on NixOS with ROCm 5.4 and an RX 6800 GPU. The issue is unrelated to Automatic1111, since I am not using it at all.

This is with pytorch 2.0.1 and torchvision 0.15.2, by the way (it also happened with 2.0.0 and 0.15.1).

pmeier (Collaborator) commented May 10, 2023

@justinkb Does the environment collection script work for you?

python -m 'torch.utils.collect_env'

justinkb (Contributor) commented:
I doubt it would work, due to the peculiarities of Nix. I did some digging, however, and it looks like ninja ends up generating build instructions like this for the hipified sources:

build /build/source/build/temp.linux-x86_64-cpython-310/build/source/torchvision/csrc/io/image/image.o: compile /build/source/torchvision/csrc/io/image/image.cpp
build /build/source/build/temp.linux-x86_64-cpython-310/build/source/torchvision/csrc/io/image/image_hip.o: compile /build/source/torchvision/csrc/io/image/image_hip.cpp

That is not what is supposed to happen, I think. The image lib ends up with two object files that both perform the registration at https://github.com/pytorch/vision/blob/main/torchvision/csrc/io/image/image.cpp#LL22C2-L22C2

justinkb (Contributor) commented May 10, 2023

So it ends up happening because of the CUDA JPEG decode in image.h, which gets hipified on ROCm. I can disable JPEG decode in my build to check whether the issue disappears completely (as a verification only, not a fix). Edit: I tried this, but it still hipified the relevant sources anyway, so that didn't prove or disprove anything.

justinkb (Contributor) commented:
Fixed by adding

os.remove(os.path.join(this_dir, "torchvision", "csrc", "io", "image", "image.cpp"))
os.remove(os.path.join(this_dir, "torchvision", "csrc", "io", "image", "image.h"))

in setup.py below the hipify_python.hipify invocation.
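To show where this lands, here is a hedged sketch of the workaround as a setup.py helper (the function name and the existence check are mine; the thread's version simply calls os.remove on the two paths directly, right after hipify_python.hipify runs):

```python
import os

def remove_prehipified_image_sources(this_dir):
    """Workaround for duplicate operator registration on ROCm builds:
    hipify copies image.cpp to image_hip.cpp, and compiling both makes
    two object files register the image ops. Dropping the originals
    leaves a single registration."""
    image_dir = os.path.join(this_dir, "torchvision", "csrc", "io", "image")
    for name in ("image.cpp", "image.h"):
        path = os.path.join(image_dir, name)
        if os.path.exists(path):  # guard added for safety; not in the original snippet
            os.remove(path)
```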

malfet (Contributor) commented May 10, 2023

Will take care of fixing/adding tests for collect_env...

ghost (Author) commented May 10, 2023

Fixed by adding

os.remove(os.path.join(this_dir, "torchvision", "csrc", "io", "image", "image.cpp"))
os.remove(os.path.join(this_dir, "torchvision", "csrc", "io", "image", "image.h"))

in setup.py below the hipify_python.hipify invocation.

I can confirm that this fixes the issue.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue May 19, 2023:
Should prevent broken collect_env reporting as shown in pytorch/vision#7561 (comment)
Pull Request resolved: #101844
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi