-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] pybind11::cast is not thread safe #2765
Comments
@zachariahreed, that's interesting. Do you have concrete code that manages to show this, or how did you run into this? I was about to shout "But of course you should hold the GIL", but then you already thought of that. |
The GIL is held when you get control back. The problem comes when another thread with pybind11-wrapped code gets scheduled during the time when you don't have control. Then you end up in a situation like:
The specific code that triggered this was use of |
Argh, right, yes. So rather than |
I don't think that's sufficient. During the time between when thread 1 gets control back and before the pop, if additional temporaries are created they'll end up in thread2's "arena". The flow control is not linear, so I don't think there's any way a |
You're probably right... To the best of your knowledge right now (not asking you to dive into pybind11), is it just |
I don't fully understand the details of the casting implementation / when temporaries are created. But, as far as I can tell, this seems to be the only piece of global / singleton data mutated. I think it's probably just |
So, does this mean I cannot use openmp or intel tbb ? |
I believe I am experiencing the same issue, but via a slightly different mechanism. In my case, I am observing this issue via temporaries stored in the pybind11/include/pybind11/detail/type_caster_base.h Lines 709 to 714 in 617cb65
If calling such a function concurrently from multiple threads, and where that function releases the GIL, I am intermittently observing the argument reference becoming "invalidated", i.e. showing garble at some point after the GIL is released. The way I was able to observe this was by copying the argument into a local variable prior to releasing the GIL, then comparing the copy to the argument once the GIL is released. I'm able to reproduce this frequently with two threads calling the same function concurrently within a loop. Capturing and inspecting items from the So in summary, it appears to not be not thread-safe to pass arguments by reference, where those arguments have implicit conversions specified - due to the aforementioned thread-safety issues of |
OK, so here is a small example that reproduces the above issue. I used the pybind11 module:
Python test:
Example results:
|
I can confirm that using thread-local storage (TLS) is indeed a solution that will solve this problem. @zachariahreed, what specifically around performance were you concerned about? As I understand, there is some concern around TLS overhead when used in a dynamic module, but I'm wondering whether this is anything to be particularly concerned about here. Some benchmarks (see https://testbit.eu/2015/thread-local-storage-benchmark) suggest very minimal overhead on modern platforms. Obviously it would be prudent to run benchmarks for this particular case, but I'm curious what others' thoughts may be around this. From my brief look into this, using a
v2.7.1...daniel-smith:bugfix/loader_life_support Another approach might be to have an instance of a PyObject* list per call - wrapped in some kind of "context" that gets passed through with each function call here: pybind11/include/pybind11/pybind11.h Lines 802 to 805 in cb60ed4
That would avoid the use of TLS - but would also involve an arguably unfeasible amount of refactoring - as the instance would also need to make its way through to the type casters. Not sure if anyone else has any opinions on either of these approaches, or any alternative approaches. I personally feel that TLS seems appropriate, but would be good to see what others think. |
@daniel-smith Would welcome a PR on this front (with relevant tests). We do use TLS for the get_internals() struct, so it wouldn't be out of the question. It also wouldn't impact single_threaded performance as long as it's guarded WITH_THREADS like other TLS for get_internals() is. There also has been some refactoring wanted on the patients STLs to use unordered_maps or such for faster access in case of duplicates such as #1253 and this might be a good time to do both. |
@Skylion007 - ok thanks, I'll look into this. |
Tests both with and without the GIL held, and it uncovered an issue in pybind11 handling of std::string_view when the GIL is not held. BUG: pybind/pybind11#2765 The bug is that the string_view constructs an object which should be placed on life_support, but due to the underlying mechanism it is not retained in a multi-threaded environment. PiperOrigin-RevId: 394314149
Tests both with and without the GIL held, and it uncovered an issue in pybind11 handling of std::string_view when the GIL is not held. BUG: pybind/pybind11#2765 The bug is that the string_view constructs an object which should be placed on life_support, but due to the underlying mechanism it is not retained in a multi-threaded environment. PiperOrigin-RevId: 394314149
Tests both with and without the GIL held, and it uncovered an issue in pybind11 handling of std::string_view when the GIL is not held. BUG: pybind/pybind11#2765 The bug is that the string_view constructs an object which should be placed on life_support, but due to the underlying mechanism it is not retained in a multi-threaded environment. PiperOrigin-RevId: 394314149
Given that the This benchmark suggests a TLS lookup may be ~12 ns, for example. http://david-grs.github.io/tls_performance_overhead_cost_linux/ |
Tests both with and without the GIL held, and it uncovered an issue in pybind11 handling of std::string_view when the GIL is not held. BUG: pybind/pybind11#2765 The bug is that the string_view constructs an object which should be placed on life_support, but due to the underlying mechanism it is not retained in a multi-threaded environment. PiperOrigin-RevId: 394360131
I don't know much about threading in general, and this problem in particular, but if the current behavior is demonstrably incorrect (#2765 (comment)), it makes no sense at all to me to even think about relative runtime comparisons. The only meaningful comparison is between two correct implementations. My thinking:
|
@rwgk I think that sounds like a sensible approach. Does anyone see any reason to use Python TLS here rather than UPDATE:
|
The realization of thread_local is complex, generates awful code (try looking at this on godbolt.org), and has suffered from poor support on some platforms. Our reliance on Python TLS so far basically punts on the issue and relies on what is surely a working implementation underlying Python. It is also highly inadvisable to use It is much better to move initialization of this data structure to some static/global context (e.g. pybind11's |
I tried a similar thing in my workspace, and there were some cases where the python / c++ thread mapping appeared to be incorrect. I ended up with this for now:
|
I think that the thing to do is something more along the lines of a little internally linked list.
|
@laramiel - I wonder if you might be able to expand what you mean by this. It would be good to understand cases where a @wjakob - thanks for your comments on this. If there are some platforms where measurable inefficiencies have been observed, then avoiding |
@daniel-smith See my solution in the above commit. |
I disagree. The code is pretty concise when compared with the python thread api initialization, etc. I think that compiler writers have some motivation to improve it as well, so in my opinion it's preferable. |
* Fix thread safety for pybind11 loader_life_support Fixes issue: #2765 This converts the vector of PyObjects to either a single void* or a per-thread void* depending on the WITH_THREAD define. The new field is used by each thread to construct a stack of loader_life_support frames that can extend the life of python objects. The pointer is updated when the loader_life_support object is allocated (which happens before a call) as well as on release. Each loader_life_support maintains a set of PyObject references that need to be lifetime extended; this is done by storing them in a c++ std::unordered_set and clearing the references when the method completes. * Also update the internals version as the internal struct is no longer compatible * Add test demonstrating threading works correctly. It may be appropriate to run this under msan/tsan/etc. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update test to use lifetime-extended references rather than std::string_view, as that's a C++ 17 feature. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Make loader_life_support members private * Update version to dev2 * Update test to use python threading rather than concurrent.futures * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unnecessary env in test * Remove unnecessary pytest in test * Use native C++ thread_local in place of python per-thread data structures to retain compatability * clang-format test_thread.cpp * Add a note about debugging the py::cast() error * thread_test.py now propagates exceptions on join() calls. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove unused sys / merge * Update include order in test_thread.cpp * Remove spurious whitespace * Update comment / whitespace. * Address review comments * lint cleanup * Fix test IntStruct constructor. * Add explicit to constructor Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Aaron Gokaslan <skylion.aaron@gmail.com>
This revises the existing fix for pybind#2765 in pybind#3237 to reduce the amount of TLS storage used. The shared TLS key is stored in two different ways, depending on `PYBIND11_INTERNALS_VERSION`. If `PYBIND11_INTERNALS_VERSION == 4` (as is currently set), the TLS key is stored in the `internal::shared_data` map to avoid breaking ABI compatibility. If `PYBIND11_INTERNALS_VERSION > 4`, the TLS key is stored directly in the `internals` struct.
This revises the existing fix for pybind#2765 in pybind#3237 to reduce the amount of TLS storage used. The shared TLS key is stored in two different ways, depending on `PYBIND11_INTERNALS_VERSION`. If `PYBIND11_INTERNALS_VERSION == 4` (as is currently set), the TLS key is stored in the `internal::shared_data` map to avoid breaking ABI compatibility. If `PYBIND11_INTERNALS_VERSION > 4`, the TLS key is stored directly in the `internals` struct.
This revises the existing fix for pybind#2765 in pybind#3237 to reduce the amount of TLS storage used. The shared TLS key is stored in two different ways, depending on `PYBIND11_INTERNALS_VERSION`. If `PYBIND11_INTERNALS_VERSION == 4` (as is currently set), the TLS key is stored in the `internal::shared_data` map to avoid breaking ABI compatibility. If `PYBIND11_INTERNALS_VERSION > 4`, the TLS key is stored directly in the `internals` struct.
This revises the existing fix for pybind#2765 in pybind#3237 to reduce the amount of TLS storage used. The shared TLS key is stored in two different ways, depending on `PYBIND11_INTERNALS_VERSION`. If `PYBIND11_INTERNALS_VERSION == 4` (as is currently set), the TLS key is stored in the `internal::shared_data` map to avoid breaking ABI compatibility. If `PYBIND11_INTERNALS_VERSION > 4`, the TLS key is stored directly in the `internals` struct.
This revises the existing fix for pybind#2765 in pybind#3237 to reduce the amount of TLS storage used. The shared TLS key is stored in two different ways, depending on `PYBIND11_INTERNALS_VERSION`. If `PYBIND11_INTERNALS_VERSION == 4` (as is currently set), the TLS key is stored in the `internal::shared_data` map to avoid breaking ABI compatibility. If `PYBIND11_INTERNALS_VERSION > 4`, the TLS key is stored directly in the `internals` struct.
This revises the existing fix for pybind#2765 in pybind#3237 to reduce the amount of TLS storage used. The shared TLS key is stored in two different ways, depending on `PYBIND11_INTERNALS_VERSION`. If `PYBIND11_INTERNALS_VERSION == 4` (as is currently set), the TLS key is stored in the `internal::shared_data` map to avoid breaking ABI compatibility. If `PYBIND11_INTERNALS_VERSION > 4`, the TLS key is stored directly in the `internals` struct.
This revises the existing fix for pybind#2765 in pybind#3237 to reduce the amount of TLS storage used. The shared TLS key is stored in two different ways, depending on `PYBIND11_INTERNALS_VERSION`. If `PYBIND11_INTERNALS_VERSION == 4` (as is currently set), the TLS key is stored in the `internal::shared_data` map to avoid breaking ABI compatibility. If `PYBIND11_INTERNALS_VERSION > 4`, the TLS key is stored directly in the `internals` struct.
This revises the existing fix for pybind#2765 in pybind#3237 to reduce the amount of TLS storage used. The shared TLS key is stored in two different ways, depending on `PYBIND11_INTERNALS_VERSION`. If `PYBIND11_INTERNALS_VERSION == 4` (as is currently set), the TLS key is stored in the `internal::shared_data` map to avoid breaking ABI compatibility. If `PYBIND11_INTERNALS_VERSION > 4`, the TLS key is stored directly in the `internals` struct.
This revises the existing fix for pybind#2765 in pybind#3237 to reduce the amount of TLS storage used. The shared TLS key is stored in two different ways, depending on `PYBIND11_INTERNALS_VERSION`. If `PYBIND11_INTERNALS_VERSION == 4` (as is currently set), the TLS key is stored in the `internal::shared_data` map to avoid breaking ABI compatibility. If `PYBIND11_INTERNALS_VERSION > 4`, the TLS key is stored directly in the `internals` struct.
This revises the existing fix for pybind#2765 in pybind#3237 to reduce the amount of TLS storage used. The shared TLS key is stored in two different ways, depending on `PYBIND11_INTERNALS_VERSION`. If `PYBIND11_INTERNALS_VERSION == 4` (as is currently set), the TLS key is stored in the `internal::shared_data` map to avoid breaking ABI compatibility. If `PYBIND11_INTERNALS_VERSION > 4`, the TLS key is stored directly in the `internals` struct. Fix test_pytypes.py::test_issue2361 failure on PyPy3.7 Add github actions tests for unstable ABI
* Eliminate duplicate TLS keys for loader_life_support stack This revises the existing fix for #2765 in #3237 to reduce the amount of TLS storage used. The shared TLS key is stored in two different ways, depending on `PYBIND11_INTERNALS_VERSION`. If `PYBIND11_INTERNALS_VERSION == 4` (as is currently set), the TLS key is stored in the `internal::shared_data` map to avoid breaking ABI compatibility. If `PYBIND11_INTERNALS_VERSION > 4`, the TLS key is stored directly in the `internals` struct. * Fix test_pytypes.py::test_issue2361 failure on PyPy3.7 * Add github actions tests for unstable ABI
Before this change, JAX produces HLO using the XLA:Python builder APIs. After this change JAX produces MHLO using MLIR:Python APIs, and converts the MHLO to HLO for compilation with XLA. This is a lateral shift that should have little immediate impact, but unlocks a number of interesting opportunities in the future (e.g., mixing MLIR dialects within a JAX program). [XLA:Python] Pass MLIR input as a std::string to work around pybind/pybind11#2765. A better fix would be to update pybind11 but that is hitting Windows-related hurdles; for now, just avoid relying on reference lifetime extension. Brax: update test seeds to avoid test failures. Additional constant folding (canonicalization) in the MHLO lowering path seems to cause small numerical differences. PiperOrigin-RevId: 413922787
Before this change, JAX produces HLO using the XLA:Python builder APIs. After this change JAX produces MHLO using MLIR:Python APIs, and converts the MHLO to HLO for compilation with XLA. This is a lateral shift that should have little immediate impact, but unlocks a number of interesting opportunities in the future (e.g., mixing MLIR dialects within a JAX program). [XLA:Python] Pass MLIR input as a std::string to work around pybind/pybind11#2765. A better fix would be to update pybind11 but that is hitting Windows-related hurdles; for now, just avoid relying on reference lifetime extension. Brax: update test seeds to avoid test failures. Additional constant folding (canonicalization) in the MHLO lowering path seems to cause small numerical differences. PiperOrigin-RevId: 420755696
Before this change, JAX produces HLO using the XLA:Python builder APIs. After this change JAX produces MHLO using MLIR:Python APIs, and converts the MHLO to HLO for compilation with XLA. This is a lateral shift that should have little immediate impact, but unlocks a number of interesting opportunities in the future (e.g., mixing MLIR dialects within a JAX program). [XLA:Python] Pass MLIR input as a std::string to work around pybind/pybind11#2765. A better fix would be to update pybind11 but that is hitting Windows-related hurdles; for now, just avoid relying on reference lifetime extension. Brax: update test seeds to avoid test failures. Additional constant folding (canonicalization) in the MHLO lowering path seems to cause small numerical differences. PiperOrigin-RevId: 420755696 Change-Id: I5e2626ea1e82c046a847300bf6bbe94208007802
in particular use of
loader_life_support
seems problematic. This function keeps state in the globalloader_patient_stack
:pybind11/include/pybind11/detail/internals.h
Line 105 in b7dfe5c
loader_patient_stack
can become interleaved.While releasing the GIL in a conversion seems like a bad idea, it can be surprisingly hard to avoid. Any cast that results in the bytecode interpreter being run (either directly or through something like a
__del__
method) can cause python itself to release GIL (sys.setswitchinterval
).The text was updated successfully, but these errors were encountered: