
[BUG] pybind11::cast is not thread safe #2765

Open
zachariahreed opened this issue Dec 31, 2020 · 21 comments
Labels
bug casters Related to casters, and to be taken into account when analyzing/reworking casters help wanted

Comments

@zachariahreed

in particular use of loader_life_support seems problematic. This function keeps state in the global loader_patient_stack:

std::vector<PyObject *> loader_patient_stack; // Used by `loader_life_support`
In the case where a cast can release the GIL, two threads' pushes and pops of loader_patient_stack can become interleaved.

While releasing the GIL in a conversion seems like a bad idea, it can be surprisingly hard to avoid. Any cast that results in the bytecode interpreter being run (either directly or through something like a __del__ method) can cause Python itself to release the GIL (see sys.setswitchinterval).

@zachariahreed zachariahreed changed the title [BUG] pybind11:cast is not thread safe [BUG] pybind11::cast is not thread safe Dec 31, 2020
@YannickJadoul
Collaborator

@zachariahreed, that's interesting. Do you have concrete code that manages to show this, or how did you run into this?

I was about to shout "But of course you should hold the GIL", but then you already thought of that.
However, by the time the interpreter stops running and hands back control to pybind11, I'd expect the GIL to be held again? Do you mind explaining slightly more in depth how the GIL can be released without pybind11 knowing about it?

@zachariahreed
Author

zachariahreed commented Dec 31, 2020

The GIL is held when you get control back. The problem comes when another thread with pybind11-wrapped code gets scheduled during the time when you don't have control. Then you end up in a situation like:

Thread1          Thread2
--------------   --------------
Push             ...
  Cast           ...
    ByteCode     ...
       Yield     ...
         ...     Push
         ...       ByteCode
         ...         Yield
Pop -> Boom      ...

The specific code that triggered this was use of pybind11::module::import inside a custom caster. Much of the import machinery is written in pure Python these days, and it's complex enough that it can certainly run longer than the multiplexing period.
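The interleaving above can be made concrete with a toy model (Python here purely for illustration; the real stack is a C++ std::vector in pybind11's internals, with a nullptr marker pushed per frame — enter_frame, add_patient, and exit_frame are hypothetical names standing in for the loader_life_support constructor, add_patient, and destructor):

```python
# Shared stack, as in the real (pre-fix) internals: one list for all
# threads, with None marking the start of each life-support frame.
stack = []

def enter_frame():              # loader_life_support constructor
    stack.append(None)

def add_patient(obj):           # loader_life_support::add_patient
    stack.append(obj)

def exit_frame():               # destructor: release this frame's patients
    released = []
    while stack[-1] is not None:
        released.append(stack.pop())
    stack.pop()                 # remove the frame marker
    return released

# The interleaving from the diagram: T1 enters a frame, yields while the
# GIL is released, T2 enters a frame and adds a patient, then T1 resumes
# and pops first -- destroying T2's patient, not its own frame.
enter_frame()                   # T1: Push
enter_frame()                   # T2: Push (scheduled during T1's yield)
add_patient("t2_temporary")     # T2: implicit-conversion temporary
released = exit_frame()         # T1: Pop -> Boom
assert released == ["t2_temporary"]   # T2's patient died under it
assert stack == [None]                # T2's frame marker is all that's left
```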

@YannickJadoul
Collaborator

Argh, right, yes. So rather than popping, we should remember which pointer was added?

@zachariahreed
Author

I don't think that's sufficient. During the time between when thread 1 gets control back and before the pop, if additional temporaries are created they'll end up in thread2's "arena". The flow control is not linear, so I don't think there's any way a vector-based strategy can work. I looked very briefly at moving this into thread-local storage, but I was a bit concerned about performance as this seems to be a hot path.

@YannickJadoul
Collaborator

You're probably right... To the best of your knowledge right now (not asking you to dive into pybind11), is it just loader_patient_stack (which could get a reasonably quick fix, if we have a way of trying this out) or should we do a full investigation of what else can go wrong?

@zachariahreed
Author

I don't fully understand the details of the casting implementation / when temporaries are created. But, as far as I can tell, this seems to be the only piece of global / singleton data mutated. I think it's probably just loader_patient_stack.

@jokervTv

So, does this mean I cannot use OpenMP or Intel TBB?

@daniel-smith

daniel-smith commented Aug 27, 2021

I believe I am experiencing the same issue, but via a slightly different mechanism. In my case, I am observing it via temporaries stored in the loader_patient_stack by the argument_loader, specifically for types where py::implicitly_convertible is used and where arguments of those types are passed by reference to the bound function. Argument objects for which an implicit conversion is performed are stored in this stack to keep them alive.

// Perform an implicit conversion
if (convert) {
    for (auto &converter : typeinfo->implicit_conversions) {
        auto temp = reinterpret_steal<object>(converter(src.ptr(), typeinfo->type));
        if (load_impl<ThisT>(temp, false)) {
            loader_life_support::add_patient(temp);

If calling such a function concurrently from multiple threads, where that function releases the GIL, I am intermittently observing the argument reference becoming "invalidated", i.e. showing garbage at some point after the GIL is released. The way I was able to observe this was by copying the argument into a local variable prior to releasing the GIL, then comparing the copy to the argument once the GIL is reacquired. I'm able to reproduce this frequently with two threads calling the same function concurrently within a loop.

Capturing and inspecting items from the loader_patient_stack before and after the GIL is released shows that the current thread is basically observing the pushing/popping of the stack by other threads. The upshot is that if we have an argument passed by reference that is backed by an item in the loader_patient_stack, that memory can be wiped out by another thread while the GIL is released - rendering the passed-by-reference argument invalid.

So in summary, it appears not to be thread-safe to pass arguments by reference where those arguments have implicit conversions specified - due to the aforementioned thread-safety issues of loader_patient_stack.

@daniel-smith

OK, so here is a small example that reproduces the above issue. I used the cmake_example project as a starting point, hence the naming.

pybind11 module:

#include <pybind11/pybind11.h>

#include <chrono>
#include <stdexcept>
#include <thread>

namespace py = pybind11;
using namespace std::chrono_literals;

struct Bar
{
    int value;
};

void foo(const Bar& bar)
{
    const auto copy = bar;

    {
        py::gil_scoped_release release;

        // Simulate long running task.
        std::this_thread::sleep_for(100ms);
    }

    if (bar.value != copy.value)
    {
        throw std::runtime_error("Reference changed!!");
    }
}

PYBIND11_MODULE(cmake_example, m)
{
    py::class_<Bar>(m, "Bar")
        .def(py::init([](const int i) { return Bar{i}; }));

    py::implicitly_convertible<int, Bar>();

    m.def("foo", &foo);
}

Python test:

import cmake_example as c

import pytest
from concurrent.futures.thread import ThreadPoolExecutor


@pytest.mark.parametrize('repeat', range(10))
def test(repeat):
    thread_count = 2
    call_count = 10
    values = [i for i in range(call_count)]

    with ThreadPoolExecutor(max_workers=thread_count) as executor:
        results = [executor.submit(c.foo, i) for i in values]

    # Materialise the results - exception will be thrown from 'foo' when reference is invalid.
    results = list(r.result() for r in results)

Example results:

========================================================================= short test summary info =========================================================================
FAILED main.py::test[1] - RuntimeError: Reference changed!!
FAILED main.py::test[3] - RuntimeError: Reference changed!!
FAILED main.py::test[4] - RuntimeError: Reference changed!!
FAILED main.py::test[5] - RuntimeError: Reference changed!!
FAILED main.py::test[7] - RuntimeError: Reference changed!!
FAILED main.py::test[8] - RuntimeError: Reference changed!!
======================================================================= 6 failed, 4 passed in 5.27s =======================================================================

@Skylion007 Skylion007 added casters Related to casters, and to be taken into account when analyzing/reworking casters help wanted labels Aug 28, 2021
@daniel-smith

daniel-smith commented Aug 30, 2021

I looked very briefly at moving this into thread-local storage, but I was a bit concerned about performance as this seems to be a hot path.

I can confirm that using thread-local storage (TLS) is indeed a solution that will solve this problem. @zachariahreed, what specifically around performance were you concerned about? As I understand, there is some concern around TLS overhead when used in a dynamic module, but I'm wondering whether this is anything to be particularly concerned about here. Some benchmarks (see https://testbit.eu/2015/thread-local-storage-benchmark) suggest very minimal overhead on modern platforms. Obviously it would be prudent to run benchmarks for this particular case, but I'm curious what others' thoughts may be around this.

From my brief look into this, using a thread_local variable to replace the loader_patient_stack from the "internals" would probably involve the least amount of change, e.g. something like the following in the loader_life_support class:

static std::vector<PyObject*>& loader_patient_stack()
{
    thread_local std::vector<PyObject*> stack;
    return stack;
}

v2.7.1...daniel-smith:bugfix/loader_life_support

Another approach might be to have an instance of a PyObject* list per call - wrapped in some kind of "context" that gets passed through with each function call here:

// 6. Call the function.
try {
loader_life_support guard{};
result = func.impl(call);

That would avoid the use of TLS - but would also involve an arguably unfeasible amount of refactoring - as the instance would also need to make its way through to the type casters. Not sure if anyone else has any opinions on either of these approaches, or any alternative approaches. I personally feel that TLS seems appropriate, but would be good to see what others think.
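The per-call "context" alternative can be sketched like this (a toy Python model, purely for illustration; CallContext, dispatch, and convert are hypothetical names — the real change would have to thread such an object through the C++ dispatcher and every type caster):

```python
# Each call owns its own patient list, passed explicitly to the casters
# instead of being found via a global/shared stack.

class CallContext:
    def __init__(self):
        self.patients = []

    def add_patient(self, obj):
        self.patients.append(obj)

def convert(arg, ctx):
    # Stand-in for a caster that creates a temporary needing life support.
    temp = ("converted", arg)
    ctx.add_patient(temp)
    return temp

def dispatch(func, *args):
    ctx = CallContext()          # one life-support frame per call
    try:
        converted = [convert(a, ctx) for a in args]
        return func(*converted)
    finally:
        ctx.patients.clear()     # patients released when the call returns

# Two concurrent calls would each get their own context, so there is
# nothing shared to interleave.
result = dispatch(lambda a, b: a[1] + b[1], 1, 2)
assert result == 3
```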

@Skylion007
Collaborator

@daniel-smith Would welcome a PR on this front (with relevant tests). We do use TLS for the get_internals() struct, so it wouldn't be out of the question. It also wouldn't impact single-threaded performance as long as it's guarded by WITH_THREAD like the other TLS for get_internals() is.

There has also been some refactoring wanted on the patients containers, to use unordered_maps or similar for faster access in the case of duplicates (see #1253), and this might be a good time to do both.

@daniel-smith

@Skylion007 - ok thanks, I'll look into this.

copybara-service bot pushed a commit to pybind/pybind11_protobuf that referenced this issue Sep 1, 2021
Tests both with and without the GIL held, and it uncovered an issue in
pybind11 handling of std::string_view when the GIL is not held.

BUG: pybind/pybind11#2765

The bug is that the string_view constructs an object which should be placed on life_support, but due to the underlying mechanism it is not retained in a multi-threaded environment.

PiperOrigin-RevId: 394314149
@jbms
Contributor

jbms commented Sep 2, 2021

Given that the dispatcher function that uses loader_life_support looks to be quite expensive already (lots of memory allocation, Python API calls, etc.), not to mention the cost of all of the individual type_casters, I would be extremely surprised if a single TLS lookup adds any noticeable overhead at all.

This benchmark suggests a TLS lookup may be ~12 ns, for example.

http://david-grs.github.io/tls_performance_overhead_cost_linux/

@rwgk
Collaborator

rwgk commented Sep 2, 2021

I don't know much about threading in general, and this problem in particular, but if the current behavior is demonstrably incorrect (#2765 (comment)), it makes no sense at all to me to even think about relative runtime comparisons. The only meaningful comparison is between two correct implementations.

My thinking:

  • Rename internals::loader_patient_stack to deprecated_threading_incompatible_loader_patient_stack (or similar). But do not delete it to not break ABI compatibility just for this. (We can remove it later when we're breaking compatibility anyway for other reasons.)
  • Add the suggested (#2765 (comment)) thread_local function, maybe with the name thread_local_loader_patient_stack for clarity.
  • Be happy.
  • Wait for someone to care enough to send a PR with a faster implementation. (My bet: it will never happen.)

@daniel-smith

daniel-smith commented Sep 2, 2021

@rwgk I think that sounds like a sensible approach. Does anyone see any reason to use Python TLS here rather than thread_local? I note that the internals struct is currently using Python TLS for tstate, which is why I ask.

UPDATE:
Depending on the answer to this question, I'm thinking that the function could be a static method on the loader_life_support class that also takes WITH_THREAD into account, e.g.

private:
    static std::vector<PyObject *> &loader_patient_stack() {
#if defined(WITH_THREAD)
        thread_local auto stack = std::vector<PyObject *>{};
#else
        static auto stack = std::vector<PyObject *>{};
#endif
        return stack;

@wjakob
Member

wjakob commented Sep 2, 2021

The realization of thread_local is complex, generates awful code (try looking at this on godbolt.org), and has suffered from poor support on some platforms. Our reliance on Python TLS so far basically punts on the issue and relies on what is surely a working implementation underlying Python. It is also highly inadvisable to use static variables (esp. when they have a C++ constructor) within methods -- same story, look at godbolt.org to see what happens.

It is much better to move initialization of this data structure to some static/global context (e.g. pybind11's internals data structure), thus implicitly relying on GIL for exclusive access, and being able to sidestep the expensive locking / TLS prelude and initialization checks.

@laramiel
Contributor

laramiel commented Sep 2, 2021

I tried a similar thing in my workspace, and there were some cases where the Python / C++ thread mapping appeared to be incorrect. I ended up with this for now:

diff -u -r a/include/pybind11/detail/internals.h b/include/pybind11/detail/internals.h
--- a/include/pybind11/detail/internals.h
+++ b/include/pybind11/detail/internals.h
@@ -107,13 +107,15 @@
     std::unordered_map<const PyObject *, std::vector<PyObject *>> patients;
     std::forward_list<ExceptionTranslator> registered_exception_translators;
     std::unordered_map<std::string, void *> shared_data; // Custom data to be shared across extensions
-    std::vector<PyObject *> loader_patient_stack; // Used by `loader_life_support`
     std::forward_list<std::string> static_strings; // Stores the std::strings backing detail::c_str()
     PyTypeObject *static_property_type;
     PyTypeObject *default_metaclass;
     PyObject *instance_base;
-#if defined(WITH_THREAD)
+#if !defined(WITH_THREAD)
+    std::vector<PyObject *> loader_patient_stack;
+#else  // defined(WITH_THREAD)
     PYBIND11_TLS_KEY_INIT(tstate);
+    PYBIND11_TLS_KEY_INIT(patient_stack_key);
     PyInterpreterState *istate = nullptr;
     ~internals() {
         // This destructor is called *after* Py_Finalize() in finalize_interpreter().
@@ -124,6 +126,7 @@
         // of those have anything to do with CPython internals.
         // PyMem_RawFree *requires* that the `tstate` be allocated with the CPython allocator.
         PYBIND11_TLS_FREE(tstate);
+        PYBIND11_TLS_FREE(patient_stack_key);
     }
 #endif
 };
@@ -301,6 +304,7 @@
         #if PY_VERSION_HEX < 0x03090000
                 PyEval_InitThreads();
         #endif
+
         PyThreadState *tstate = PyThreadState_Get();
         #if PY_VERSION_HEX >= 0x03070000
             internals_ptr->tstate = PyThread_tss_alloc();
@@ -314,7 +318,27 @@
             PyThread_set_key_value(internals_ptr->tstate, tstate);
         #endif
         internals_ptr->istate = tstate->interp;
+
+        // begin local modification:
+#if PY_VERSION_HEX >= 0x03070000
+        internals_ptr->patient_stack_key = PyThread_tss_alloc();
+        if (!internals_ptr->patient_stack_key ||
+            (PyThread_tss_create(internals_ptr->patient_stack_key) != 0))
+          pybind11_fail(
+              "get_internals: could not successfully initialize the "
+              "patient_stack TLS key!");
+#else
+        internals_ptr->patient_stack_key = PyThread_create_key();
+        if (internals_ptr->patient_stack_key == -1) {
+          pybind11_fail(
+              "get_internals: could not successfully initialize the "
+              "patient_stack TLS key!");
+        }
 #endif
+        // end local modification:
+
+#endif  // defined(WITH_THREAD)
+
         builtins[id] = capsule(internals_pp);
         internals_ptr->registered_exception_translators.push_front(&translate_exception);
         internals_ptr->static_property_type = make_static_property_type();
@@ -342,6 +366,28 @@
   return locals;
 }
 
+// begin local modification:
+inline std::vector<PyObject *> &get_loader_patient_stack() {
+#if !defined(WITH_THREAD)
+  return get_internals().loader_patient_stack;
+#else
+  auto &internals = get_internals();
+  assert(internals.patient_stack_key != 0);
+  auto *patient_stack = PYBIND11_TLS_GET_VALUE(internals.patient_stack_key);
+  if (patient_stack) {
+    return *static_cast<std::vector<PyObject *> *>(patient_stack);
+  }
+  // This leaks a vector allocation for each thread; is there a cleanup mechanism?
+  auto *stack = new std::vector<PyObject *>();
+#if PY_VERSION_HEX >= 0x03070000
+  PyThread_tss_set(internals.patient_stack_key, stack);
+#else
+  PyThread_set_key_value(internals.patient_stack_key, stack);
+#endif
+  return *stack;
+#endif
+}
+// end local modification:
 
 /// Constructs a std::string with the given arguments, stores it in `internals`, and returns its
 /// `c_str()`.  Such strings objects have a long storage duration -- the internal strings are only


diff -u -r a/include/pybind11/detail/internals.h b/include/pybind11/detail/internals.h
--- a/include/pybind11/detail/type_caster_base.h
+++ b/include/pybind11/detail/type_caster_base.h
@@ -34,12 +34,12 @@
 public:
     /// A new patient frame is created when a function is entered
     loader_life_support() {
-        get_internals().loader_patient_stack.push_back(nullptr);
+        get_loader_patient_stack().push_back(nullptr);
     }
 
     /// ... and destroyed after it returns
     ~loader_life_support() {
-        auto &stack = get_internals().loader_patient_stack;
+        auto &stack = get_loader_patient_stack();
         if (stack.empty())
             pybind11_fail("loader_life_support: internal error");
 
@@ -55,7 +55,7 @@
     /// This can only be used inside a pybind11-bound function, either by `argument_loader`
     /// at argument preparation time or by `py::cast()` at execution time.
     PYBIND11_NOINLINE static void add_patient(handle h) {
-        auto &stack = get_internals().loader_patient_stack;
+        auto &stack = get_loader_patient_stack();
         if (stack.empty())
             throw cast_error("When called outside a bound function, py::cast() cannot "
                              "do Python -> C++ conversions which require the creation "

@laramiel
Contributor

laramiel commented Sep 2, 2021

I think that the thing to do is something more along the lines of a little internally linked list.

loader_life_support::loader_life_support() would allocate the link, store it in thread-local storage, and save the parent pointer.

loader_life_support::~loader_life_support() would deallocate the link and restore the parent pointer.

loader_life_support::add_patient would get the current head of the linked list and add a PyObject to its keep-alive set.

@daniel-smith

daniel-smith commented Sep 2, 2021

there were some cases where the python / c++ thread mapping appeared to be incorrect

@laramiel - I wonder if you might be able to expand what you mean by this. It would be good to understand cases where a thread_local implementation would actually fail. I also had something very similar to what you posted above (using Python TLS) and the leaked vector allocations per thread was one of the reasons why I considered thread_local to be a simpler solution (i.e. rely on the compiler). Your linked list idea sounds interesting - do you have a small example of what this might look like? Are you saying that the linked list would live in the internals struct?

@wjakob - thanks for your comments on this. If there are some platforms where measurable inefficiencies have been observed, then avoiding thread_local makes sense. My thinking was along the same lines as @jbms (see #2765 (comment)), i.e. given everything else that is going on in the dispatcher, any inefficiencies are, in practice, not likely to make a real difference. I also feel if people have something truly performance critical, where compiler generated code matters - then my guess is that Python interop with C++ probably wouldn't be the first choice. Also, as @rwgk mentioned - the current implementation is not thread-safe and so we have nothing to compare any thread-safe implementation with at the moment. All of that being said, I'm not a pybind11 expert and I'm happy to go with the consensus on this.

@laramiel
Contributor

laramiel commented Sep 2, 2021

@daniel-smith See my solution in the above commit.

laramiel@53b3919

@laramiel
Contributor

laramiel commented Sep 4, 2021

The realization of thread_local is complex, generates awful code (try looking at this on godbolt.org).

I disagree. The code is pretty concise when compared with the Python thread API initialization, etc. I think that compiler writers have some motivation to improve it as well, so in my opinion it's preferable.

See: https://godbolt.org/z/K57oj9M4P

Skylion007 added a commit that referenced this issue Sep 10, 2021
* Fix thread safety for pybind11 loader_life_support

Fixes issue: #2765

This converts the vector of PyObjects to either a single void* or
a per-thread void* depending on the WITH_THREAD define.

The new field is used by each thread to construct a stack
of loader_life_support frames that can extend the life of python
objects.

The pointer is updated when the loader_life_support object is allocated
(which happens before a call) as well as on release.

Each loader_life_support maintains a set of PyObject references
that need to be lifetime extended; this is done by storing them
in a c++ std::unordered_set and clearing the references when the
method completes.

* Also update the internals version as the internal struct is no longer compatible

* Add test demonstrating threading works correctly.

It may be appropriate to run this under msan/tsan/etc.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update test to use lifetime-extended references rather than
std::string_view, as that's a C++17 feature.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Make loader_life_support members private

* Update version to dev2

* Update test to use python threading rather than concurrent.futures

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unnecessary env in test

* Remove unnecessary pytest in test

* Use native C++ thread_local in place of Python per-thread data structures to retain compatibility

* clang-format test_thread.cpp

* Add a note about debugging the py::cast() error

* thread_test.py now propagates exceptions on join() calls.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove unused sys / merge

* Update include order in test_thread.cpp

* Remove spurious whitespace

* Update comment / whitespace.

* Address review comments

* lint cleanup

* Fix test IntStruct constructor.

* Add explicit to constructor

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Aaron Gokaslan <skylion.aaron@gmail.com>
jbms added a commit to jbms/pybind11 that referenced this issue Sep 16, 2021
This revises the existing fix for
pybind#2765 in
pybind#3237 to reduce the amount of
TLS storage used.

The shared TLS key is stored in two different ways, depending on
`PYBIND11_INTERNALS_VERSION`.  If `PYBIND11_INTERNALS_VERSION ==
4` (as is currently set), the TLS key is stored in the
`internal::shared_data` map to avoid breaking ABI compatibility.  If
`PYBIND11_INTERNALS_VERSION > 4`, the TLS key is stored directly in
the `internals` struct.
rwgk pushed a commit that referenced this issue Sep 20, 2021
* Eliminate duplicate TLS keys for loader_life_support stack

This revises the existing fix for
#2765 in
#3237 to reduce the amount of
TLS storage used.

The shared TLS key is stored in two different ways, depending on
`PYBIND11_INTERNALS_VERSION`.  If `PYBIND11_INTERNALS_VERSION ==
4` (as is currently set), the TLS key is stored in the
`internal::shared_data` map to avoid breaking ABI compatibility.  If
`PYBIND11_INTERNALS_VERSION > 4`, the TLS key is stored directly in
the `internals` struct.

* Fix test_pytypes.py::test_issue2361 failure on PyPy3.7

* Add github actions tests for unstable ABI
copybara-service bot pushed a commit to jax-ml/jax that referenced this issue Jan 10, 2022
Before this change, JAX produces HLO using the XLA:Python builder APIs. After this change JAX produces MHLO using MLIR:Python APIs, and converts the MHLO to HLO for compilation with XLA. This is a lateral shift that should have little immediate impact, but unlocks a number of interesting opportunities in the future (e.g., mixing MLIR dialects within a JAX program).

[XLA:Python] Pass MLIR input as a std::string to work around pybind/pybind11#2765. A better fix would be to update pybind11 but that is hitting Windows-related hurdles; for now, just avoid relying on reference lifetime extension.

Brax: update test seeds to avoid test failures. Additional constant folding (canonicalization) in the MHLO lowering path seems to cause small numerical differences.
PiperOrigin-RevId: 413922787