
[BUG]: Race on inst->state between nb_type_get / nb_type_put_common under free-threading #867

Closed
vfdev-5 opened this issue Jan 20, 2025 · 10 comments


@vfdev-5
Contributor

vfdev-5 commented Jan 20, 2025

Problem description

We see a few data races when accessing a class attribute of another class attribute:

pshard.sharding_spec.sharding

The first race we saw was fixed by #865.

@wjakob how would you like this second race to be fixed?

The full TSAN report and code reproducer are here: https://gist.github.com/vfdev-5/d46bcdc73231bd1e2b2f85c40de9f890

Another, similar TSAN report by @hawkinsp: https://gist.github.com/hawkinsp/5f3997c72c2c6781c1d8a90225f3eddd

Reproducible example code

Python code: https://gist.github.com/vfdev-5/d46bcdc73231bd1e2b2f85c40de9f890#file-check-py
C++ bindings: https://gist.github.com/vfdev-5/d46bcdc73231bd1e2b2f85c40de9f890#file-binding-cpp
CMakeLists: https://gist.github.com/vfdev-5/d46bcdc73231bd1e2b2f85c40de9f890#file-cmakelists-txt
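
For context, here is a minimal, hedged sketch of the kind of binding involved (class and member names such as `PShard`/`ShardingSpec`/`Sharding` are illustrative assumptions, not the gist's actual code): a nested attribute chain returned by reference, so that concurrent attribute access forces concurrent wrapper lookups/creations for the same C++ pointer.

```cpp
// Hypothetical sketch only: names are illustrative, not taken from the gist.
// The point is a nested attribute returned by reference, so each access must
// look up or create a Python wrapper for the same C++ pointer, possibly from
// several threads at once.
#include <nanobind/nanobind.h>

namespace nb = nanobind;

struct Sharding     { int device_count = 1; };
struct ShardingSpec { Sharding sharding; };
struct PShard       { ShardingSpec sharding_spec; };

NB_MODULE(binding_sketch, m) {
    nb::class_<Sharding>(m, "Sharding")
        .def_ro("device_count", &Sharding::device_count);

    nb::class_<ShardingSpec>(m, "ShardingSpec")
        .def_prop_ro("sharding",
                     [](ShardingSpec &s) -> Sharding & { return s.sharding; },
                     nb::rv_policy::reference_internal);

    nb::class_<PShard>(m, "PShard")
        .def(nb::init<>())
        .def_prop_ro("sharding_spec",
                     [](PShard &p) -> ShardingSpec & { return p.sharding_spec; },
                     nb::rv_policy::reference_internal);
}
```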
@wjakob
Owner

wjakob commented Jan 20, 2025

I suspect the following: inst_new_ext (which creates a Python object wrapping an existing C++ instance) currently acquires a lock on the shard containing this object. However, some uses of this API continue to modify the associated Python object after its creation. In the meantime, another thread accessing that same pointer can look up the partially created object from the hash table.

What I think needs to change is that locking the shard becomes the responsibility of the caller of inst_new_ext rather than having the function do it. inst_new_int is unaffected, since there we don't know the associated pointer address ahead of time.

This race condition aside, there is also another, more fundamental race that will need to be figured out: if two threads simultaneously access the same C++ pointer, for which no Python object exists yet, there are two possible outcomes.

Case 1:
Thread 1: no Python object found; starts to create one.
Thread 2: finds the Python object created by Thread 1 and returns it.

Case 2:
Thread 1: no Python object found; starts to create one.
Thread 2: no Python object found; starts to create one.

The second case sounds like it could be a problem and would be nice to avoid.
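
To illustrate the two cases, here is a hedged, self-contained sketch of the general lookup-then-create pattern (toy types and names, not nanobind's actual internals):

```cpp
// Illustrative sketch of the lookup-then-create race, not nanobind's actual
// internals. If the lookup and the creation are not covered by one critical
// section, two threads can both miss and both create a wrapper (Case 2).
#include <mutex>
#include <unordered_map>

struct Wrapper { void *cpp_ptr; };

static std::mutex shard_mutex;
static std::unordered_map<void *, Wrapper *> c2p;  // C++ pointer -> wrapper

Wrapper *lookup_or_create(void *ptr) {
    {
        std::lock_guard<std::mutex> guard(shard_mutex);
        auto it = c2p.find(ptr);
        if (it != c2p.end())
            return it->second;  // Case 1: reuse the wrapper made by the other thread
    }
    // Lock released: a second thread can reach this point with the same
    // pointer and also construct a wrapper, yielding two Python objects for
    // one C++ instance (Case 2).
    Wrapper *obj = new Wrapper{ptr};
    std::lock_guard<std::mutex> guard(shard_mutex);
    c2p[ptr] = obj;  // last writer wins; the earlier wrapper becomes a duplicate
    return obj;
}
```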

@wjakob
Owner

wjakob commented Jan 20, 2025

Is it easy for you to reproduce these? If I send you a patch, can you tell me if it fixes them?

@vfdev-5
Contributor Author

vfdev-5 commented Jan 20, 2025

Yes, it is easy to reproduce the races; please send the patch.

By the way, here is the test case that produces the above TSAN report: vfdev-5#1

Build commands:

CC=clang-18 CXX=clang++-18 cmake -S . -B build -DNB_TEST_FREE_THREADED=ON -DCMAKE_BUILD_TYPE=DEBUG -DNB_TEST_SANITIZERS_TSAN=ON 

cmake --build build/ -j 8

Run command:

PYTHONPATH=./build/tests python3.14t -m pytest -s tests/test_thread.py -k test07_access_attributes &> test07_access_attributes.log

vfdev-5 added a commit to vfdev-5/nanobind that referenced this issue Jan 20, 2025
vfdev-5 added a commit to vfdev-5/nanobind that referenced this issue Jan 20, 2025
@wjakob
Owner

wjakob commented Jan 21, 2025

Thanks. I will not get to it this week, sorry (I have an important deadline).

@wjakob
Owner

wjakob commented Jan 27, 2025

It would be useful to discuss this problem before developing a fix. Also adding @oremanj and @hawkinsp for feedback.

To recap, the issue reported here is a race condition in which multiple threads want to access the Python object associated with the same C++ instance, which does not exist yet and therefore must be created. The critical section protecting the newly created object isn't big enough, which causes a race when some of its fields are accessed. This particular problem is easily fixed by making the critical section a tiny bit larger.

A more significant issue then pops up: ultimately, what is the expected behavior in this case? For example, is it OK if the cast to a Python object potentially creates multiple independent Python instances in the case of significant concurrency? I think that this is weird, and it would be better to maintain an invariant that each C++ instance can only have one valid Python object associated with it at any time.

However, ensuring that this works as expected would require making the critical section much larger. It would be so large that I'm a little worried about the consequences. Right now, it only protects access to the shard data structure storing C++ instance data.

To ensure the "sane" behavior stated above, we would need to (see the sketch after this list):

  1. Lock the shard.
  2. Check whether an object already exists (unlock and return early if found).
  3. Perform a lookup into the C++ type map (nb_inst_c2p), which involves its own critical section.
  4. Analyze the return value policy to decide whether we want to copy/move (inst_new_int) or wrap the existing instance (inst_new_ext).
  5. In the former case, drop the shard lock and acquire the one associated with the copied/moved C++ instance.
  6. In the latter case, continue using the existing shard lock. The function inst_new_ext calls PyType_GenericAlloc internally, which AFAIK might cause the garbage collector to kick in. If so, arbitrary Python/C++ code might run at this point.
  7. Fill the fields of the nb_inst instance.
  8. Do the work involving lifetime management (keep_shared_from_this_alive, keep_alive(), set_self_py for intrusively managed reference counting). Some of these also involve locking.
  9. Unlock the shard lock and return the object.
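
Below is a compilable but heavily simplified sketch of what such an enlarged critical section might look like; every type and helper in it (Shard, shard_of, find_existing, inst_wrap, ...) is invented for illustration and does not mirror nanobind's real data structures.

```cpp
// Simplified sketch of the enlarged critical section outlined in the list
// above. All names are invented; this is not nanobind's actual code.
#include <cstdint>
#include <mutex>
#include <unordered_map>

enum class rv_policy { copy, move, reference, reference_internal };

struct PyObjectStub { void *cpp_ptr; };

struct Shard {
    std::mutex mutex;
    std::unordered_map<void *, PyObjectStub *> c2p;
};

static Shard shards[8];
static Shard &shard_of(void *p) {
    return shards[(reinterpret_cast<std::uintptr_t>(p) >> 4) & 7];
}

// Stubs standing in for real work (type lookup, allocation, keep_alive, ...).
static PyObjectStub *find_existing(Shard &s, void *p) {
    auto it = s.c2p.find(p);
    return it == s.c2p.end() ? nullptr : it->second;
}
static void *copy_or_move(void *p) { return p; }  // placeholder
static PyObjectStub *inst_wrap(void *p) { return new PyObjectStub{p}; }
static void lifetime_management(PyObjectStub *) {}

PyObjectStub *cast_with_big_lock(void *ptr, rv_policy policy) {
    Shard *shard = &shard_of(ptr);
    shard->mutex.lock();                              // (1) lock the shard
    if (PyObjectStub *existing = find_existing(*shard, ptr)) {
        shard->mutex.unlock();                        // (2) unlock + early return
        return existing;
    }
    // (3) the nb_inst_c2p type lookup is omitted; it has its own critical section.
    if (policy == rv_policy::copy || policy == rv_policy::move) {
        // (4)/(5) the copied/moved instance lives at a new address, so switch
        // to the shard responsible for that address.
        void *moved = copy_or_move(ptr);
        shard->mutex.unlock();
        shard = &shard_of(moved);
        shard->mutex.lock();
        ptr = moved;
    }
    // (6) for reference semantics the wrapper is allocated while the shard lock
    // is held; in real nanobind, PyType_GenericAlloc could trigger the GC here,
    // i.e. arbitrary Python/C++ code may run under the lock.
    PyObjectStub *obj = inst_wrap(ptr);               // (7) fill the nb_inst fields
    lifetime_management(obj);                         // (8) keep_alive etc., may lock
    shard->c2p.emplace(ptr, obj);
    shard->mutex.unlock();                            // (9) unlock and return
    return obj;
}
```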

It's kind of a mess. Long critical sections are never nice from a performance viewpoint. What is even more worrisome in this case is the potential for deadlocks and unforeseen consequences from unknown code executing while the shard lock is held.

The easier solution would be to say: "if multiple threads concurrently try to return the same C++ object to Python, we will maintain the integrity of the internal data structures but make no guarantee about how many distinct Python objects result from this."

Thoughts?

@hawkinsp
Contributor

The easier solution would be to say: "if multiple threads concurrently try to return the same C++ object to Python, we will maintain the integrity of the internal data structures but make no guarantee about how many distinct Python objects result from this."

An observation: at least in many of my use cases, I don't particularly care how many Python objects correspond to each C++ instance. For example, for an immutable object that is basically a glorified tuple, I don't care about the object's identity.

In the case of copy or move semantics (inst_new_int), where the object's address effectively changes as part of the cast because it is copied or moved, why do we even try to maintain a unique Python object for every C++ object at all?

The invariant only really seems to make sense for reference semantics, surely? And even there, I would imagine that only a subset of users care, so allowing an opt-in or opt-out might make sense.

So perhaps something along the lines of "by default, allow duplication in the case of copy/move semantics, but use a more expensive but conservative approach for reference semantics"?

Even leaving threading aside, I'll note also that in pybind11 I went to some trouble to work around the fact that the bindings added every C++ class to a (slow) hash table, even to the extent of porting certain performance-critical classes away from pybind11 and only using pybind11 where performance didn't matter as much. The nanobind hash table is much faster, but it is still overkill in cases where I don't need nanobind to be uniquifying things for me!

Regarding step 6:

In the latter case, continue using the existing shard lock. The function inst_new_ext calls PyType_GenericAlloc internally, which AFAIK might cause the garbage collector to kick in. If so, arbitrary Python/C++ code might run at this point.

Perhaps you'd do best to drop the lock if the instance is not present, create the new instance, reacquire the lock, and then retry the addition. That way you avoid running arbitrary code under the lock, which would certainly deadlock if it happens to be nanobind-using code. If someone else has added the instance in the meantime, you just discard the one you made. It's probably not the end of the world to make a spurious allocation or two in the case of a race.
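
In code, that suggestion might look roughly like the following hedged sketch (toy registry and names, not nanobind's implementation): create the wrapper outside the lock, then retry the insertion and discard the local object if another thread won the race.

```cpp
// Hedged sketch of the "create outside the lock, then retry the insertion"
// idea. Names are invented; this is not nanobind's implementation.
#include <mutex>
#include <unordered_map>

struct Wrapper { void *cpp_ptr; };

static std::mutex registry_mutex;
static std::unordered_map<void *, Wrapper *> registry;

Wrapper *lookup_or_create_retry(void *ptr) {
    {
        std::lock_guard<std::mutex> guard(registry_mutex);
        auto it = registry.find(ptr);
        if (it != registry.end())
            return it->second;             // fast path: already registered
    }

    // Not found: create the wrapper *without* holding the lock, so that any
    // arbitrary code triggered by allocation (e.g. the GC) cannot deadlock.
    Wrapper *mine = new Wrapper{ptr};

    std::lock_guard<std::mutex> guard(registry_mutex);
    auto [it, inserted] = registry.emplace(ptr, mine);
    if (!inserted) {
        // Another thread won the race in the meantime: discard our wrapper
        // (a spurious allocation) and return the registered one.
        delete mine;
        return it->second;
    }
    return mine;
}
```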

@wjakob
Owner

wjakob commented Jan 28, 2025

In the case of copy or move semantics (inst_new_int), where the object's address effectively changes as part of the cast because it is copied or moved, why do we even try to maintain a unique Python object for every C++ object at all?

For rv_policy::copy, we always create a new instance. For rv_policy::move, we only move if the instance doesn't already exist in Python. That seems quite bug-prone. I already wanted to fix this a while ago but was worried about compatibility implications. I'm thinking of fixing it soon and seeing if anyone complains.

The invariant only really seems to make sense for reference semantics, surely?

That's right. This is solely about rv_policy::reference and rv_policy::reference_internal.

@vfdev-5: Here is an attempted fix for the race condition: 7230d4d (branch ft-race). Can you give that a try? It does not yet resolve the issue of concurrent reference casts potentially creating duplicate Python objects.

@wjakob
Owner

wjakob commented Jan 28, 2025

I rebased the ft-race branch on top of master. By accident, I had previously added my changes on top of an older version that also included another race condition.

@vfdev-5
Contributor Author

vfdev-5 commented Jan 28, 2025

@wjakob the ft-race commit 7230d4d fixes the race I saw in the original distilled test 🎉
I rechecked with the updated branch: it also fixes the race. I sent a PR with the test case to the ft-race branch: #886

vfdev-5 added a commit to vfdev-5/nanobind that referenced this issue Jan 28, 2025
wjakob added a commit that referenced this issue Jan 28, 2025
This commit addresses an issue arising when multiple threads want to
access the Python object associated with the same C++ instance, which
does not exist yet and therefore must be created. @vfdev-5 reported that
TSAN detects a race condition in code that uses this pattern, caused by
concurrent unprotected reads/writes of internal ``nb_inst`` fields.

There is also a larger problem: depending on how operations are
sequenced, it is possible that two threads simultaneously create a
Python wrapper, which violates the usual invariant that each (C++
instance pointer, type) pair maps to at most one Python object.

This PR updates nanobind to preserve this invariant. When registering a
newly created wrapper object in the internal data structures, nanobind
checks if another equivalent wrapper has been created in the meantime.
If so, we destroy the thread's instance and return the registered one.

This requires some extra handling code that, however, only runs with
very low probability. It also adds a new ``registered`` bit flag to
``nb_inst``, which makes it possible to have ``nb_inst`` objects that
aren't registered in the internal data structures. I am planning to use
that feature to fix the (unrelated) issue #879.
wjakob added a commit that referenced this issue Jan 28, 2025
This commit addresses an issue arising when multiple threads want to
access the Python object associated with the same C++ instance, which
does not exist yet and therefore must be created. @vfdev-5 reported that
TSAN detects a race condition in code that uses this pattern, caused by
concurrent unprotected reads/writes of internal ``nb_inst`` fields.

To fix this issue, we split instance creation and registration into
a two-step process. The latter is only done when the object is fully
constructed.
wjakob pushed a commit that referenced this issue Jan 28, 2025
wjakob added a commit that referenced this issue Jan 28, 2025
* Fix race condition in free-threaded Python (fixes issue #867)

This commit addresses an issue arising when multiple threads want to
access the Python object associated with the same C++ instance, which
does not exist yet and therefore must be created. @vfdev-5 reported that
TSAN detects a race condition in code that uses this pattern, caused by
concurrent unprotected reads/writes of internal ``nb_inst`` fields.

To fix this issue, we split instance creation and registration into
a two-step process. The latter is only done when the object is fully
constructed.

* Added test case for issue #867

---------

Co-authored-by: vfdev-5 <vfdev.5@gmail.com>
@vfdev-5
Contributor Author

vfdev-5 commented Jan 28, 2025

We can close this issue now, as it was fixed by #887.

@vfdev-5 vfdev-5 closed this as completed Jan 28, 2025
copybara-service bot pushed a commit to openxla/xla that referenced this issue Jan 28, 2025
Imported from GitHub PR #21960

Point nanobind to the commit fixing concurrent Python/C++ object access: wjakob/nanobind#867

cc @hawkinsp
Copybara import of the project:

--
77e693f by vfdev-5 <vfdev.5@gmail.com>:

Updated nanobind commit

Merging this change closes #21960

FUTURE_COPYBARA_INTEGRATE_REVIEW=#21960 from vfdev-5:update-nanobind 77e693f
PiperOrigin-RevId: 720567148
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this issue Jan 28, 2025
Imported from GitHub PR openxla/xla#21960

Point nanobind to the commit fixing concurrent Python/C++ object access: wjakob/nanobind#867

cc @hawkinsp
Copybara import of the project:

--
77e693fb39e0b737016770585c3f8786eb141474 by vfdev-5 <vfdev.5@gmail.com>:

Updated nanobind commit

Merging this change closes #21960

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#21960 from vfdev-5:update-nanobind 77e693fb39e0b737016770585c3f8786eb141474
PiperOrigin-RevId: 720567148
copybara-service bot pushed a commit to openxla/xla that referenced this issue Jan 29, 2025
Imported from GitHub PR #21960

Point nanobind to the commit fixing concurrent Python/C++ object access: wjakob/nanobind#867

cc @hawkinsp
Copybara import of the project:

--
77e693f by vfdev-5 <vfdev.5@gmail.com>:

Updated nanobind commit

Merging this change closes #21960

COPYBARA_INTEGRATE_REVIEW=#21960 from vfdev-5:update-nanobind 77e693f
PiperOrigin-RevId: 720849233
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this issue Jan 29, 2025
Imported from GitHub PR openxla/xla#21960

Point nanobind to the commit fixing concurrent Python/C++ object access: wjakob/nanobind#867

cc @hawkinsp
Copybara import of the project:

--
77e693fb39e0b737016770585c3f8786eb141474 by vfdev-5 <vfdev.5@gmail.com>:

Updated nanobind commit

Merging this change closes #21960

PiperOrigin-RevId: 720849233