[BUG]: Race on inst->state between nb_type_get / nb_type_put_common under free-threading #867
I suspect the following: What I think needs to change is that locking the shard becomes the responsibility of the caller of This race condition aside, there is also another fundamental race that will need to be figured out: if two threads simultaneously access the same C++ pointer, for which no Python object currently exists, there are two outcomes. Case 1: Case 2: That second case sounds like it could be a problem and would be nice to avoid. |
Is it easy for you to repro these? If I send you a patch, can you tell me if that fixed it? |
Yes, it is easy to reproduce the races; please send the patch. By the way, here is the test case: vfdev-5#1, which produces the above TSAN report.

Build command:

    CC=clang-18 CXX=clang++-18 cmake -S . -B build -DNB_TEST_FREE_THREADED=ON -DCMAKE_BUILD_TYPE=DEBUG -DNB_TEST_SANITIZERS_TSAN=ON
    cmake --build build/ -j 8

Run command:

    PYTHONPATH=./build/tests python3.14t -m pytest -s tests/test_thread.py -k test07_access_attributes &> test07_access_attributes.log
Thanks. I will not get to it this week, sorry (I have an important deadline). |
It would be useful to discuss this problem before developing a fix. Also adding @oremanj and @hawkinsp for feedback.

To recap, the issue reported here is a race condition, where multiple threads want to access the Python object associated with the same C++ instance, which does not exist yet and therefore must be created. The critical section protecting the newly created object isn't big enough, and that causes a race on the access of some fields. This particular problem is easily fixed by making the critical section a tiny bit larger.

A more significant issue then pops up: ultimately, what is the expected behavior in this case? For example, is it OK if the cast to a Python object potentially creates multiple independent Python instances in the case of significant concurrency? I think that this is weird, and it would be better to maintain an invariant that each C++ instance can only have one valid Python object associated with it at any time.

However, ensuring that this works as expected would require making the critical section much larger. It would be so large that I'm a little worried about the consequences. Right now, it only protects access to the To ensure the "sane" behavior stated above, we would need to
It's kind of a mess. Long critical sections are never nice from a performance viewpoint. What is even more worrisome in this case are potential deadlocks and unforeseen consequences from unknown code executing while holding the shard lock. The easier solution would be to say: "if multiple threads concurrently try to return the same C++ object to Python, we will maintain the integrity of the internal data structures but make no guarantee about how many distinct Python objects result from this." Thoughts? |
An observation: at least in many of my use cases I don't particularly care how many Python objects correspond to each C++ class. e.g., for an immutable object that is basically a glorified tuple I don't care about the object's identity. In the case of copy or move semantics ( The invariant only really seems to make sense for reference semantics, surely? And even there, I would imagine that only a subset of users care, so allowing an opt-in or opt-out might make sense. So perhaps something along the lines of "by default, allow duplication in the case of copy/move semantics, but use a more expensive but conservative approach for reference semantics"?

Even leaving threading aside, I'll note also that in pybind11 I went to some trouble to work around the fact that the bindings added every C++ class to a (slow) hash table, even to the extent of porting certain performance-critical classes away from pybind11 and only using pybind11 where performance didn't matter as much. The nanobind hash table is much faster, but it is still overkill in cases where I don't need nanobind to be uniquifying things for me!

For 6:
Perhaps you'd do best to drop the lock if the instance is not present, create the new instance, reacquire the lock, and then retry the addition. That way you avoid running arbitrary code under the lock, which will certainly deadlock (it might be nanobind-using code). If someone else has in the meantime added the instance, you just discard the one you made. It's probably not the end of the world to make a spurious allocation or two in the case of a race. |
For
That's right. This is solely about @vfdev-5: Here is an attempted fix for the race condition: 7230d4d (branch |
I rebased the |
This commit addresses an issue arising when multiple threads want to access the Python object associated with the same C++ instance, which does not exist yet and therefore must be created. @vfdev-5 reported that TSAN detects a race condition in code that uses this pattern, caused by concurrent unprotected reads/writes of internal ``nb_inst`` fields. There is also a larger problem: depending on how operations are sequenced, it is possible that two threads simultaneously create a Python wrapper, which violates the usual invariant that each (C++ instance pointer, type) pair maps to at most one Python object. This PR updates nanobind to preserve this invariant. When registering a newly created wrapper object in the internal data structures, nanobind checks if another equivalent wrapper has been created in the meantime. If so, we destroy the thread's instance and return the registered one. This requires some extra handling code, that, however, only runs with very low probability. It also adds a new ``registered`` bit flag to ``nb_inst``, which makes it possible to have ``nb_inst`` objects that aren't registered in the internal data structures. I am planning to use that feature to fix the (unrelated) issue #879.
This commit addresses an issue arising when multiple threads want to access the Python object associated with the same C++ instance, which does not exist yet and therefore must be created. @vfdev-5 reported that TSAN detects a race condition in code that uses this pattern, caused by concurrent unprotected reads/writes of internal ``nb_inst`` fields. To fix this issue, we split instance creation and registration into a two-step process. The latter is only done when the object is fully constructed.
* Fix race condition in free-threaded Python (fixes issue #867) This commit addresses an issue arising when multiple threads want to access the Python object associated with the same C++ instance, which does not exist yet and therefore must be created. @vfdev-5 reported that TSAN detects a race condition in code that uses this pattern, caused by concurrent unprotected reads/writes of internal ``nb_inst`` fields. To fix this issue, we split instance creation and registration into a two-step process. The latter is only done when the object is fully constructed. * Added test case for issue #867 --------- Co-authored-by: vfdev-5 <vfdev.5@gmail.com>
We can close this issue now as fixed by #887 |
Imported from GitHub PR #21960 Point nanobind to the commit fixing python/c++ object concurrent accessing: wjakob/nanobind#867 cc @hawkinsp Copybara import of the project: -- 77e693f by vfdev-5 <vfdev.5@gmail.com>: Updated nanobind commit Merging this change closes #21960 FUTURE_COPYBARA_INTEGRATE_REVIEW=#21960 from vfdev-5:update-nanobind 77e693f PiperOrigin-RevId: 720567148
Imported from GitHub PR #21960 Point nanobind to the commit fixing python/c++ object concurrent accessing: wjakob/nanobind#867 cc @hawkinsp Copybara import of the project: -- 77e693f by vfdev-5 <vfdev.5@gmail.com>: Updated nanobind commit Merging this change closes #21960 COPYBARA_INTEGRATE_REVIEW=#21960 from vfdev-5:update-nanobind 77e693f PiperOrigin-RevId: 720849233
Problem description
We see a few data races when accessing a class attribute of another class attribute:
inst->state
. The first race we saw is being fixed by #865.
@wjakob how would you like this second race to be fixed?
Full TSAN report and code reproducer is here: https://gist.github.com/vfdev-5/d46bcdc73231bd1e2b2f85c40de9f890
Another similar TSAN report by @hawkinsp : https://gist.github.com/hawkinsp/5f3997c72c2c6781c1d8a90225f3eddd
Reproducible example code