DS crashes when unsubscribing from events in __del__ #292
Hi Zibi, have you tried to call the content of `__del__` explicitly instead of relying on the garbage collector?
Use DeviceProxy instead of taurus to avoid crashes in Py3. See: tango-controls/pytango#292. To be reverted when the above issue gets clarified.
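For illustration, this is roughly what that workaround amounts to on the client side (the device name and callback are placeholders, not from the linked commit): use a plain DeviceProxy and manage the event subscription explicitly, instead of relying on a Taurus attribute object whose finaliser would unsubscribe.

```python
import tango

proxy = tango.DeviceProxy("test/pydsexp/1")            # placeholder device name
event_id = proxy.subscribe_event("State",
                                 tango.EventType.CHANGE_EVENT,
                                 lambda evt: None)      # placeholder callback
# ... use the events ...
proxy.unsubscribe_event(event_id)                       # explicit, not left to __del__
```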
I've started looking at this. Don't know what the cause of the crash is yet, but I don't think `__del__` was even being called under Python 2: objects with `__del__` methods that are part of a reference cycle are not collected there and end up in `gc.garbage`, whereas Python 3.4+ collects them and runs the finaliser. Compare the two sessions below.

Python 2.7:

$ python
Python 2.7.6 (default, Nov 13 2018, 12:45:42)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gc
>>> import weakref
>>>
>>> class PyTangoDevice(object):
... def __init__(self, name):
... print("PyTangoDevice.__init__")
... self._state = self.getAttribute("state")
... def getAttribute(self, name):
... return PyTangoAttribute(name, self)
...
>>> class PyTangoAttribute(object):
... def __init__(self, name, dev):
... print("PyTangoAttribute.__init__")
... self._dev = dev
... def __del__(self):
... print("PyTangoAttribute.__del__")
...
>>> dev = PyTangoDevice('abc')
PyTangoDevice.__init__
PyTangoAttribute.__init__
>>> wr = weakref.ref(dev)
>>> wr
<weakref at 0x7f3842fb4f18; to 'PyTangoDevice' at 0x7f3842fcb890>
>>> dev = None
>>> wr
<weakref at 0x7f3842fb4f18; to 'PyTangoDevice' at 0x7f3842fcb890>
>>> gc.collect()
4
>>> gc.garbage
[<__main__.PyTangoAttribute object at 0x7f3842fcb910>]
>>> wr
<weakref at 0x7f3842fb4f18; to 'PyTangoDevice' at 0x7f3842fcb890>
>>>

Python 3.4:

$ python3
Python 3.4.3 (default, Nov 12 2018, 22:25:49)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gc
>>> import weakref
>>>
>>> class PyTangoDevice(object):
... def __init__(self, name):
... print("PyTangoDevice.__init__")
... self._state = self.getAttribute("state")
... def getAttribute(self, name):
... return PyTangoAttribute(name, self)
...
>>> class PyTangoAttribute(object):
... def __init__(self, name, dev):
... print("PyTangoAttribute.__init__")
... self._dev = dev
... def __del__(self):
... print("PyTangoAttribute.__del__")
...
>>> dev = PyTangoDevice('abc')
PyTangoDevice.__init__
PyTangoAttribute.__init__
>>> wr = weakref.ref(dev)
>>> wr
<weakref at 0x7ffa87a6bc28; to 'PyTangoDevice' at 0x7ffa87a7c048>
>>> dev = None
>>> wr
<weakref at 0x7ffa87a6bc28; to 'PyTangoDevice' at 0x7ffa87a7c048>
>>> gc.collect()
PyTangoAttribute.__del__
4
>>> gc.garbage
[]
>>> wr
<weakref at 0x7ffa87a6bc28; dead>
>>>
Hi @reszelaz, looking at the backtrace in the https://github.com/reszelaz/test-tango-py3 README, it looks like an exception is thrown when trying to acquire a Tango monitor in your case.
First of all, thanks @jkotan, @ajoubertza and @bourtemb for looking into this!

I have done explicit unsubscribes and, from my tests, it looks like the problem disappears. You can see the corresponding code in tango-test-py3/explicit_unsub. This avoids the unsubscription in the `__del__` method.

Taurus attributes subscribe to configuration events in their constructor and unsubscribe in `__del__`.

We observe this problem when iterating over attributes and continuously subscribing and unsubscribing from events. The thing is that the unsubscription is triggered by the garbage collector. I suspect these unsubscriptions somehow collide with the subscriptions of the other attributes.
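For illustration, a minimal sketch of the two patterns under discussion (hypothetical names, not the actual Taurus/Sardana code): unsubscription driven by `__del__` versus an explicit, caller-controlled unsubscription.

```python
import tango

class AttributeWrapper(object):
    """Hypothetical wrapper that subscribes on creation (illustrative only)."""

    def __init__(self, proxy, attr_name, callback):
        self._proxy = proxy
        self._event_id = proxy.subscribe_event(
            attr_name, tango.EventType.CHANGE_EVENT, callback)

    def dispose(self):
        # Explicit unsubscription, called by the owner on a known thread
        # (the workaround that makes the problem disappear).
        if self._event_id is not None:
            self._proxy.unsubscribe_event(self._event_id)
            self._event_id = None

    def __del__(self):
        # Unsubscription driven by the garbage collector (the problematic
        # pattern): it runs on whatever thread happens to trigger a collection.
        self.dispose()
```

With the explicit variant, the caller invokes `wrapper.dispose()` on a known thread before dropping its last reference, so nothing Tango-related is left to the finaliser.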
Thanks @bourtemb and @reszelaz for working on this today. As promised, here is a summary of our findings:
- subscribe_event is not finished yet, but unsubscribe_event is already called from Taurus.
- unsubscribe_event fails to acquire the monitor (in DelayEvent).
- Both subscribe and unsubscribe are called from the same Python Thread object.
- During subscribe, no omni_thread is configured.

Hypothesis 1:
Hypothesis 2:

Anyway, calling subscribe/unsubscribe_event from non-omniORB threads (like Python threads) is not supported by the cppTango monitor (probably we should open an issue for fixing it / printing a warning / documenting it).

With the patch below I am unable to reproduce the issue. It's not complete - you should still delete the ensure_self object created in the thread.

diff --git a/taurus/PyDsExpClient.py b/taurus/PyDsExpClient.py
index a88a924..7c2c59c 100644
--- a/taurus/PyDsExpClient.py
+++ b/taurus/PyDsExpClient.py
@@ -7,6 +7,20 @@ from taurus.core.tango import TangoDevice
import PyTango
+import cffi
+ffibuilder = cffi.FFI()
+ffibuilder.cdef("void* omni_thread_ensure_self();")
+lib = ffibuilder.verify(r"""
+ #include <omnithread.h>
+ #include <cstdio>
+ void* omni_thread_ensure_self()
+ {
+ return new omni_thread::ensure_self;
+ }
+ """,
+ source_extension='.cpp',
+ libraries=["omnithread"]
+)
DEV_NAME_PATTERN = "test/pydsexp/{}"
@@ -30,6 +44,7 @@ class JobThread(threading.Thread):
self.dev = dev
def run(self):
+ es = lib.omni_thread_ensure_self()
for i in range(100):
if self.dev and self.dev._stop_flag:
break
Thanks to you two for investigating this! I just tried to run the example with your patch. Then a question to the PyTango maintainers would be whether we could have a convenient API for calling the `ensure_self` from Python.
@bourtemb How about:
I think this should solve this issue (without any changes in PyTango) as well as any issues you experience with C++11 threads.
@bourtemb another option is to gradually remove the dependency on the omni_thread API, replacing it with C++11 threads or some Tango-specific class which wraps the low-level threading interface. For instance, replace:

#ifdef LINUX
pthread_t id = pthread_self();
#else
DWORD id = GetCurrentThreadId();
#endif
I agree with you. I think we should not do that.
This thread_local feature might be very useful indeed in our use case.
I think it's a good idea. It might actually be enough to declare it at global scope and to initialize it there. It's maybe not needed to add this.
This works when using pthreads, but would this work with standard C++11 threads?
According to this SO question:
C++11 threads are just convenient wrappers around the basic threading libraries. On Linux systems it is probably always pthreads (not sure about BSD, Solaris and others), e.g. on Ubuntu 18.04:

/usr/include/c++/7/thread
/usr/include/x86_64-linux-gnu/c++/7/bits/gthr-posix.h

See e.g. an example of how to use the pthread API to change the priority of a C++11 thread: https://en.cppreference.com/w/cpp/thread/thread/native_handle

The idea is to have a simple class like TangoThreadId, which will store the native thread id during construction and overload operator== for comparison. But I'm not sure if, without creating "ensure_self", you can access omni_mutex or omni_condition.
Let me answer myself:
class _OMNITHREAD_NTDLL_ omni_mutex {
public:
omni_mutex();
~omni_mutex();
inline void lock() { OMNI_MUTEX_LOCK_IMPLEMENTATION }
inline void unlock() { OMNI_MUTEX_UNLOCK_IMPLEMENTATION }
...
#define OMNI_MUTEX_LOCK_IMPLEMENTATION \
pthread_mutex_lock(&posix_mutex);
#define OMNI_MUTEX_TRYLOCK_IMPLEMENTATION \
return !pthread_mutex_trylock(&posix_mutex);

Similar for NT threads. So we need omni-thread-friendly threads (ensure_self) probably only for ID access and thread-local storage.
Hi Michal, I finally could try your patch proposed in #292 (comment). I tried it in the Sardana code (this is where we originally found the issue) and not in the simplified example from the initial comment of this issue. I tried it in two different places:
es = lib.omni_thread_ensure_self() in taurus.core.util.threadpool.Worker - Sardana uses just this kind of worker threads, and macros are actually executed by them.
Note that the second scenario may cause this line to be executed more than once by the same thread. The first one does not have this risk. Then, of course, I have reverted our workaround commit so we again use the disposable Taurus attributes. Unfortunately, our test suite hangs using this workaround - I ran it 5 times, always with the same result. The same test suite without doing anything works correctly (we have run it for months in CI and I also repeated it now on my testing platform). Also, if I just apply your code...
Just to complement the previous comment, here is the branch on which I tried it. The last commit applies the workaround as explained in point 2. The previous commit is the revert of our workaround.
Hi @reszelaz, thanks for the detailed information! I'm trying to reproduce the issue using the Sardana test suite but I do not see any crash or deadlock. I have: cppTango (tango-9-lts), pytango (v9.3.1 from git), itango (latest from pypi), taurus (latest from pypi), sardana (branch workaround_pytango292 from your fork), all running on Python 3. I have started Pool (demo1), MacroServer (demo1), ran sar_demo from spock and then ran the test suite.

Is this the correct way to reproduce the issue? Which test case from the Sardana test suite shows the problem?
I will answer with more details on Monday (I'm out of the office now). Similar problems happen with #318. I investigated a little bit and I suspect that the events stop working at some point. I was even able to reproduce it without running the whole test suite. We may be hitting another bug...
Hi again,
As we can see, at the very end, the client (python -m unittest ...) receives "API_EventTimeout". Before continuing, let me explain a little bit what this TestMeasurementGroup does. There is just one test definition and we execute it 6 times with different parameters (different experimental channels in the MeasurementGroup). Every test execution creates a Pool device server (until there is a problem with one of the tests it will always use the same instance name: unittest1_1) and populates it with test devices. Then it creates a MeasurementGroup with the experimental channels specified as parameters and makes an acquisition. And all this is repeated 6 times. But it hangs before finishing... When the tests are hung, apparently the Pool device server does not hang. Its devices respond to the state queries (also from the MeasurementGroup device). So, now I use:
def run(self):
with tango.EnsureOmniThread():
get = self.pool.jobs.get
while True:
cmd, args, kw, callback, th_id, stack = get()
if cmd:
self.busy = True
self.cmd = cmd.__name__
try:
if callback:
callback(cmd(*args, **kw))
else:
cmd(*args, **kw)
except:
orig_stack = "".join(format_list(stack))
self.error("Uncaught exception running job '%s' called "
"from thread %s:\n%s",
self.cmd, th_id, orig_stack, exc_info=1)
finally:
self.busy = False
self.cmd = ''
else:
self.pool.workers.remove(self)
return
We also tried this workaround at one of the beamlines with much more complicated macros than these tests. The result was that the Sardana servers were hanging after prior reports of serialization monitor errors.
Hi @reszelaz. This is quite a difficult problem to solve! The Taurus thread pool implementation looks correct. I searched the Sardana code for "thread" and I do see a few other places where standard Python threads are created.
Thanks Anton for looking into it again! I have advanced a little bit with the debugging, let's see if the following information is helpful to you... Actually the tests (client side) that I use to reproduce this issue don't use threads. The server (Pool) which is used in these tests uses threads, but I suppose it does not matter. I have put some debug prints in PyTango:

From the all-tests output you can see that in the case of the previous tests the garbage collection does not happen while we subscribe. I also attach a backtrace of all threads from the hung test process. You should be able to reproduce it very easily. It happens almost always when running:

However sometimes it luckily finishes, even after throwing some API_EventTimeouts, but I would say that in 80% of the cases it hangs the test process. It is enough to use:
I was also thinking about which thread the garbage collector (and thus `__del__`) runs on. Instead, I made a simple test script (no Tango code involved) that creates objects with reference cycles from a number of threads, with random sleeps in between. The finalisers end up running on whichever thread happens to trigger the collection. There's an interesting PEP related to this that hasn't been implemented yet (deferred state): https://www.python.org/dev/peps/pep-0556/ They talk about reentrancy being a problem. Even better is this blog article: https://codewithoutrules.com/2017/08/16/concurrency-python/ which warns against doing much work in `__del__` methods.
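A minimal sketch of that kind of test script (no Tango involved; names, counts and sleeps are illustrative, this is not the original script):

```python
import gc
import random
import threading
import time

class Node(object):
    def __del__(self):
        # Report which thread the finaliser actually runs on.
        print("__del__ on", threading.current_thread().name)

def make_garbage():
    for _ in range(200):
        a, b = Node(), Node()
        a.other, b.other = b, a   # reference cycle: only the cyclic GC frees it
        del a, b
        time.sleep(random.uniform(0, 0.01))

threads = [threading.Thread(target=make_garbage, name="worker-%d" % i)
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
gc.collect()  # collect whatever is left at the end (runs on the main thread)
```

The `__del__` prints show the finalisers spread across the worker threads, not confined to the thread that created the objects.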
This might be dangerous, since the dummy omniORB ID will only last as long as the `ensure_self` object exists. You can also use the new `tango.EnsureOmniThread` context manager, keeping it active for the full lifetime of the thread.
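For illustration, a minimal sketch of a client thread wrapped in that context manager (the device name, attribute and callback are placeholders, not taken from this issue):

```python
import threading
import tango

def on_event(event):
    # Placeholder callback; a real client would do something useful here.
    print("event:", event.attr_name, "error:", event.err)

def worker():
    # Keep EnsureOmniThread active for the whole lifetime of the thread, so
    # omniORB knows about it during both subscribe and unsubscribe.
    with tango.EnsureOmniThread():
        proxy = tango.DeviceProxy("sys/tg_test/1")
        eid = proxy.subscribe_event("double_scalar",
                                    tango.EventType.CHANGE_EVENT,
                                    on_event)
        try:
            pass  # ... do the actual work here ...
        finally:
            proxy.unsubscribe_event(eid)

t = threading.Thread(target=worker)
t.start()
t.join()
```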
@ajoubertza, thanks for the nice reading material. It looks like the second article describes what is going on with Taurus/Sardana in a simple way. In py2, because of the cyclic dependencies, the content of `__del__` was never executed (the objects ended up in `gc.garbage` instead).

Does anyone know any other option?
Another option is not to unsubscribe at all when your client objects are finalised - that is what was happening under Python 2, since objects with `__del__` methods involved in reference cycles were never collected there.
Hi all, thanks Jan and Anton for investigating this issue. I see that you advanced a lot! I have tried to use it...

I did it last Friday; based on this I took my vague interpretation of what is going on. There are just two threads: MainThread and Dummy-1. Dummy-1 is the one which executes the event callbacks. Then, if two conditions are met:
it hangs the tests.
You can try using the reszelaz/sardana-test docker image and follow the instructions from the "How to develop sardana using this image" section. I just updated the image so it uses the latest taurus from develop.

docker pull reszelaz/sardana-test
docker run -d --name=sardana-test -h sardana-test -v /home/zreszela/workspace/sardana:/sardana reszelaz/sardana-test
docker exec sardana-test bash -c "cd /sardana && python3 setup.py develop"
docker exec sardana-test bash -c "python3 -m unittest sardana.taurus.core.tango.sardana.test.test_pool.TestMeasurementGroup -v" Note: In my case, to reproduce this issue on docker I was forced to add more tests to the Whenever you reproduce it, to repeat it again, you will need to kill the hung processes (server & tests) and clean the DB: docker exec sardana-test bash -c "killall -9 /usr/bin/python3"
docker exec sardana-test bash -c "python3 -c 'import tango; db = tango.Database(); [db.delete_server(server) for server in db.get_server_list(\"Pool/unittest*\")]'" I just had time to read the above blog article. Based on it I wonder what is the lock that we deadlock on? It must come from Tango, right? I have made some investigation and I would rather discard this PyTango lock: Line 1073 in 5865b6f
Please correct me if I'm wrong, but if we are just using the MainThread and the Dummy-1 thread (the Tango event consumer) and it still hangs (or produces API_EventTimeouts), then maybe we are hitting some other problem and not the one that can be fixed by the `EnsureOmniThread` workaround.
In cppTango, when a DeviceProxy object is deleted, it unsubscribes from the events already subscribed via this DeviceProxy object. Here are the lines where the lock is taken, in EventConsumer::subscribe_event:

I think what's happening in your use case is very similar to what is described in https://codewithoutrules.com/2017/08/16/concurrency-python/ in the "Reentrancy!" section. As far as I understand, the main problem comes from the fact that we end up in a situation where the garbage collector is invoked in the middle of an operation which has taken the lock (subscribe_event) and has not yet released it. This is quite an annoying problem, because it should not be present in C++. The warning in the Python 3 documentation is also clear on the topic:
This blog post seems to discuss a very similar problem in the MongoDB Python driver:

Another solution might be to start a one-shot timer in the `__del__` method that performs the actual clean-up slightly later, outside of the garbage collection run.
It seems to work in this simple test case, but I don't know if it is safe in general.
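A minimal sketch of that idea, with a hypothetical wrapper class (illustrative names; this is not the code the comment refers to):

```python
import threading
import tango

class AttributeListener(object):
    """Hypothetical event listener used only to illustrate the idea."""

    def __init__(self, proxy, attr_name, callback):
        self._proxy = proxy
        self._event_id = proxy.subscribe_event(
            attr_name, tango.EventType.CHANGE_EVENT, callback)

    def __del__(self):
        # Do not unsubscribe here: this may run inside the garbage collector,
        # on a thread that already holds the Tango monitor.  Hand the real
        # work to a one-shot timer thread instead.
        proxy, eid = self._proxy, self._event_id  # plain locals, so the timer
                                                  # does not keep self alive
        threading.Timer(0, proxy.unsubscribe_event, args=(eid,)).start()
```

Whether this is safe depends on the same threading questions discussed above: the timer runs on yet another Python thread that omniORB does not know about.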
Thanks Reynald for the Tango C++ details! And Tim for joining this discussion! Things are getting clearer in my mind now.
I think that Anton here referred to removing the unsubscribes from the Taurus `__del__` methods. Also about:
Actually I'm not sure if in Python 2 these were not called. I have a vague memory that these were called, but maybe not so often, and maybe much later in time. But I think it is not worth investigating this; if you agree, let's focus on Python 3. So, what I tried is what Anton said (or at least what I understood :)) - to not call unsubscribes in `__del__`. Reynald (@bourtemb), since I'm not so fluent in C++, could you clarify for me if, in either of these occasions:
it will still try to get the lock when destructing the DeviceProxy? I'm almost sure, but please also confirm: the lock is global per client, and not per DeviceProxy?

Now, a question more to the Python experts: when the GC calls `__del__`, do the other threads continue to run? The comments below Tim's first link somehow confirm this:
But since there are contradictory comments below, and they also mention the PyPy implementation, I'm not 100% sure yet.
I think that comment is wrong. It should deadlock when the GC runs `__del__` on the thread that already holds the lock.
Depending on the thread where the collection (and thus `__del__`) happens to run, it may or may not deadlock.

Edit:
~DeviceProxy() calls unsubscribe_all_events()
unsubscribe_all_events() calls ZmqEventConsumer::get_subscribed_event_ids()
ZmqEventConsumer::get_subscribed_event_ids() takes the ZmqEventConsumer map_modification_lock, as a ReaderLock
I confirm, this ZmqEventConsumer::map_modification_lock lock is global. It is a lock associated with the ZmqEventConsumer object instance, which is unique per client.
To clarify: calling one of these methods from a `__del__` method can deadlock, because the GC may run it on a thread that already holds the lock.* There are at least a few possible solutions or workarounds, discussed below.

*Actually, the GC of CPython only runs after instructions that allocate, so avoiding allocations in code paths that hold the lock is a possible but difficult workaround.
|
Yes, @reszelaz I meant removing the Python calls to unsubscribe in `__del__`. Thanks for the great analysis, everyone! Considering @schooft's suggestions:
|
I didn't mean to disable the GC, but to not allocate any objects in the problematic code paths, as done for MongoDB version 2. As far as I understand, the GC is only triggered when new Python objects are allocated, except for some special occasions like shutdown.
This makes sense. The GC only actually runs when allocations exceed certain thresholds. By running it regularly, the times when the allocations exceed the thresholds are changed. If there are only a few allocations, the GC might not run at all outside the manual invocations.
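A sketch of that idea (a hypothetical helper, not code from this thread): run collections regularly from one dedicated thread, which shifts when the allocation thresholds are crossed. Note that this only makes the bad timing less likely; it does not rule it out.

```python
import gc
import threading

def start_periodic_gc(interval=1.0):
    """Call gc.collect() regularly from a dedicated daemon thread."""
    stop = threading.Event()

    def _loop():
        while not stop.wait(interval):
            gc.collect()

    threading.Thread(target=_loop, name="gc-collector", daemon=True).start()
    return stop  # call stop.set() to end the periodic collections
```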
Good point. I agree that there is likely another problem. The backtrace attached in #292 (comment) does not contain any calls from the garbage collector. At least in the toy example below, I can see the following frames:
This is the toy example that triggers the GC on a random thread by allocating many objects:
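(A minimal sketch of that kind of toy example, not the original attachment; names and counts are illustrative: a lock held while allocating, plus finalisers that need the same lock.)

```python
import threading

lock = threading.Lock()

class Resource(object):
    def __del__(self):
        # The finaliser needs the same (non-reentrant) lock that the worker
        # holds while allocating; if the GC fires there, the thread deadlocks.
        with lock:
            pass

def worker():
    for _ in range(100000):
        # Create cyclic garbage that only the cyclic GC can reclaim.
        a, b = Resource(), Resource()
        a.other, b.other = b, a
        del a, b
        with lock:
            # Allocations made while holding the lock can push the GC over its
            # threshold, so Resource.__del__ may run right here.
            junk = [object() for _ in range(50)]

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("finished without deadlock")
```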
On my PC this deadlocks roughly every second time. It is noteworthy that at least one garbage-collectable object (not necessarily the problematic one) needs to be created on the thread that holds the lock, otherwise I didn't manage to trigger the GC on that thread. @reszelaz Could you also create a backtrace with gdb?

Edit: The backtrace above is now the correct one for the toy example.
I got a backtrace myself for the PyTango test case (I am not sure if my system is set up correctly, as I have a lot of boost error-handling stuff in the full backtrace):
So while being in the print function in `__del__`...

The relevant backtrace that leads to the call:
The relevant backtrace from within the `__del__` call:
The full backtrace: backtrace.txt

This was one of the runs. During other runs, I didn't get a deadlock/crash but instead this exception:
This was the
To trigger the above error reliably, just call
Hi all. Tim, sorry that I did not send you the backtrace before, but this morning I rebooted my PC and afterwards the Sardana tests that I used to run started to behave differently. At the beginning it was driving me crazy, but now I think it helped me to advance a little bit... Now, regarding your tests from #292 (comment): I'm not sure, but I think that for this one it is needed to use the... Also, regarding the potential deadlock on... At some point it came to my mind that instead of calling unsubscribes in `__del__`... Thanks again for all your help in trying to fix this issue!
I tried the example, and got an almost identical result (the only differences are in the API_EventTimeout text). My test environment is Docker + Conda + Python 3.7.5 + cppTango 9.3.2 + your pytango branch.
|
@reszelaz Thank you for this great example case! On my machine I get the same error. I also implemented the same client in C++ and the error occurs there as well. |
Thanks for the tests! With Tim's example in C++, I have already reported it: tango-controls/cppTango#686.
Hi all, I was just reviewing old issues and I think that this one can be closed already. We are happy with the fix in cppTango.
Hi PyTango experts,
I would like to report an issue that we found in the Sardana project after migrating it to Python 3.
First of all, sorry for the complexity of the examples (one uses pure PyTango and another one Taurus) needed to reproduce it, but I was not able to reduce them further. Also, sorry if the problem turns out to be not in PyTango but in Tango itself - I do not have the knowledge to reproduce it with Tango C++.
I think all the necessary information and the steps to reproduce it are in https://github.com/reszelaz/test-tango-py3.
If you have any questions, don't hesitate to ask.
Cheers,
Zibi