DS hangs when concurrently subscribing to events and destructing DeviceProxy #315
Sorry Michal, I was on a training the whole week and must have missed your email. I will take a look at it next week. Again, many thanks for your help!
Hi @mliszcz,
Hi @reszelaz. Do you know how to reproduce the issue using the Sardana test suite?
We don't have an automatic test that triggers this issue. It was observed at the beamline and we were able to reproduce it manually. However, the example from this issue description is enough to reproduce it.
As was already identified in sardana-org#663, AttributeProxy creation could be avoided and the device's WAttribute object could be used directly. Do it for pseudo motors and motor groups to avoid problems due to tango-controls/pytango#315.
Hi @reszelaz. I was looking at this again, because I want to make a new release, and if there is a possible fix it would be good to include it. TL;DR I don't think there is anything to fix in PyTango. I can prevent the deadlock in the example by using a shared lock.

I guess the reason for the deadlock is similar to what we see in #292. The server handles the request from each client in a different thread, so the Python code being executed to handle each read attribute function can get interrupted after any Python bytecode instruction. In this case we don't have cyclic references, so the garbage collector isn't involved.

From your backtrace, when we get the deadlock it looks like Device2 is busy with the DeviceProxy destructor, while Device1 is busy handling the callback from the event subscription.

The code was modified as below, with a lock shared between the two devices via a common module. In Device1:

```python
import common
...

    def read_attr1(self):
        with common.lock:
            dev = tango.DeviceProxy("sys/tg_test/1")
            dev.subscribe_event("double_scalar",
                                tango.EventType.ATTR_CONF_EVENT,
                                cb)
            return dev.read_attribute("double_scalar").value
```

In Device2:

```python
import common
...

    def read_attr2(self):
        with common.lock:
            tango.DeviceProxy("sys/tg_test/1")
            return time.time()
```
I think that the problem is actually with the Python GIL, so this would mean a problem with PyTango. It looks like there is a deadlock between two threads: one thread holds the GIL while running the DeviceProxy destructor, which blocks on an internal cppTango lock during unsubscription, while the other thread holds that cppTango lock while delivering an event and waits for the GIL to run the Python callback.
Together with @tiagocoutinho we are trying to reimplement DeviceProxy in order to force the release of the GIL while unsubscribing. We are still trying to make it work properly. We will let you know ASAP. Of course we would like to get this fix included in the latest release. Could you please wait a little? We would also like to help you with the release process.
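The two-thread lock inversion described here can be sketched in pure Python, using two ordinary locks to stand in for the GIL and for cppTango's internal subscription lock (the names, roles and timings below are illustrative assumptions, not PyTango internals):

```python
import threading
import time

gil = threading.Lock()      # stands in for the Python GIL
monitor = threading.Lock()  # stands in for cppTango's internal subscription lock

barrier = threading.Barrier(2)
results = {}

def worker(name, first, second):
    with first:                            # take our own lock first
        barrier.wait()                     # both threads now hold their first lock
        got = second.acquire(timeout=0.3)  # then try to take the peer's lock
        if got:
            second.release()
        results[name] = got
        time.sleep(0.6)  # keep holding `first` past the peer's timeout

# "destructor" holds the GIL and wants the monitor (to unsubscribe);
# "event_callback" holds the monitor and wants the GIL (to run the Python callback)
t1 = threading.Thread(target=worker, args=("destructor", gil, monitor))
t2 = threading.Thread(target=worker, args=("event_callback", monitor, gil))
t1.start(); t2.start(); t1.join(); t2.join()

print(results)  # both entries are False: the classic AB/BA lock-order deadlock
```

With timeouts removed (as in the real code, where neither lock acquisition can time out), both threads would block forever, which matches the hang reported in this issue.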
Hi @jairomoldes + @tiagocoutinho, thanks for this update and help on this issue and with the release! It is highly appreciated!
@jairomoldes + @tiagocoutinho Well spotted. That's very interesting. I see how the two locks are causing this now that you have pointed it out 🙂 Sure, we can wait on the release (#342), but I hope it can be done this week. Are you planning to add a destructor to the Boost DeviceProxy C++ wrapper class and release the GIL there while calling the libtango destructor?
Note that ALBA agreed not to delay the v9.3.2 release any longer for this issue. |
Hi, |
DeviceProxy::subscribe_event documentation says the following:
This means the DeviceProxy object must stay alive as long as we want to receive events. As soon as the DeviceProxy object is destroyed, the events which have been subscribed will be unsubscribed.
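The documented lifetime rule can be illustrated with a toy pure-Python stand-in (`FakeProxy` is hypothetical and only mimics the quoted behaviour; it is not a PyTango API):

```python
import gc

class FakeProxy:
    """Toy stand-in for DeviceProxy: deleting it drops its subscriptions."""
    active = set()

    def subscribe(self, attr):
        FakeProxy.active.add(attr)

    def __del__(self):
        # mirrors libtango: destroying the proxy unsubscribes everything
        FakeProxy.active.clear()

p = FakeProxy()
p.subscribe("double_scalar")
assert FakeProxy.active == {"double_scalar"}

p = None      # last reference gone: __del__ runs and unsubscribes
gc.collect()  # make the collection explicit for non-refcounting runtimes
print("subscriptions after deletion:", FakeProxy.active)  # empty set
```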
Please also note that @mliszcz has done some work on the topic to avoid unintended unsubscriptions when using several proxies pointing to the same device:
I tried this, but the problem is still there.
Thanks @tiagocoutinho, @bourtemb and @jairomoldes for investigating this issue!
I understand that the above example does not look very realistic when one thinks about programming with PyTango. In the case of using disposable Taurus attributes it is, however, a very probable scenario - Taurus subscribes to the configuration events when constructing an attribute. Let me explain another scenario which we also suffer from at ALBA. In the case of the MacroServer Tango DS (part of the Sardana project) we run macros (procedures written in Python) in parallel. Within the MacroServer server we define multiple Door Tango devices, and each Door can run one macro at a time. Macros may last a short or long time, may subscribe to events, and DeviceProxies may get destroyed when no longer necessary. To demonstrate this scenario I have changed the originally posted example - see Demo with commands and background jobs. The macros that hang at ALBA do not involve Taurus and just use PyTango. The backtraces of the hung threads point to the same issue, but it would be better if someone more experienced confirmed that.
I have just realized that in this example I was not using the
Thanks, I was wondering if that would make a difference. I will investigate further. |
I tried adding a function to be called when the C++ DeviceProxy object is about to be released by the Boost extension; in that new function the GIL is released. I only tried the clients that read the attributes, not the clients that use the commands. Git diff of the relevant changes:
Output from server, showing some new output. It was busy with a subscribe event, and the garbage collection triggered the clean up of many old device proxy instances. Both the subscription and device proxy cleanup threads had released the GIL. (That makes me wonder if it is safe to release the GIL while Python is doing a garbage collection cycle...)
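Whether code is running inside a garbage-collection cycle can at least be observed from Python: `gc.callbacks` reports when a collection pass starts and stops (a diagnostic sketch, unrelated to any actual fix):

```python
import gc

phases = []

def on_gc(phase, info):
    # phase is "start" or "stop"; info carries the generation being collected
    phases.append((phase, info["generation"]))

gc.callbacks.append(on_gc)
gc.collect()  # explicit full collection (generation 2)
gc.callbacks.remove(on_gc)

print(phases)  # includes ('start', 2) and ('stop', 2) for the explicit pass
```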
Sorry for super-long comment, but I didn't want to cut the back traces:
@jairomoldes and @tiagocoutinho It would be useful to know what you have already tried, so that we don't repeat those efforts. |
We tried to intercept the DeviceProxy destructor. In this hook we release the GIL and try to unsubscribe from all events. Fortunately @jairomoldes kept the reference to our tests. |
Hi PyTango experts,
I would like to report an issue in PyTango (or maybe cppTango?) which affects the Sardana project and is actually a big showstopper for us.
In summary: while subscribing to events in one device and destroying a DeviceProxy in another one in parallel, the device server hangs forever.
I think that all the necessary information and the steps to reproduce it are in https://github.com/reszelaz/test-tango. Note that this may be a similar/related issue to #292 so I mention here @bourtemb and @mliszcz who were involved in this investigation. Thanks in advance for looking into this as well!
If you have any questions, don't hesitate to ask.
Cheers,
Zibi