-
Notifications
You must be signed in to change notification settings - Fork 44
Use OmniThreads in MultiDeviceTestContext #387
Use OmniThreads in MultiDeviceTestContext #387
Conversation
Thanks for the suggestion. I doesn't solve the problem though. From my previous digging, it is in the cppTango layer. There is some global state that isn't cleaned up/re-initialised correctly which causes the seg fault. The workaround are using A simple way to cause the problem (which also fails with the code changes in this PR):
import tango
from tango.server import Device
from tango.test_context import DeviceTestContext
with DeviceTestContext(Device, process=False):
pass
with DeviceTestContext(Device, process=False): # this will crash
pass Trying to launch the second device causes the seg fault when the server is starting up. I think we need the cppTango experts to come to the rescue! More detailed stack trace (from a Conda environment):
|
Hi @ajoubertza , just a courtesy message to say thanks for the review. I intend to follow up on this PR but it might be a few weeks before I have time. |
Hi, You may attempt to work-around this by resetting things. I was able to re-initialize a simple c++ device server without any crashes with the following code (it's just a PoC, note that there may be all sorts of bugs and memory leaks and it may not work with different cppTango version due to access to the private variables). #include <tango.h>
#include <dserverclass.h>
#include <dserversignal.h>
#include <eventsupplier.h>
void reset_util();
void reset_dserver();
void reset_dserver_signal();
void reset_zmq_event_supplier();
int main(int argc, char** argv)
{
Tango::Util *util = Tango::Util::init(argc, argv);
util->server_init(false);
util->server_run();
// We need to reset at least these ...
delete util;
reset_util();
reset_dserver();
reset_dserver_signal();
reset_zmq_event_supplier();
// Another server loop
util = Tango::Util::init(argc, argv);
util->server_init(false);
util->server_run();
return 0;
}
template <auto ptr>
struct ResetUtil {
friend void reset_util() {
*ptr = nullptr;
}
};
template <auto ptr>
struct ResetDServer {
friend void reset_dserver() {
*ptr = nullptr;
}
};
template <auto ptr>
struct ResetDServerSignal {
friend void reset_dserver_signal() {
delete (**ptr).sig_th;
*ptr = nullptr;
}
};
template <auto ptr>
struct ResetZmqEventSupplier {
friend void reset_zmq_event_supplier() {
*ptr = nullptr;
}
};
template struct ResetUtil<&Tango::Util::_instance>;
template struct ResetDServer<&Tango::DServerClass::_instance>;
template struct ResetDServerSignal<&Tango::DServerSignal::_instance>;
template struct ResetZmqEventSupplier<&Tango::ZmqEventSupplier::_instance>; |
Thanks for the detailed example, @mliszcz. I guess we can try it out on the PyTango extension and see what happens. Maybe if it works well, we can consider adding some reset methods to cppTango. |
The general idea is to destroy all the singleton instances (by calling destructor) and then reset the static member that points to the instance (because only then we will initialize them again). The problem is that sometimes the destructors are not defined or do not release resources allocated in constructor (e.g. DServerSignal does not stop signal handling thread). I just had one more look at the implementation and it seems that signal handler thread simply cannot be stopped as it lacks any exit handling. I agree that the proper solution would be to change cppTango to clean-up its resources. But this most likely will break ABI and we can have that only in 9.4. Also, this cleanup would be hard to get right even by changing cppTango directly. Doing this from PyTango seems possible but even harder. |
I think there are separate issues of
I don't have time to chase that new bug right now, let alone progress this, so I will close this PR. |
Per https://pytango.readthedocs.io/en/stable/howto.html#using-clients-with-multithreading
and in the documentation of
tango.test_context.MultiDeviceTestContext
, it says:From personal experience these segfaults in thread mode are quite common, to the extent that there isn't a lot of point in running
MultiDeviceTestContext
withprocess=False
.In thread mode,
MultiDeviceTestContext
uses python'sthreading
module, and does not use the OmniThread work-around described above. Logic suggests that the segfaults that occur in threading mode might be eliminated by the workaround.This merge request implements the OmniThread workaround into
MultiDeviceTestContext
.The only detail to be noted is:
Unfortunately I don't know how to add a unit test for this.
I can't even test it informally: I have code that is currently segfaulting when (and only when)
MultiDeviceTestContext
is inprocess=False
mode. But when I spin up a docker image with editable installs of both tango and my repo, my code stubbornly refuses to segfault in there.