listener cannot receive the data after restarting container talker node. #349
Comments
You might want to look at the details on this; I believe this is DDSI-RTPS related. Thanks.
When this happens, I always restart the listener. Otherwise, the listener cannot receive what the talker publishes.
Yes, this could be a workaround, but it is a huge constraint when it comes to debugging data. For example, when using rosbag to capture a certain time window, the probe (listener) starts once and then keeps saving data. Restarting the listener is not a good solution in this situation.
I have the same problem. Specs:
Steps to reproduce:
With 10 runs of the subscriber, it printed messages on exactly one of them - the first one. I repeated this test twice with the same results. Other observations:
Hypothesis: At this point, my guess is that the publisher is holding an open reference to the IP address and PID of the subscriber, which is causing it to ignore when the subscriber disconnects and reconnects. Rotating either of these on the subscriber "fixes" the problem, but this is still not great and could cause "transient" errors over time (e.g. when the PID counter rolls over - the default Linux Is there someone with some background on the rmw publisher code that could look into this?
This issue has been mentioned on ROS Discourse. There might be relevant details there: https://discourse.ros.org/t/robotics-distributed-system-based-on-kubernetes/12558/54
i think that your expectation is correct,
I think we could start by changing the Host singleton. On POSIX based systems we could return a value based on the 32-bit integer returned by In the future, we may change the way GUIDs are computed in the place @fujitatomoya pointed to, perhaps extending the size for the host part to 32 bits and only using 16 bits for the participant index part. That would be a bigger change, though. |
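To make the proposed change concrete, here is a minimal sketch comparing a prefix with a 16-bit host part against one with a 32-bit host part and a 16-bit participant index. The function names, the vendor bytes, the byte order, and the example host values are all assumptions for illustration; this is not the Fast RTPS implementation.

```python
import struct

VENDOR = b"\x01\x0f"  # placeholder vendor bytes (assumption)


def prefix_host16(host_id: int, pid: int, participant: int) -> bytes:
    """12-byte prefix: vendor(2) + host(2) + pid(4) + participant(4)."""
    return VENDOR + struct.pack(
        "<HII", host_id & 0xFFFF, pid & 0xFFFFFFFF, participant & 0xFFFFFFFF
    )


def prefix_host32(host_id: int, pid: int, participant: int) -> bytes:
    """12-byte prefix: vendor(2) + host(4) + pid(4) + participant(2)."""
    return VENDOR + struct.pack(
        "<IIH", host_id & 0xFFFFFFFF, pid & 0xFFFFFFFF, participant & 0xFFFF
    )


# Two hypothetical 32-bit host identifiers that share their low 16 bits.
host_a, host_b = 0x0A0B0002, 0x0C0D0002

# Truncating the host part to 16 bits makes the two prefixes identical.
print(prefix_host16(host_a, 256, 0) == prefix_host16(host_b, 256, 0))
# Keeping all 32 bits of the host part keeps them distinct.
print(prefix_host32(host_a, 256, 0) == prefix_host32(host_b, 256, 0))
```

Both layouts stay at 12 bytes, so the trade-off in this sketch is purely between host entropy and the number of participants that can be indexed per process.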
Thanks for the information.
"The 32-bit identifier is intended to be unique among all UNIX systems in existence." (from man GETHOSTID(3)) Using the host ID will fix this problem, since the random container name is generated based on a random ID. (https://github.com/docker/engine/blob/e6d949b9e707c55700c545614d25713bb191aed8/daemon/names.go#L38-L56)
If GETHOSTID stably and uniquely identifies the current host, won't that put us in the same boat as we're seeing when a ROS2 process running in Docker has the same IP and PID, just with a native host? I would expect the reconnect problems to be worse then, since the GUID would never change and would be identical for all processes on the same host. Or are you suggesting tacking GETHOSTID onto the existing GUID? That would certainly add entropy, but I don't know if it solves the root problem of the publisher not realizing when a subscriber is reconnecting. I know next to nothing about FastRTPS... is there some other indicator (some kind of session ID, sequence ID, health check etc) that we could check to make the publisher forget about a previously subscribed listener? |
All we need is a unique identifier to tell hosts apart, so I believe the host ID can be used.
A container would have the same IP and PID, but not the same host ID, so I do not think there would be a problem.
@fujitatomoya @smartin015 Have you tried with Foxy? I think this may have been solved since version 1.10.0 of Fast RTPS.
Thanks, I will confirm it and get back to you.
Using ros2/ros2@0444bff, I confirmed that the problem can no longer be reproduced. Closing this issue.
This issue has been mentioned on ROS Discourse. There might be relevant details there: |
Quick heads-up: This issue is still around in head FastRTPS (eProsima/Fast-DDS#1633), and will bite everyone who tries to run two ROS containers in the same Kubernetes pod.
Thanks for the heads-up, we will look into eProsima/Fast-DDS#1633.
Bug report
Required Info:
Steps to reproduce issue
$ ros2 run demo_nodes_cpp listener
[INFO] [listener]: I heard: [Hello World: 1]
[INFO] [listener]: I heard: [Hello World: 2]
[INFO] [listener]: I heard: [Hello World: 3]
[INFO] [listener]: I heard: [Hello World: 4]
[INFO] [listener]: I heard: [Hello World: 5]
[INFO] [listener]: I heard: [Hello World: 6]
[INFO] [listener]: I heard: [Hello World: 7]
[INFO] [listener]: I heard: [Hello World: 8]
[INFO] [listener]: I heard: [Hello World: 9]
[INFO] [listener]: I heard: [Hello World: 10]
[INFO] [listener]: I heard: [Hello World: 11]
[INFO] [listener]: I heard: [Hello World: 12]
[INFO] [listener]: I heard: [Hello World: 13]
[INFO] [listener]: I heard: [Hello World: 14]
[INFO] [listener]: I heard: [Hello World: 15]
[INFO] [listener]: I heard: [Hello World: 16]
[INFO] [listener]: I heard: [Hello World: 17]
[INFO] [listener]: I heard: [Hello World: 18]
[INFO] [listener]: I heard: [Hello World: 19]
[INFO] [listener]: I heard: [Hello World: 20]
[INFO] [listener]: I heard: [Hello World: 21]
[INFO] [listener]: I heard: [Hello World: 22]
-> listener CANNOT receive the data after talker container restarts, please check the following procedure.
$ docker run ros2_eloquent ros2 run demo_nodes_cpp talker
[INFO] [talker]: Publishing: 'Hello World: 1'
[INFO] [talker]: Publishing: 'Hello World: 2'
[INFO] [talker]: Publishing: 'Hello World: 3'
[INFO] [talker]: Publishing: 'Hello World: 4'
[INFO] [talker]: Publishing: 'Hello World: 5'
[INFO] [talker]: Publishing: 'Hello World: 6'
[INFO] [talker]: Publishing: 'Hello World: 7'
[INFO] [talker]: Publishing: 'Hello World: 8'
[INFO] [talker]: Publishing: 'Hello World: 9'
[INFO] [talker]: Publishing: 'Hello World: 10'
[INFO] [talker]: Publishing: 'Hello World: 11'
[INFO] [talker]: Publishing: 'Hello World: 12'
[INFO] [talker]: Publishing: 'Hello World: 13'
[INFO] [talker]: Publishing: 'Hello World: 14'
[INFO] [talker]: Publishing: 'Hello World: 15'
[INFO] [talker]: Publishing: 'Hello World: 16'
[INFO] [talker]: Publishing: 'Hello World: 17'
[INFO] [talker]: Publishing: 'Hello World: 18'
[INFO] [talker]: Publishing: 'Hello World: 19'
[INFO] [talker]: Publishing: 'Hello World: 20'
[INFO] [talker]: Publishing: 'Hello World: 21'
...
$ docker ps
CONTAINER ID   IMAGE           COMMAND                  CREATED         STATUS        PORTS   NAMES
0901653c0f2d   ros2_eloquent   "/ros_entrypoint.sh …"   3 seconds ago   Up 1 second           unruffled_tereshkova
$ docker exec 0901653c0f2d ps -ef | grep talker
root 1 0 6 03:05 ? 00:00:00 /usr/bin/python3 /opt/ros/eloquent/bin/ros2 run demo_nodes_cpp talker
root 256 1 2 03:05 ? 00:00:00 /opt/ros/eloquent/lib/demo_nodes_cpp/talker
-> talker Process ID in container is 256. listener on host can receive the data.
$ docker rm -f 0901653c0f2d
0901653c0f2d
-> kill talker container.
$ docker run ros2_eloquent ros2 run demo_nodes_cpp talker
[INFO] [talker]: Publishing: 'Hello World: 1'
[INFO] [talker]: Publishing: 'Hello World: 2'
[INFO] [talker]: Publishing: 'Hello World: 3'
[INFO] [talker]: Publishing: 'Hello World: 4'
[INFO] [talker]: Publishing: 'Hello World: 5'
[INFO] [talker]: Publishing: 'Hello World: 6'
[INFO] [talker]: Publishing: 'Hello World: 7'
[INFO] [talker]: Publishing: 'Hello World: 8'
[INFO] [talker]: Publishing: 'Hello World: 9'
[INFO] [talker]: Publishing: 'Hello World: 10'
-> restart the talker container; the listener on the host CANNOT receive the data.
$ docker ps
CONTAINER ID   IMAGE           COMMAND                  CREATED         STATUS         PORTS   NAMES
3da0fec3d7f9   ros2_eloquent   "/ros_entrypoint.sh …"   6 seconds ago   Up 4 seconds           angry_pascal
$ docker exec 3da0fec3d7f9 ps -ef | grep talker
root 1 0 5 03:06 ? 00:00:00 /usr/bin/python3 /opt/ros/eloquent/bin/ros2 run demo_nodes_cpp talker
root 256 1 2 03:06 ? 00:00:00 /opt/ros/eloquent/lib/demo_nodes_cpp/talker
-> talker Process ID in container is also 256.
Expected behavior
Listener receives the data.
Actual behavior
Listener does not receive the data.
Additional information
Because the application is assigned the same PID (process ID) after the container restarts, the DDS reader recognizes the DDS Domain GUID as the same one it saw before.
https://github.com/eProsima/Fast-RTPS/blob/b4f8d12c0e909d3a76e08bd510fd1718c081bb57/src/cpp/rtps/RTPSDomain.cpp#L119-L157
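The collision can be illustrated with a minimal sketch, assuming the GUID prefix is packed from a vendor id, a 16-bit host value derived from the IP address, the PID, and a participant counter. The field widths, byte order, vendor bytes, and the example host value below are assumptions for illustration, not copied from the linked source.

```python
import struct

VENDOR = b"\x01\x0f"  # placeholder vendor bytes (assumption)


def guid_prefix(host_id: int, pid: int, participant_id: int) -> bytes:
    """Pack a 12-byte prefix: vendor(2) + host(2) + pid(4) + participant(4)."""
    return VENDOR + struct.pack(
        "<HII",
        host_id & 0xFFFF,
        pid & 0xFFFFFFFF,
        participant_id & 0xFFFFFFFF,
    )


# First talker container: hypothetical host value, PID 256 as in the log above.
first_run = guid_prefix(host_id=0x11AC, pid=256, participant_id=0)

# Restarted container: assumed to get the same IP-derived host value,
# and the talker again runs as PID 256.
second_run = guid_prefix(host_id=0x11AC, pid=256, participant_id=0)

# Identical inputs give an identical prefix, so a remote reader
# cannot distinguish the new talker from the old one.
print(first_run.hex(), second_run.hex(), first_run == second_run)
```

Changing any one input (for example a different PID) yields a different prefix, which is consistent with the observation that the problem disappears when the PID or IP address changes.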
Feature request
Feature description
Implementation considerations
according to https://www.omg.org/spec/DDSI-RTPS/2.3/PDF,
8.2.4.1 Identifying RTPS entities: The GUID
The GUID (Globally Unique Identifier) is an attribute of all RTPS Entities and uniquely identifies the Entity within a DDS Domain.