
Listener cannot receive data after restarting the talker container node #349

Closed
fujitatomoya opened this issue Feb 21, 2020 · 16 comments
Labels
bug Something isn't working

@fujitatomoya
Collaborator

Bug report

Required Info:

  • Operating System:
    • Ubuntu 18.04
  • Installation type:
    • binaries
  • Version or commit hash:
    • ros:eloquent
  • DDS implementation:
    • Fast-RTPS
  • Client library (if applicable):
    • demo_nodes_cpp

Steps to reproduce issue

$ ros2 run demo_nodes_cpp listener
[INFO] [listener]: I heard: [Hello World: 1]
[INFO] [listener]: I heard: [Hello World: 2]
[INFO] [listener]: I heard: [Hello World: 3]
[INFO] [listener]: I heard: [Hello World: 4]
[INFO] [listener]: I heard: [Hello World: 5]
[INFO] [listener]: I heard: [Hello World: 6]
[INFO] [listener]: I heard: [Hello World: 7]
[INFO] [listener]: I heard: [Hello World: 8]
[INFO] [listener]: I heard: [Hello World: 9]
[INFO] [listener]: I heard: [Hello World: 10]
[INFO] [listener]: I heard: [Hello World: 11]
[INFO] [listener]: I heard: [Hello World: 12]
[INFO] [listener]: I heard: [Hello World: 13]
[INFO] [listener]: I heard: [Hello World: 14]
[INFO] [listener]: I heard: [Hello World: 15]
[INFO] [listener]: I heard: [Hello World: 16]
[INFO] [listener]: I heard: [Hello World: 17]
[INFO] [listener]: I heard: [Hello World: 18]
[INFO] [listener]: I heard: [Hello World: 19]
[INFO] [listener]: I heard: [Hello World: 20]
[INFO] [listener]: I heard: [Hello World: 21]
[INFO] [listener]: I heard: [Hello World: 22]

-> The listener CANNOT receive the data after the talker container restarts; see the following procedure.

$ docker run ros2_eloquent ros2 run demo_nodes_cpp talker
[INFO] [talker]: Publishing: 'Hello World: 1'
[INFO] [talker]: Publishing: 'Hello World: 2'
[INFO] [talker]: Publishing: 'Hello World: 3'
[INFO] [talker]: Publishing: 'Hello World: 4'
[INFO] [talker]: Publishing: 'Hello World: 5'
[INFO] [talker]: Publishing: 'Hello World: 6'
[INFO] [talker]: Publishing: 'Hello World: 7'
[INFO] [talker]: Publishing: 'Hello World: 8'
[INFO] [talker]: Publishing: 'Hello World: 9'
[INFO] [talker]: Publishing: 'Hello World: 10'
[INFO] [talker]: Publishing: 'Hello World: 11'
[INFO] [talker]: Publishing: 'Hello World: 12'
[INFO] [talker]: Publishing: 'Hello World: 13'
[INFO] [talker]: Publishing: 'Hello World: 14'
[INFO] [talker]: Publishing: 'Hello World: 15'
[INFO] [talker]: Publishing: 'Hello World: 16'
[INFO] [talker]: Publishing: 'Hello World: 17'
[INFO] [talker]: Publishing: 'Hello World: 18'
[INFO] [talker]: Publishing: 'Hello World: 19'
[INFO] [talker]: Publishing: 'Hello World: 20'
[INFO] [talker]: Publishing: 'Hello World: 21'
...

$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0901653c0f2d ros2_eloquent "/ros_entrypoint.sh …" 3 seconds ago Up 1 second unruffled_tereshkova
$ docker exec 0901653c0f2d ps -ef | grep talker
root 1 0 6 03:05 ? 00:00:00 /usr/bin/python3 /opt/ros/eloquent/bin/ros2 run demo_nodes_cpp talker
root 256 1 2 03:05 ? 00:00:00 /opt/ros/eloquent/lib/demo_nodes_cpp/talker

-> The talker's process ID inside the container is 256. The listener on the host can receive the data.

$ docker rm -f 0901653c0f2d
0901653c0f2d

-> Kill the talker container.

$ docker run ros2_eloquent ros2 run demo_nodes_cpp talker
[INFO] [talker]: Publishing: 'Hello World: 1'
[INFO] [talker]: Publishing: 'Hello World: 2'
[INFO] [talker]: Publishing: 'Hello World: 3'
[INFO] [talker]: Publishing: 'Hello World: 4'
[INFO] [talker]: Publishing: 'Hello World: 5'
[INFO] [talker]: Publishing: 'Hello World: 6'
[INFO] [talker]: Publishing: 'Hello World: 7'
[INFO] [talker]: Publishing: 'Hello World: 8'
[INFO] [talker]: Publishing: 'Hello World: 9'
[INFO] [talker]: Publishing: 'Hello World: 10'

-> Restart the talker container. The listener on the host CANNOT receive the data.

$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3da0fec3d7f9 ros2_eloquent "/ros_entrypoint.sh …" 6 seconds ago Up 4 seconds angry_pascal

$ docker exec 3da0fec3d7f9 ps -ef | grep talker
root 1 0 5 03:06 ? 00:00:00 /usr/bin/python3 /opt/ros/eloquent/bin/ros2 run demo_nodes_cpp talker
root 256 1 2 03:06 ? 00:00:00 /opt/ros/eloquent/lib/demo_nodes_cpp/talker

-> The talker's process ID inside the new container is also 256.

Expected behavior

Listener receives the data.

Actual behavior

Listener does not receive the data.

Additional information

Because the restarted application is assigned the same PID (process ID), the same DDS participant GUID is generated, so the DDS reader treats the new talker as the participant it already knows.

https://github.com/eProsima/Fast-RTPS/blob/b4f8d12c0e909d3a76e08bd510fd1718c081bb57/src/cpp/rtps/RTPSDomain.cpp#L119-L157
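For illustration only, here is a simplified sketch (not the actual Fast-RTPS code linked above) of how a participant GUID prefix can collide when it is seeded from a host identifier, the process ID, and a participant counter: two freshly started containers from the same image can yield the same host part and the same PID (256 in both runs above), producing an identical prefix. All names below are hypothetical.

// Illustrative sketch only; NOT the real Fast-RTPS implementation.
// If two restarted containers produce the same host_id and the same PID,
// make_guid_prefix() returns the same bytes, so a remote reader believes
// it is still talking to the previous participant.
#include <array>
#include <cstdint>
#include <cstring>
#include <unistd.h>   // getpid()

std::array<uint8_t, 12> make_guid_prefix(uint16_t host_id, uint32_t participant_id)
{
    std::array<uint8_t, 12> prefix{};
    const uint32_t pid = static_cast<uint32_t>(getpid());
    std::memcpy(prefix.data() + 0, &host_id, sizeof(host_id));               // host part
    std::memcpy(prefix.data() + 2, &pid, sizeof(pid));                       // process part
    std::memcpy(prefix.data() + 6, &participant_id, sizeof(participant_id)); // participant part
    return prefix;   // same inputs -> same prefix -> "same" participant
}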

Implementation considerations

According to https://www.omg.org/spec/DDSI-RTPS/2.3/PDF,

8.2.4.1 Identifying RTPS entities: The GUID

The GUID (Globally Unique Identifier) is an attribute of all RTPS Entities and uniquely identifies the Entity within a DDS Domain.

@fujitatomoya
Collaborator Author

@MiguelCompany

You might want to look at the details here; I believe this is DDSI-RTPS related.

Thanks,

@claireyywang claireyywang added the bug Something isn't working label Mar 5, 2020
@aitazhixin

When this happens, I always restart the listener. Otherwise, the listener cannot receive what the talker publishes.

@fujitatomoya
Collaborator Author

@aitazhixin

I always restart the listener. Otherwise, the listener cannot receive what the talker publishes.

Yes, this could be a work-around, but it is a huge constraint when it comes to debugging data.

For example, when using rosbag to probe a certain time window, the probe (listener) starts once and then keeps saving data; restarting the listener is not a good solution in that situation.

@smartin015

smartin015 commented Apr 1, 2020

I have the same problem.

Specs:

  • Docker version 19.03.8, build afacb8b7f0
  • Ubuntu 19.10
  • uname -a: Linux localhost 5.3.0-40-generic #32-Ubuntu SMP Fri Jan 31 20:24:34 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Steps to reproduce:

  1. In one terminal, run docker run -it --rm ros:eloquent-ros-base ros2 topic pub /chatter std_msgs/String "{data: 'test'}" and leave it running.
  2. In another terminal, run docker run -it --rm ros:eloquent-ros-base ros2 topic echo /chatter std_msgs/String
  3. Wait a couple seconds, then observe if any messages from the container in (1) were printed out. Stop the command (Ctrl+C)
  4. Repeat (2) and (3) a few more times

With 10 runs of the subscriber, it printed messages on exactly one of them - the first one. I repeated this test twice with the same results.

Other observations:

  • When I run with --pid=host, the subscriber process connects every time.
  • Using a custom network (docker network create testing, and adding --net testing to commands 1 & 2 above) without any other changes causes the same problem (only connects on the first try; all subsequent tries fail)
  • Using a custom network and switching the IP address of the subscriber (docker network create --subnet=172.18.0.0/16 testing, --net testing on the publisher, --net testing --ip="172.18.0.#" on the subscriber with # changing) also results in the subscriber connecting for every unique address used. If I try to reuse an IP from a previous run, the subscription fails.

Hypothesis:

At this point, my guess is that the publisher is holding an open reference to the IP address and PID of the subscriber, which causes it to ignore the subscriber disconnecting and reconnecting. Rotating either of these on the subscriber "fixes" the problem, but this is still not great and could cause "transient" errors over time (e.g. when the PID counter rolls over; the default Linux /proc/sys/kernel/pid_max value is 32768, and it is shared by all host processes).

Is there someone with some background on the rmw publisher code that could look into this?

@ros-discourse

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/robotics-distributed-system-based-on-kubernetes/12558/54

@fujitatomoya
Collaborator Author

@smartin015

I think your expectation is correct.

Is there someone with some background on the rmw publisher code that could look into this?

https://github.com/eProsima/Fast-RTPS/blob/8492d5a9eae9afc12e97d3c65fc072d4dd873383/src/cpp/rtps/RTPSDomain.cpp#L138-L158

@MiguelCompany
Collaborator

@smartin015 @fujitatomoya

I think we could start by changing the Host singleton. On POSIX-based systems we could return a value based on the 32-bit integer returned by gethostid(), but I don't know whether that would also depend on the IP address being used.

In the future, we may change the way GUIDs are computed in the place @fujitatomoya pointed to, perhaps extending the size for the host part to 32 bits and only using 16 bits for the participant index part. That would be a bigger change, though.
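As a rough sketch of that idea (hypothetical helper names, not the actual Host singleton API), a POSIX host part based on gethostid() could look like the following; the 32-bit value would need to be folded down to 16 bits with today's layout, or used directly if the host part is extended to 32 bits:

// Hypothetical sketch of a gethostid()-based host identifier.
// Note: on Linux without /etc/hostid, gethostid() is typically derived from
// the IPv4 address the hostname resolves to, so uniqueness still depends on
// the host's network configuration.
#include <cstdint>
#include <unistd.h>   // gethostid()

uint16_t host_id_16()                       // fits the current 16-bit host part
{
    const uint32_t id = static_cast<uint32_t>(gethostid());
    return static_cast<uint16_t>((id >> 16) ^ (id & 0xFFFF));   // fold 32 -> 16 bits
}

uint32_t host_id_32()                       // usable if the host part grows to 32 bits
{
    return static_cast<uint32_t>(gethostid());
}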

@fujitatomoya
Collaborator Author

@MiguelCompany

Thanks for the information.

On POSIX based systems we could return a value based on the 32-bit integer returned by gethostid(), but I don't know if that would also depend on the IP address being used.

The 32-bit identifier is intended to be unique among all UNIX systems in existence. (from man GETHOSTID(3))

Using the host id will fix this problem, since each container is given a randomly generated name based on a random id (https://github.com/docker/engine/blob/e6d949b9e707c55700c545614d25713bb191aed8/daemon/names.go#L38-L56).
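To check that assumption, one could compile and run a minimal program like the following inside each container (this requires a C++ compiler in the image, which the ros base images may not include) and compare the printed value between container restarts:

// Prints gethostid() so the value can be compared across container runs.
// If it changes whenever the container is recreated, a gethostid()-based
// host part would avoid the GUID collision described in this issue.
#include <cstdio>
#include <unistd.h>   // gethostid()

int main()
{
    std::printf("gethostid() = 0x%08lx\n", static_cast<unsigned long>(gethostid()));
    return 0;
}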

@smartin015

smartin015 commented Apr 14, 2020

If GETHOSTID stably and uniquely identifies the current host, won't that put us in the same boat as we're seeing when a ROS2 process running in Docker has the same IP and PID, just with a native host? I would expect the reconnect problems to be worse then, since the GUID would never change and would be identical for all processes on the same host.

Or are you suggesting tacking GETHOSTID onto the existing GUID? That would certainly add entropy, but I don't know if it solves the root problem of the publisher not realizing when a subscriber is reconnecting.

I know next to nothing about FastRTPS... is there some other indicator (some kind of session ID, sequence ID, health check etc) that we could check to make the publisher forget about a previously subscribed listener?

@fujitatomoya
Collaborator Author

@smartin015

All we need is a unique identifier to distinguish the host, so I believe the host id can be used.

won't that put us in the same boat as we're seeing when a ROS2 process running in Docker has the same IP and PID, just with a native host?

The container would have the same IP and PID, but not the same host id, so I do not think there would be a problem.

@MiguelCompany
Collaborator

@fujitatomoya @smartin015 Have you tried with Foxy? I think this may have been solved since version 1.10.0 of Fast RTPS.

@fujitatomoya
Collaborator Author

@MiguelCompany

Thanks, I will confirm and get back to you.

@fujitatomoya
Collaborator Author

@MiguelCompany

Using ros2/ros2@0444bff, I confirmed that the problem is no longer reproducible. Closing this issue.

@ros-discourse

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-on-kubernetes/17182/13

@stevewolter

Quick heads-up: This issue is still around in head FastRTPS (eProsima/Fast-DDS#1633), and will bite everyone who tries to run two ROS containers in the same Kubernetes pod.

@fujitatomoya
Collaborator Author

@stevewolter

Thanks for the heads-up; we will look into eProsima/Fast-DDS#1633.
