
EF_AF_XDP_ZEROCOPY env var not working, and hence app performance degraded or on par #5

Closed
shirshen12 opened this issue Dec 16, 2020 · 28 comments

Comments

@shirshen12

shirshen12 commented Dec 16, 2020

Hello Onload Team,

I have been testing Onload with AF_XDP support on a 10 GbE Intel 82599. I have updated the NIC driver to 5.9.4. This version supports the AF_XDP zero-copy primitive, as referenced here:

[Screenshot: 2020-12-16 at 10:33:24 PM]

But when I start Onload using an edited latency profile (copied to a new latency-af-xdp.opf), please see below:

[Screenshot: 2020-12-16 at 10:36:38 PM]

and start the application, I see no stacks being created.

[Screenshot: 2020-12-16 at 10:37:46 PM]

[Screenshot: 2020-12-16 at 10:39:55 PM]

Also, as a result, the config var is not set either.

Could your team please look into this?

Note: when EF_AF_XDP_ZEROCOPY is not enabled, the stacks are created fine, and with ixgbe ZC enabled I do see a marginal throughput increase of about 5% for payloads under 3 KB.
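For context, a minimal sketch of what the edited profile might contain (an assumption based on the stock latency.opf plus the zero-copy variable; the actual contents are in the screenshot above):

# latency-af-xdp.opf - assumed sketch: whatever the stock latency profile already sets, plus:
onload_set EF_POLL_USEC 100000     # spin on the net queues, as in the stock latency profile
onload_set EF_AF_XDP_ZEROCOPY 1    # the variable this issue is about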

@maciejj-xilinx
Contributor

maciejj-xilinx commented Dec 17, 2020

Looking at https://elixir.bootlin.com/linux/v4.18/source/net/xdp/xdp_umem.c#L43, the vanilla 4.18 kernel requires the driver to support the XDP_QUERY_XSK_UMEM command. Looking at ixgbe-5.9.4/src/ixgbe_main.c, that driver does not support the XDP_QUERY_XSK_UMEM command.

This indicates that ixgbe's support for AF_XDP zero-copy is not compatible with 4.18 (though I have not checked RHEL kernel backports).
The kernel removed the requirement for drivers to implement XDP_QUERY_XSK_UMEM in version 4.20.
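A quick way to check whether a given out-of-tree driver source handles that command (paths here are illustrative):

# No matches means the driver does not implement the XDP_QUERY_XSK_UMEM command
grep -rn "XDP_QUERY_XSK_UMEM" ixgbe-5.9.4/src/
# The running kernel decides whether the command is required at all (requirement removed in 4.20)
uname -r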

@shirshen12
Author

shirshen12 commented Dec 17, 2020

So @maciejj-xilinx, do you suggest that I bump the kernel version to 5.3+, or downgrade the ixgbe driver? This might be a problem in general, since CentOS 8 / RHEL 8 is usually the deployed OS (given that we are dealing with stability issues, etc.).

Any advice is appreciated.

@shirshen12
Author

shirshen12 commented Dec 17, 2020

So @maciejj-xilinx, as you mentioned, a patch was indeed applied to remove the XDP_QUERY_XSK_UMEM requirement. Here is the patch that removed it: https://patchwork.ozlabs.org/project/netdev/patch/20190213170729.13845-1-bjorn.topel@gmail.com/

So it does look like I need to find a driver version that supports this command.

@maciejj-xilinx
Contributor

maciejj-xilinx commented Dec 17, 2020

It looks like RHEL 8.2 with kernel linux-4.18.0-193.el8 contains the fix.

Alternatively, I can offer a patch that applies to the ixgbe 5.9.4 driver source.
Edit: it actually does not look like ixgbe 5.9.4 gets compiled with AF_XDP ZC support, at least on RHEL 8.0.

@shirshen12
Author

@maciejj-xilinx the latest driver does not compile with the patch.

@shirshen12
Author

Sorry, I just saw your message. Yes, you are right.

@shirshen12
Author

shirshen12 commented Dec 17, 2020

The minimum ixgbe driver version with AF_XDP ZC support is ixgbe-5.6.5. The stock driver bundled with CentOS 8.1 is ixgbe-5.1.x. So, can we apply the patch to ixgbe-5.6.5?
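For reference, a quick way to compare the driver currently bound to the interface against the module that would be loaded next (interface name is an example):

# Driver build currently bound to the interface
ethtool -i enp1s0 | grep -E '^driver|^version'
# Module version that modprobe would load after a reload
modinfo ixgbe | grep -E '^filename|^version'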

@maciejj-xilinx
Contributor

Currently, Onload with AF_XDP zero-copy is most extensively tested on Ubuntu 20.04 with the 5.4.0-42-generic kernel.

@shirshen12
Author

shirshen12 commented Dec 17, 2020

OK, thanks @maciejj-xilinx. Let me test with that, get some validation numbers, and report back. I wanted to let you know that I will start a bigger compatibility test once the numbers look good on one platform:

Combination:
ixgbe, i40e - CentOS/Red Hat 8.x

VM and container testing will follow in the coming weeks, based on virtio and virtio hardware offload.

@maciejj-xilinx
Contributor

My understanding now is that the RHEL 8.2 kernel 4.18.0-193.el8.x86_64, with its in-distro ixgbe driver version 5.1.0-k-rh8.2.0, supports AF_XDP zero-copy out of the box.

@shirshen12
Author

Hello @maciejj-xilinx, I have been able to validate the above claim. The stack is being created.

[root@shirbare13 profiles]# onload -p latency-af-xdp memcached -m 24576 -c 1024 -t 8 -u root -l 45.76.38.79:11211
oo:memcached[83895]: Using Onload 20201218 [1]
oo:memcached[83895]: Copyright 2019-2020 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks

I have also verified via the onload_stackdump command:
[root@shirbare13 ~]# onload_stackdump lots | grep XDP
EF_XDP_MODE: 0
EF_AF_XDP_ZEROCOPY: 1 (default: 0)
env: EF_AF_XDP_ZEROCOPY=1

Let me run some benchmarks now.

@shirshen12
Author

Hello @maciejj-xilinx ,

I tried to run some benchmarks on memcached with the stock Intel driver, and performance is completely degraded with EF_AF_XDP_ZEROCOPY enabled. Please see the details below:

Driver version:
[root@shirbare15 ~]# ethtool -i enp1s0
driver: ixgbe
version: 5.1.0-k-rh8.2.0

EF_AF_XDP_ZEROCOPY enabled:
[root@shirbare14 ~]# onload_stackdump lots | grep XDP
EF_XDP_MODE: 0
EF_AF_XDP_ZEROCOPY: 1 (default: 0)
env: EF_AF_XDP_ZEROCOPY=1

The profile enabled is the same as above, stacks are being created, and I have started memcached.
[root@shirbare14 profiles]# onload -p latency-af-xdp memcached -m 24576 -c 1024 -t 8 -u root -l 140.82.8.233:11211
oo:memcached[4612]: Using Onload 20201219 [7]
oo:memcached[4612]: Copyright 2019-2020 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks

The TPS for a 16-byte key and 64-byte value is:
[root@shirbare15 ~]# ./memcached-bench.py
16,64,Run time: 60.0s Ops: 425530 TPS: 7090 Net_rate: 0.8M/s

Compared to 362K TPS for the same payload profile, that is a degradation of roughly 50x. :D

When I revert to the 5.9.4 driver, I get the same performance back by just starting Onload with ZC mode disabled, including the 5% enhancement.

I am not sure what's going on. I will check on Ubuntu for now, since you mentioned it is well tested there.

@shirshen12
Author

Hi @maciejj-xilinx ,

I have been testing the Ubuntu 20.04 release with Onload. It does look relatively stable, and TPS (transactions per second) is around 380K with the latency profile on Intel 82599, with the ixgbe driver upgraded to 5.9.4.

But when I enable AF_XDP mode in latency.opf, I keep getting this error stack in dmesg on kernel release 5.4.0-54-generic.

Please see the image below. The server, once onloaded, never returns any response. Any help is appreciated.

[Screenshot: 2020-12-19 at 3:47:17 PM]

@shirshen12
Author

Hi @maciejj-xilinx, I decided to dig a bit further on Ubuntu 20.04, 5.4.0-54-generic. When memcached is onloaded, the picture below shows the ncurses-based profile from perf top -p <PID>:

[Screenshot: 2020-12-19 at 9:16:06 PM]

We see an exorbitant amount of time, roughly 75%, being spent in Onload epoll routines, and just 1% being spent in memcached's actual function for retrieving the values.

For comparison, when memcached runs in kernel mode, see the profile snapshot below:

[Screenshot: 2020-12-19 at 9:24:10 PM]

Barring do_syscall_64(), which wraps a perf call, the memcached function is being exercised 2.10% of the time, roughly double!

and the ixgbe poll-mode driver path is also at roughly 2%.

Is there anything we can do to reduce the massive Onload overhead? In ef_vi mode with Solarflare NICs, these functions normally accounted for at most 20% overhead.
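For anyone reproducing this comparison, a sketch of how the two profiles above can be captured (PID and duration are placeholders):

# Live view, as used for the screenshots above
perf top -p <memcached_pid>
# Or record both runs (onloaded and kernel-mode) for an offline side-by-side
perf record -g -p <memcached_pid> -- sleep 30
perf report --stdio | head -40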

@maciejj-xilinx
Contributor

> Hi @maciejj-xilinx ,
>
> I have been testing the Ubuntu 20.04 release with Onload. It does look relatively stable, and TPS (transactions per second) is around 380K with the latency profile on Intel 82599, with the ixgbe driver upgraded to 5.9.4.
>
> But when I enable AF_XDP mode in latency.opf, I keep getting this error stack in dmesg on kernel release 5.4.0-54-generic.
>
> Please see the image below. The server, once onloaded, never returns any response. Any help is appreciated.
>
> [Screenshot: 2020-12-19 at 3:47:17 PM]

This looks like a crash during stack clean-up. We have raised internal issue ON-12824.

@maciejj-xilinx
Contributor

Regarding the overhead: with the latency profile, Onload tends to spin, that is, busy-loop on the network device queues.
Spinning is good for latency but defeats power efficiency. The overhead is very small if the application is close to its capacity.

The test shows a throughput of 380K TPS at a 1 kB payload. That is roughly 3 Gbps, far from saturating the link, and it indicates that the link should saturate at around 1M TPS.

Ideally, your client would then have at least 1M TPS capacity. If you are not certain the client has enough oomph, you can use three clients in parallel.
Note that throughput can be limited by latency if the client does not offer enough concurrency or pipelining.
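To separate the cost of spinning from real work, one option (a sketch only, not something verified in this thread) is to rerun the same test without the spinning latency profile, since EF_POLL_USEC defaults to 0, i.e. no spin:

# Back-of-the-envelope: 380,000 TPS x ~1 kB x 8 bits ~= 3 Gbps on a 10 Gbps link,
# so saturation would land at roughly 1M TPS of this size, as noted above.
# Rerun with zero-copy still requested but no latency profile, then compare perf top:
EF_AF_XDP_ZEROCOPY=1 onload memcached -m 24576 -c 1024 -t 8 -u root -l 140.82.8.233:11211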

@shirshen12
Author

@maciejj-xilinx is EF_STACK_PER_THREAD a valid option for Onload in AF_XDP mode?

@maciejj-xilinx
Contributor

The trouble with memcached is that it creates a single listening socket (unless it has changed recently), and by that fact alone only a single stack is created, as all the accepted sockets end up in the same stack.

In our recent whitepaper on memcached with Onload (https://china.xilinx.com/publications/resutls/onload-memcached-performance-results.pdf ) we used a separate memcached instance per core.

In the past we have found multiple points of contention in memcached with multiple threads - not just the listening socket.
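A sketch of how to observe this directly (flags reused from earlier in the thread; EF_STACK_PER_THREAD set purely as an illustration):

# Even with EF_STACK_PER_THREAD=1, a single-listener server such as memcached
# ends up with one stack, which the stack list will show:
EF_STACK_PER_THREAD=1 onload -p latency-af-xdp memcached -m 24576 -c 1024 -t 8 -u root -l 140.82.8.233:11211
onload_stackdump | head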

@shirshen12
Author

I think the above fact is validated by this study.

I will also post some numbers shortly for key = 16 bytes and value = 64 bytes.

@shirshen12
Author

> The trouble with memcached is that it creates a single listening socket (unless it has changed recently), and by that fact alone only a single stack is created, as all the accepted sockets end up in the same stack.

I think the above fact is validated by this study.

> In our recent whitepaper on memcached with Onload (https://china.xilinx.com/publications/resutls/onload-memcached-performance-results.pdf ) we used a separate memcached instance per core.

Yes, and I have also checked the detailed work on memcached benchmarking here. It is a little on the older side.

> In the past we have found multiple points of contention in memcached with multiple threads - not just the listening socket.

@maciejj-xilinx
Contributor

I'd expect this to work in general, as long as there are enough hardware queues set up on the NIC to allow creation of stacks.
There are some limitations with Onload that stop EF_STACK_PER_THREAD from taking effect, like memcached's: a server creating a common listening socket instead of a separate socket in each thread.
With AF_XDP, each thread would need to listen on a separate port, as SO_REUSEPORT is not yet supported with AF_XDP Onload.
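Until SO_REUSEPORT lands, a sketch of the instance-per-core pattern from the whitepaper, adapted to the separate-port constraint (core count, ports and address are examples):

# One single-threaded memcached per core, each on its own port, each in its own stack
for i in 0 1 2 3; do
  taskset -c $i onload -p latency-af-xdp \
    memcached -u root -t 1 -p $((11211 + i)) -l 140.82.8.233 &
done
# Confirm one Onload stack per instance
onload_stackdump | head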

@shirshen12
Author

Any plans on supporting SO_REUSEPORT on AF_XDP for Onload?

@maciejj-xilinx
Contributor

> Any plans on supporting SO_REUSEPORT on AF_XDP for Onload?

We are currently working on it.

@shirshen12
Author

Hello @maciejj-xilinx,

Any status on SO_REUSEPORT on AF_XDP for Onload?

@maciejj-xilinx
Contributor

Hi, we are still figuring out internally how best to share our roadmap.
The feature in question is planned for our next release. Worth noting that this is not a trivial feature.

@shirshen12
Author

Sounds good, thanks for the update @maciejj-xilinx

@h2cw2l

h2cw2l commented Apr 9, 2021

@shirshen12, @maciejj-xilinx
Hello, the NIC and OS information is below. Can Onload run on this device?

NIC:
[root@A03-R05-I139-66-FVP3HP2 onload]# ethtool -i eth0
driver: ixgbe
version: 5.1.0-k-rh8.2.0
firmware-version: 0x8000090c, 18.3.6
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

OS:
[root@A03-R05-I139-66-FVP3HP2 onload]# cat /proc/version
Linux version 4.18.0-240.10.1.el8_3.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Mon Jan 18 17:05:51 UTC 2021
[root@A03-R05-I139-66-FVP3HP2 onload]#

Thank you very much.

@shirshen12
Author

@h2cw2l I think it can. Please see the instructions below for Onload over AF_XDP on Intel 82599 on RHEL 8.x:

yum update -y
yum install binutils gettext gawk gcc sed make bash glibc-common libcap-devel libmnl-devel perl-Test-Harness hmaccalc zlib-devel binutils-devel elfutils-libelf-devel libevent-devel

Remove unused kernels
dnf remove $(dnf repoquery --installonly --latest-limit=-1 -q)

Install latest Intel ixgbe driver

  1. Install build prerequisites: yum groupinstall "Development Tools"
  2. Download the latest driver: wget https://downloadmirror.intel.com/14687/eng/ixgbe-5.9.4.tar.gz
  3. rpmbuild -tb ixgbe-<x.x.x>.tar.gz
  4. cd /root/rpmbuild/RPMS/x86_64 && dnf localinstall ixgbe-*.rpm
  5. rmmod ixgbe; modprobe ixgbe
  6. Verify: ethtool -i enp1s0

Enable hugepages

https://www.golinuxcloud.com/configure-hugepages-vm-nr-hugepages-red-hat-7/

Build and install Onload

git clone https://github.com/Xilinx-CNS/onload.git
cd onload

optional, until Onload fixes its master branch:
git reset --hard e9d90b2

scripts/onload_mkdist --release
cd onload-20201212/scripts/
./onload_install
./onload_tool reload

install XDP tools

yum install clang llvm
dnf --enablerepo=PowerTools install libpcap-devel
cd ~
git clone https://github.com/xdp-project/xdp-tools.git
cd xdp-tools
git submodule update --init
./configure && make && make install
echo enp1s0 > /sys/module/sfc_resource/afxdp/register

enable the flow director

ethtool --features enp1s0 ntuple on

enable port 11211

firewall-cmd --zone=public --permanent --add-service=memcache
firewall-cmd --reload

install python2 for our benchmarking script

yum install python2

set shared memory and hugepage limits

echo 1000000000 > /proc/sys/kernel/shmmax
echo 800 > /proc/sys/vm/nr_hugepages

install screen

dnf install epel-release -y
yum install screen -y

optional stuff

adduser s.chakrabarti
passwd s.chakrabarti
id s.chakrabarti
usermod -aG wheel s.chakrabarti
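
After the steps above, a quick sanity check that AF_XDP zero-copy is actually in effect (commands taken from earlier in this thread; interface, profile and address are examples):

# Start the application under Onload with the AF_XDP zero-copy profile
onload -p latency-af-xdp memcached -m 24576 -c 1024 -t 8 -u root -l <server_ip>:11211
# Confirm a stack was created and the zero-copy knob is set
onload_stackdump lots | grep -E 'EF_XDP_MODE|EF_AF_XDP_ZEROCOPY'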

cns-ci-onload-xilinx pushed a commit that referenced this issue Feb 1, 2023
Deferring oo_exit_hook() fixes a stuck C++ application:

    #0  0x00007fd2d7afb87b in ioctl () from /lib64/libc.so.6
    #1  0x00007fd2d80c0621 in oo_resource_op (cmd=3221510722, io=0x7ffd15be696c, fp=<optimized out>) at /home/iteterev/lab/onload_internal/src/include/onload/mmap.h:104
    #2  __oo_eplock_lock (timeout=<synthetic pointer>, maybe_wedged=0, ni=0x20c8480) at /home/iteterev/lab/onload_internal/src/lib/transport/ip/eplock_slow.c:35
    #3  __ef_eplock_lock_slow (ni=ni@entry=0x20c8480, timeout=timeout@entry=-1, maybe_wedged=maybe_wedged@entry=0) at /home/iteterev/lab/onload_internal/src/lib/transport/ip/eplock_slow.c:72
    #4  0x00007fd2d80d7dbf in ef_eplock_lock (ni=0x20c8480) at /home/iteterev/lab/onload_internal/src/include/onload/eplock.h:61
    #5  __ci_netif_lock_count (stat=0x7fd2d5c5b62c, ni=0x20c8480) at /home/iteterev/lab/onload_internal/src/include/ci/internal/ip_shared_ops.h:79
    #6  ci_tcp_setsockopt (ep=ep@entry=0x20c8460, fd=6, level=level@entry=1, optname=optname@entry=9, optval=optval@entry=0x7ffd15be6acc, optlen=optlen@entry=4) at /home/iteterev/lab/onload_internal/src/lib/transport/ip/tcp_sockopts.c:580
    #7  0x00007fd2d8010da7 in citp_tcp_setsockopt (fdinfo=0x20c8420, level=1, optname=9, optval=0x7ffd15be6acc, optlen=4) at /home/iteterev/lab/onload_internal/src/lib/transport/unix/tcp_fd.c:1594
    #8  0x00007fd2d7fde088 in onload_setsockopt (fd=6, level=1, optname=9, optval=0x7ffd15be6acc, optlen=4) at /home/iteterev/lab/onload_internal/src/lib/transport/unix/sockcall_intercept.c:737
    #9  0x00007fd2d7dcb7dd in ?? ()
    #10 0x00007fd2d83392e0 in ?? () from /home/iteterev/lab/onload_internal/build/gnu_x86_64/lib/transport/unix/libcitransport0.so
    #11 0x000000000060102c in data_start ()
    #12 0x00007fd2d8339540 in ?? () from /home/iteterev/lab/onload_internal/build/gnu_x86_64/lib/transport/unix/libcitransport0.so
    #13 0x00000001d85426c0 in ?? ()
    #14 0x00007fd2d7fcbe08 in ?? ()
    #15 0x00007fd2d7a433c7 in __cxa_finalize () from /lib64/libc.so.6
    #16 0x00007fd2d7dcb757 in ?? ()
    #17 0x00007ffd15be6be0 in ?? ()
    #18 0x00007fd2d834f2a6 in _dl_fini () from /lib64/ld-linux-x86-64.so.2

Here, _fini() is a function that calls all library destructors. The
problem is that _fini() decides to run the C++ library destructor
*after* Onload and makes it operate on an invalid Onload state.

The patch leverages the fact that Glibc sets up _fini() after running
the last library constructor, so by manually installing the exit handler
(instead of providing a library destructor), Onload wins the race with
_fini().

There's still an issue if the user library sets a custom exit handler
with atexit() or on_exit() and makes intercepted system calls from
there.

Tested:

* RHEL 7.9/glibc 2.17
* RHEL 8.2/glibc 2.28
* RHEL 9.1/glibc 2.34

Thanks-to: Richard Hughes <rhughes@xilinx.com>
Thanks-to: Siân James <sian.james@xilinx.com>