
EF_AF_XDP_ZEROCOPY env var not working, and hence app performance degraded or on par #5

Closed
shirshen12 opened this issue Dec 16, 2020 · 28 comments

Comments

@shirshen12

shirshen12 commented Dec 16, 2020

Hello Onload Team,

I have been testing Onload with AF_XDP support on a 10 GbE Intel 82599. I have updated the NIC driver to 5.9.4. This version supports the AF_XDP zero-copy primitive, as referenced here:

[Screenshot: 2020-12-16 at 10:33:24 PM]

But when I start Onload using an edited latency profile (copied to a new latency-af-xdp.opf), please see below:

[Screenshot: 2020-12-16 at 10:36:38 PM]

and start the application, I see no stacks being created.

[Screenshot: 2020-12-16 at 10:37:46 PM]

[Screenshot: 2020-12-16 at 10:39:55 PM]

Also, as a result, the config var is not set either.

Could your team please look into this?

Note: when EF_AF_XDP_ZEROCOPY is not enabled, the stacks are created fine, and with ixgbe ZC enabled I do see a marginal throughput increase of about 5% for payloads under 3 KB.
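For context, a minimal sketch of what the edited profile might contain (an assumption based on the stock latency.opf plus the zero-copy variable; the actual contents are in the screenshot above):

# latency-af-xdp.opf - assumed sketch: whatever the stock latency profile already sets, plus:
onload_set EF_POLL_USEC 100000     # spin on the net queues, as in the stock latency profile
onload_set EF_AF_XDP_ZEROCOPY 1    # the variable this issue is about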

@maciejj-xilinx
Contributor

maciejj-xilinx commented Dec 17, 2020

Looking at https://elixir.bootlin.com/linux/v4.18/source/net/xdp/xdp_umem.c#L43, the vanilla 4.18 kernel requires the driver to support the XDP_QUERY_XSK_UMEM command. Looking at ixgbe-5.9.4/src/ixgbe_main.c, that driver does not support the XDP_QUERY_XSK_UMEM command.

This indicates that ixgbe's support for AF_XDP zero-copy is not compatible with 4.18 (though I have not checked RHEL kernel backports).
The kernel removed the requirement for drivers to implement XDP_QUERY_XSK_UMEM in version 4.20.
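A quick way to check whether a given out-of-tree driver source handles that command (paths here are illustrative):

# No matches means the driver does not implement the XDP_QUERY_XSK_UMEM command
grep -rn "XDP_QUERY_XSK_UMEM" ixgbe-5.9.4/src/
# The running kernel decides whether the command is required at all (requirement removed in 4.20)
uname -r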

@shirshen12
Author

shirshen12 commented Dec 17, 2020

So @maciejj-xilinx, do you suggest that I bump the kernel version to 5.3+, or downgrade the ixgbe driver? This might be a problem in general, since CentOS 8 / RHEL 8 is usually the deployed OS (given that we are dealing with stability issues, etc.).

Any advice is appreciated.

@shirshen12
Author

shirshen12 commented Dec 17, 2020

So @maciejj-xilinx, as you mentioned, a patch was indeed applied to remove the XDP_QUERY_XSK_UMEM requirement. Here is the patch that removed it: https://patchwork.ozlabs.org/project/netdev/patch/20190213170729.13845-1-bjorn.topel@gmail.com/

So it does look like I need to find a driver version that supports this command.

@maciejj-xilinx
Contributor

maciejj-xilinx commented Dec 17, 2020

It looks like RHEL 8.2 with kernel linux-4.18.0-193.el8 contains the fix.

Alternatively, I can offer a patch that applies to the ixgbe 5.9.4 driver source.
Edit: it actually does not look like ixgbe 5.9.4 gets compiled with AF_XDP ZC support, at least on RHEL 8.0.

@shirshen12
Author

@maciejj-xilinx the latest driver does not compile with the patch.

@shirshen12
Author

Sorry, I just saw your message. Yes, you are right.

@shirshen12
Author

shirshen12 commented Dec 17, 2020

The minimum ixgbe driver version with AF_XDP ZC support is ixgbe-5.6.5. The stock driver bundled with CentOS 8.1 is ixgbe-5.1.x. So, can we apply the patch to ixgbe-5.6.5?
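For reference, a quick way to compare the driver currently bound to the interface against the module that would be loaded next (interface name is an example):

# Driver build currently bound to the interface
ethtool -i enp1s0 | grep -E '^driver|^version'
# Module version that modprobe would load after a reload
modinfo ixgbe | grep -E '^filename|^version'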

@maciejj-xilinx
Contributor

Currently, Onload with AF_XDP zero-copy is most extensively tested on Ubuntu 20.04 with the 5.4.0-42-generic kernel.

@shirshen12
Author

shirshen12 commented Dec 17, 2020

OK, thanks @maciejj-xilinx. Let me test with that, get some validation numbers, and report back. I wanted to let you know that I will start a bigger compatibility test once the numbers look good on one platform:

Combination:
ixgbe, i40e - CentOS/Red Hat 8.x

VM and container testing will follow in the coming weeks, based on virtio and virtio hardware offload.

@maciejj-xilinx
Contributor

My understanding now is that the RHEL 8.2 kernel 4.18.0-193.el8.x86_64, with its in-distro ixgbe driver version 5.1.0-k-rh8.2.0, supports AF_XDP zero-copy out of the box.

@shirshen12
Author

Hello @maciejj-xilinx, I have been able to validate the above claim. The stack is being created.

[root@shirbare13 profiles]# onload -p latency-af-xdp memcached -m 24576 -c 1024 -t 8 -u root -l 45.76.38.79:11211
oo:memcached[83895]: Using Onload 20201218 [1]
oo:memcached[83895]: Copyright 2019-2020 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks

I have also verified via the onload_stackdump command:
[root@shirbare13 ~]# onload_stackdump lots | grep XDP
EF_XDP_MODE: 0
EF_AF_XDP_ZEROCOPY: 1 (default: 0)
env: EF_AF_XDP_ZEROCOPY=1

Let me run some benchmarks now.

@shirshen12
Author

Hello @maciejj-xilinx ,

I tried to run some benchmarks on memcached with the stock Intel driver, and performance is completely degraded with EF_AF_XDP_ZEROCOPY enabled. Please see the details below:

Driver version:
[root@shirbare15 ~]# ethtool -i enp1s0
driver: ixgbe
version: 5.1.0-k-rh8.2.0

EF_AF_XDP_ZEROCOPY enabled:
[root@shirbare14 ~]# onload_stackdump lots | grep XDP
EF_XDP_MODE: 0
EF_AF_XDP_ZEROCOPY: 1 (default: 0)
env: EF_AF_XDP_ZEROCOPY=1

The profile enabled is the same as above, stacks are being created, and I have started memcached.
[root@shirbare14 profiles]# onload -p latency-af-xdp memcached -m 24576 -c 1024 -t 8 -u root -l 140.82.8.233:11211
oo:memcached[4612]: Using Onload 20201219 [7]
oo:memcached[4612]: Copyright 2019-2020 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks

The TPS for a 16-byte key and 64-byte value is:
[root@shirbare15 ~]# ./memcached-bench.py
16,64,Run time: 60.0s Ops: 425530 TPS: 7090 Net_rate: 0.8M/s

Compared to 362K TPS for the same payload profile, that is a degradation of roughly 50x. :D

When I revert to the 5.9.4 driver, I get the same performance back by just starting Onload with ZC mode disabled, including the 5% enhancement.

I am not sure what's going on. I will check on Ubuntu for now, since you mentioned it is well tested there.

@shirshen12
Author

Hi @maciejj-xilinx ,

I have been testing the Ubuntu 20.04 release with Onload. It does look relatively stable, and TPS (transactions per second) is around 380K with the latency profile on Intel 82599, with the ixgbe driver upgraded to 5.9.4.

But when I enable AF_XDP mode in latency.opf, I keep getting this error stack in dmesg on kernel release 5.4.0-54-generic.

Please see the image below. The server, once onloaded, never returns any response. Any help is appreciated.

[Screenshot: 2020-12-19 at 3:47:17 PM]

@shirshen12
Author

Hi @maciejj-xilinx, I decided to dig a bit further on Ubuntu 20.04, 5.4.0-54-generic. When memcached is onloaded, the picture below shows the ncurses-based profile from perf top -p <PID>:

[Screenshot: 2020-12-19 at 9:16:06 PM]

We see an exorbitant amount of time, roughly 75%, being spent in Onload epoll routines, and just 1% being spent in memcached's actual function for retrieving the values.

For comparison, when memcached runs in kernel mode, see the profile snapshot below:

[Screenshot: 2020-12-19 at 9:24:10 PM]

Barring do_syscall_64(), which wraps a perf call, the memcached function is being exercised 2.10% of the time, roughly double!

and the ixgbe poll-mode driver path is also at roughly 2%.

Is there anything we can do to reduce the massive Onload overhead? In ef_vi mode with Solarflare NICs, these functions normally accounted for at most 20% overhead.
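For anyone reproducing this comparison, a sketch of how the two profiles above can be captured (PID and duration are placeholders):

# Live view, as used for the screenshots above
perf top -p <memcached_pid>
# Or record both runs (onloaded and kernel-mode) for an offline side-by-side
perf record -g -p <memcached_pid> -- sleep 30
perf report --stdio | head -40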

@maciejj-xilinx
Contributor

> Hi @maciejj-xilinx ,
>
> I have been testing the Ubuntu 20.04 release with Onload. It does look relatively stable, and TPS (transactions per second) is around 380K with the latency profile on Intel 82599, with the ixgbe driver upgraded to 5.9.4.
>
> But when I enable AF_XDP mode in latency.opf, I keep getting this error stack in dmesg on kernel release 5.4.0-54-generic.
>
> Please see the image below. The server, once onloaded, never returns any response. Any help is appreciated.
>
> [Screenshot: 2020-12-19 at 3:47:17 PM]

This looks like a crash during stack clean-up. We have raised internal issue ON-12824.

@maciejj-xilinx
Contributor

Regarding the overhead: with the latency profile, Onload tends to spin, that is, busy-loop on the network device queues.
Spinning is good for latency but defeats power efficiency. The overhead is very small if the application is close to its capacity.

The test shows a throughput of 380K TPS at a 1 kB payload. That is roughly 3 Gbps, far from saturating the link, and it indicates that the link should saturate at around 1M TPS.

Ideally, your client would then have at least 1M TPS capacity. If you are not certain the client has enough oomph, you can use three clients in parallel.
Note that throughput can be limited by latency if the client does not offer enough concurrency or pipelining.
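To separate the cost of spinning from real work, one option (a sketch only, not something verified in this thread) is to rerun the same test without the spinning latency profile, since EF_POLL_USEC defaults to 0, i.e. no spin:

# Back-of-the-envelope: 380,000 TPS x ~1 kB x 8 bits ~= 3 Gbps on a 10 Gbps link,
# so saturation would land at roughly 1M TPS of this size, as noted above.
# Rerun with zero-copy still requested but no latency profile, then compare perf top:
EF_AF_XDP_ZEROCOPY=1 onload memcached -m 24576 -c 1024 -t 8 -u root -l 140.82.8.233:11211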

@shirshen12
Author

@maciejj-xilinx is EF_STACK_PER_THREAD a valid option for Onload in AF_XDP mode?

@maciejj-xilinx
Contributor

The trouble with memcached is that it creates a single listening socket (unless it has changed recently), and by that fact alone only a single stack is created, as all the accepted sockets end up in the same stack.

In our recent whitepaper on memcached with Onload (https://china.xilinx.com/publications/resutls/onload-memcached-performance-results.pdf ) we used a separate memcached instance per core.

In the past we have found multiple points of contention in memcached with multiple threads - not just the listening socket.
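A sketch of how to observe this directly (flags reused from earlier in the thread; EF_STACK_PER_THREAD set purely as an illustration):

# Even with EF_STACK_PER_THREAD=1, a single-listener server such as memcached
# ends up with one stack, which the stack list will show:
EF_STACK_PER_THREAD=1 onload -p latency-af-xdp memcached -m 24576 -c 1024 -t 8 -u root -l 140.82.8.233:11211
onload_stackdump | head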

@shirshen12
Author

I think the above fact is validated by this study.

I will also post some numbers shortly for key = 16 bytes and value = 64 bytes.

@shirshen12
Author

> The trouble with memcached is that it creates a single listening socket (unless it has changed recently), and by that fact alone only a single stack is created, as all the accepted sockets end up in the same stack.

I think the above fact is validated by this study.

> In our recent whitepaper on memcached with Onload (https://china.xilinx.com/publications/resutls/onload-memcached-performance-results.pdf ) we used a separate memcached instance per core.

Yes, and I have also checked the detailed work on memcached benchmarking here. It is a little on the older side.

> In the past we have found multiple points of contention in memcached with multiple threads - not just the listening socket.

@maciejj-xilinx
Contributor

I'd expect this to work in general, as long as there are enough hardware queues set up on the NIC to allow creation of stacks.
There are some limitations with Onload that stop EF_STACK_PER_THREAD from taking effect, like memcached's: a server creating a common listening socket instead of a separate socket in each thread.
With AF_XDP, each thread would need to listen on a separate port, as SO_REUSEPORT is not yet supported with AF_XDP Onload.
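Until SO_REUSEPORT lands, a sketch of the instance-per-core pattern from the whitepaper, adapted to the separate-port constraint (core count, ports and address are examples):

# One single-threaded memcached per core, each on its own port, each in its own stack
for i in 0 1 2 3; do
  taskset -c $i onload -p latency-af-xdp \
    memcached -u root -t 1 -p $((11211 + i)) -l 140.82.8.233 &
done
# Confirm one Onload stack per instance
onload_stackdump | head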

@shirshen12
Author

Any plans on supporting SO_REUSEPORT on AF_XDP for Onload?

@maciejj-xilinx
Contributor

> Any plans on supporting SO_REUSEPORT on AF_XDP for Onload?

We are currently working on it.

@shirshen12
Author

Hello @maciejj-xilinx,

Any status on SO_REUSEPORT on AF_XDP for Onload?

@maciejj-xilinx
Contributor

Hi, we are still figuring out internally how best to share our roadmap.
The feature in question is planned for our next release. Worth noting that this is not a trivial feature.

@shirshen12
Author

Sounds good, thanks for the update @maciejj-xilinx

@h2cw2l

h2cw2l commented Apr 9, 2021

@shirshen12, @maciejj-xilinx
Hello, the NIC and OS information is below. Can Onload run on this device?

NIC:
[root@A03-R05-I139-66-FVP3HP2 onload]# ethtool -i eth0
driver: ixgbe
version: 5.1.0-k-rh8.2.0
firmware-version: 0x8000090c, 18.3.6
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

OS:
[root@A03-R05-I139-66-FVP3HP2 onload]# cat /proc/version
Linux version 4.18.0-240.10.1.el8_3.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Mon Jan 18 17:05:51 UTC 2021
[root@A03-R05-I139-66-FVP3HP2 onload]#

Thank you very much.

@shirshen12
Author

@h2cw2l I think it can. Please see the instructions below for Onload over AF_XDP on Intel 82599 on RHEL 8.x:

yum update -y
yum install binutils gettext gawk gcc sed make bash glibc-common libcap-devel libmnl-devel perl-Test-Harness hmaccalc zlib-devel binutils-devel elfutils-libelf-devel libevent-devel

Remove unused kernels
dnf remove $(dnf repoquery --installonly --latest-limit=-1 -q)

Install latest Intel ixgbe driver

  1. Install build prerequisites: yum groupinstall "Development Tools"
  2. Download the latest driver: wget https://downloadmirror.intel.com/14687/eng/ixgbe-5.9.4.tar.gz
  3. rpmbuild -tb ixgbe-<x.x.x>.tar.gz
  4. cd /root/rpmbuild/RPMS/x86_64 && dnf localinstall ixgbe-*.rpm
  5. rmmod ixgbe; modprobe ixgbe
  6. Verify: ethtool -i enp1s0

Enable hugepages

https://www.golinuxcloud.com/configure-hugepages-vm-nr-hugepages-red-hat-7/

Build and install Onload

git clone https://github.com/Xilinx-CNS/onload.git
cd onload

optional, until Onload fixes its master branch:
git reset --hard e9d90b2

scripts/onload_mkdist --release
cd onload-20201212/scripts/
./onload_install
./onload_tool reload

install XDP tools

yum install clang llvm
dnf --enablerepo=PowerTools install libpcap-devel
cd ~
git clone https://github.com/xdp-project/xdp-tools.git
cd xdp-tools
git submodule update --init
./configure && make && make install
echo enp1s0 > /sys/module/sfc_resource/afxdp/register

enable the flow director

ethtool --features enp1s0 ntuple on

enable port 11211

firewall-cmd --zone=public --permanent --add-service=memcache
firewall-cmd --reload

install python2 for our benchmarking script

yum install python2

set shared memory and hugepage limits

echo 1000000000 > /proc/sys/kernel/shmmax
echo 800 > /proc/sys/vm/nr_hugepages

install screen

dnf install epel-release -y
yum install screen -y

optional stuff

adduser s.chakrabarti
passwd s.chakrabarti
id s.chakrabarti
usermod -aG wheel s.chakrabarti
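
After the steps above, a quick sanity check that AF_XDP zero-copy is actually in effect (commands taken from earlier in this thread; interface, profile and address are examples):

# Start the application under Onload with the AF_XDP zero-copy profile
onload -p latency-af-xdp memcached -m 24576 -c 1024 -t 8 -u root -l <server_ip>:11211
# Confirm a stack was created and the zero-copy knob is set
onload_stackdump lots | grep -E 'EF_XDP_MODE|EF_AF_XDP_ZEROCOPY'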

cns-ci-onload-xilinx pushed a commit that referenced this issue Feb 1, 2023
Deferring oo_exit_hook() fixes a stuck C++ application:

    #0  0x00007fd2d7afb87b in ioctl () from /lib64/libc.so.6
    #1  0x00007fd2d80c0621 in oo_resource_op (cmd=3221510722, io=0x7ffd15be696c, fp=<optimized out>) at /home/iteterev/lab/onload_internal/src/include/onload/mmap.h:104
    #2  __oo_eplock_lock (timeout=<synthetic pointer>, maybe_wedged=0, ni=0x20c8480) at /home/iteterev/lab/onload_internal/src/lib/transport/ip/eplock_slow.c:35
    #3  __ef_eplock_lock_slow (ni=ni@entry=0x20c8480, timeout=timeout@entry=-1, maybe_wedged=maybe_wedged@entry=0) at /home/iteterev/lab/onload_internal/src/lib/transport/ip/eplock_slow.c:72
    #4  0x00007fd2d80d7dbf in ef_eplock_lock (ni=0x20c8480) at /home/iteterev/lab/onload_internal/src/include/onload/eplock.h:61
    #5  __ci_netif_lock_count (stat=0x7fd2d5c5b62c, ni=0x20c8480) at /home/iteterev/lab/onload_internal/src/include/ci/internal/ip_shared_ops.h:79
    #6  ci_tcp_setsockopt (ep=ep@entry=0x20c8460, fd=6, level=level@entry=1, optname=optname@entry=9, optval=optval@entry=0x7ffd15be6acc, optlen=optlen@entry=4) at /home/iteterev/lab/onload_internal/src/lib/transport/ip/tcp_sockopts.c:580
    #7  0x00007fd2d8010da7 in citp_tcp_setsockopt (fdinfo=0x20c8420, level=1, optname=9, optval=0x7ffd15be6acc, optlen=4) at /home/iteterev/lab/onload_internal/src/lib/transport/unix/tcp_fd.c:1594
    #8  0x00007fd2d7fde088 in onload_setsockopt (fd=6, level=1, optname=9, optval=0x7ffd15be6acc, optlen=4) at /home/iteterev/lab/onload_internal/src/lib/transport/unix/sockcall_intercept.c:737
    #9  0x00007fd2d7dcb7dd in ?? ()
    #10 0x00007fd2d83392e0 in ?? () from /home/iteterev/lab/onload_internal/build/gnu_x86_64/lib/transport/unix/libcitransport0.so
    #11 0x000000000060102c in data_start ()
    #12 0x00007fd2d8339540 in ?? () from /home/iteterev/lab/onload_internal/build/gnu_x86_64/lib/transport/unix/libcitransport0.so
    #13 0x00000001d85426c0 in ?? ()
    #14 0x00007fd2d7fcbe08 in ?? ()
    #15 0x00007fd2d7a433c7 in __cxa_finalize () from /lib64/libc.so.6
    #16 0x00007fd2d7dcb757 in ?? ()
    #17 0x00007ffd15be6be0 in ?? ()
    #18 0x00007fd2d834f2a6 in _dl_fini () from /lib64/ld-linux-x86-64.so.2

Here, _fini() is a function that calls all library destructors. The
problem is that _fini() decides to run the C++ library destructor
*after* Onload and makes it operate on an invalid Onload state.

The patch leverages the fact that Glibc sets up _fini() after running
the last library constructor, so by manually installing the exit handler
(instead of providing a library destructor), Onload wins the race with
_fini().

There's still an issue if the user library sets a custom exit handler
with atexit() or on_exit() and makes intercepted system calls from
there.

Tested:

* RHEL 7.9/glibc 2.17
* RHEL 8.2/glibc 2.28
* RHEL 9.1/glibc 2.34

Thanks-to: Richard Hughes <rhughes@xilinx.com>
Thanks-to: Siân James <sian.james@xilinx.com>