qemu-devel

Re: [PULL 12/17] net: add initial support for AF_XDP network backend

From:	Daniel P . Berrangé
Subject:	Re: [PULL 12/17] net: add initial support for AF_XDP network backend
Date:	Fri, 8 Sep 2023 12:48:09 +0100
User-agent:	Mutt/2.2.9 (2022-11-12)

On Fri, Sep 08, 2023 at 02:45:02PM +0800, Jason Wang wrote:
> From: Ilya Maximets <i.maximets@ovn.org>
> 
> AF_XDP is a network socket family that allows communication directly
> with the network device driver in the kernel, bypassing most or all
> of the kernel networking stack.  In essence, the technology is
> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
> and works with any network interfaces without driver modifications.
> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
> require access to character devices or unix sockets.  Only access to
> the network interface itself is necessary.
> 
> This patch implements a network backend that communicates with the
> kernel by creating an AF_XDP socket.  A chunk of userspace memory
> is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
> Fill and Completion) are placed in that memory along with a pool of
> memory buffers for the packet data.  Data transmission is done by
> allocating one of the buffers, copying packet data into it and
> placing the pointer into Tx ring.  After transmission, the device will
> return the buffer via Completion ring.  On Rx, the device will take
> a buffer from a pre-populated Fill ring, write the packet data into
> it and place the buffer into Rx ring.
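
The buffer life cycle described above can be modelled with a toy sketch. This is not the QEMU or libxdp implementation: the class and method names are hypothetical, Python deques stand in for the shared-memory rings, and the "device" side is simulated in the same process.

```python
from collections import deque


class XskModel:
    """Toy model of the four AF_XDP rings over one shared buffer pool."""

    def __init__(self, n_buffers=8, buf_size=4096):
        self.buf_size = buf_size
        self.umem = [bytearray(buf_size) for _ in range(n_buffers)]
        self.free = deque(range(n_buffers))  # buffers held by userspace
        self.tx = deque()          # userspace -> device: (index, length)
        self.completion = deque()  # device -> userspace: sent buffer indices
        self.fill = deque()        # userspace -> device: empty Rx buffers
        self.rx = deque()          # device -> userspace: (index, length)

    def _copy_in(self, idx, packet):
        if len(packet) > self.buf_size:
            raise ValueError("packet does not fit into one buffer")
        self.umem[idx][:len(packet)] = packet

    def send(self, packet):
        """Allocate a buffer, copy the packet in, queue it on the Tx ring."""
        idx = self.free.popleft()
        self._copy_in(idx, packet)
        self.tx.append((idx, len(packet)))

    def device_transmit(self):
        """Simulated device: drain Tx, hand buffers back via Completion."""
        sent = []
        while self.tx:
            idx, length = self.tx.popleft()
            sent.append(bytes(self.umem[idx][:length]))
            self.completion.append(idx)
        return sent

    def reclaim(self):
        """Return completed Tx buffers to the free pool."""
        while self.completion:
            self.free.append(self.completion.popleft())

    def refill(self, count):
        """Pre-populate the Fill ring with up to `count` free buffers."""
        for _ in range(min(count, len(self.free))):
            self.fill.append(self.free.popleft())

    def device_receive(self, packet):
        """Simulated device: take a Fill buffer, write data, post it on Rx."""
        idx = self.fill.popleft()
        self._copy_in(idx, packet)
        self.rx.append((idx, len(packet)))

    def recv(self):
        """Consume one Rx entry and release its buffer to the free pool."""
        idx, length = self.rx.popleft()
        data = bytes(self.umem[idx][:length])
        self.free.append(idx)
        return data
```

In this model a buffer is always owned by exactly one of the free pool or the four rings, which is the invariant the real rings have to maintain as well.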
> 
> AF_XDP network backend takes on the communication with the host
> kernel and the network interface and forwards packets to/from the
> peer device in QEMU.
> 
> Usage example:
> 
>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
> 
> XDP program bridges the socket with a network interface.  It can be
> attached to the interface in 2 different modes:
> 
> 1. skb - this mode should work for any interface and doesn't require
>          driver support.  With a caveat of lower performance.
> 
> 2. native - this does require support from the driver and allows QEMU to
>             bypass skb allocation in the kernel and potentially use
>             zero-copy while getting packets in/out userspace.
> 
> By default, QEMU will try to use native mode and fall back to skb.
> Mode can be forced via 'mode' option.  To force 'copy' even in native
> mode, use 'force-copy=on' option.  This might be useful if there is
> some issue with the driver.
> 
> Option 'queues=N' specifies how many device queues should
> be open.  Note that all the queues that are not open are still
> functional and can receive traffic, but it will not be delivered to
> QEMU.  So, the number of device queues should generally match the
> QEMU configuration, unless the device is shared with something
> else and the traffic re-direction to appropriate queues is correctly
> configured on a device level (e.g. with ethtool -N).
> 'start-queue=M' option can be used to specify from which queue id
> QEMU should start configuring 'N' queues.  It might also be necessary
> to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
> for examples.
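
To make the relationship between 'queues=N' and 'start-queue=M' concrete, here is a small sketch. It assumes that the opened queues are the contiguous ids M..M+N-1 and that the default start queue is 0; both helper functions are hypothetical and only assemble an option string from the option names mentioned in the commit message.

```python
def queue_ids(queues, start_queue=0):
    """Device queue ids that would be opened (assumed contiguous)."""
    if queues < 1 or start_queue < 0:
        raise ValueError("queues must be >= 1 and start_queue >= 0")
    return list(range(start_queue, start_queue + queues))


def netdev_arg(ifname, netdev_id, queues, start_queue=0):
    """Build the value for a -netdev option from the pieces above."""
    return (f"af-xdp,ifname={ifname},id={netdev_id},"
            f"queues={queues},start-queue={start_queue}")
```

For example, `queue_ids(4, start_queue=2)` gives `[2, 3, 4, 5]`. Traffic arriving on any other device queue would not be delivered to QEMU, so the NIC's steering rules have to agree with this list.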
> 
> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
> or CAP_BPF capabilities in order to load default XSK/XDP programs to
> the network interface and configure BPF maps.  It is possible, however,
> to run with no capabilities.  For that to work, an external process
> with enough capabilities will need to pre-load default XSK program,
> create AF_XDP sockets and pass their file descriptors to QEMU process
> on startup via 'sock-fds' option.  Network backend will need to be
> configured with 'inhibit=on' to avoid loading of the program.
> QEMU will need 32 MB of locked memory (RLIMIT_MEMLOCK) per queue
> or CAP_IPC_LOCK.
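
A launcher script could sanity-check the locked-memory requirement before starting QEMU. The sketch below assumes that "32 MB" means 32 MiB and only looks at the soft RLIMIT_MEMLOCK value; it does not detect CAP_IPC_LOCK, which would make the limit irrelevant. The function names are hypothetical.

```python
import resource

BYTES_PER_QUEUE = 32 * 1024 * 1024  # assumption: 32 MB read as 32 MiB


def required_memlock(queues):
    """Locked memory needed for the given number of queues, in bytes."""
    return queues * BYTES_PER_QUEUE


def memlock_sufficient(queues, soft_limit=None):
    """True if the soft RLIMIT_MEMLOCK covers the requirement."""
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= required_memlock(queues)
```

Calling `memlock_sufficient(queues)` without a second argument reads the limit of the current process, so it needs to run in the same context QEMU will be started from.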
> 
> There are a few performance challenges with the current network backends.
> 
> First is that they do not support IO threads.  This means that data
> path is handled by the main thread in QEMU and may slow down other
> work or may be slowed down by some other work.  This also means that
> taking advantage of multi-queue is generally not possible today.
> 
> Another thing is that data path is going through the device emulation
> code, which is not really optimized for performance.  The fastest
> "frontend" device is virtio-net.  But it's not optimized for heavy
> traffic either, because it expects such use-cases to be handled via
> some implementation of vhost (user, kernel, vdpa).  In practice, we
> have virtio notifications and rcu lock/unlock on a per-packet basis
> and not very efficient accesses to the guest memory.  Communication
> channels between backend and frontend devices do not allow passing
> more than one packet at a time as well.
> 
> Some of these challenges can be avoided in the future by adding better
> batching into device emulation or by implementing vhost-af-xdp variant.
> 
> There are also a few kernel limitations.  AF_XDP sockets do not
> support any kinds of checksum or segmentation offloading.  Buffers
> are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
> support implementation for AF_XDP is in progress, but not ready yet.
> Also, transmission in all non-zero-copy modes is synchronous, i.e.
> done in a syscall.  That doesn't allow high packet rates on virtual
> interfaces.
> 
> However, keeping in mind all of these challenges, current implementation
> of the AF_XDP backend shows a decent performance while running on top
> of a physical NIC with zero-copy support.
> 
> Test setup:
> 
> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
> Network backend is configured to open the NIC directly in native mode.
> The driver supports zero-copy.  NIC is configured to use 1 queue.
> 
> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
> for PPS testing.
> 
> iperf3 result:
>  TCP stream      : 19.1 Gbps
> 
> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>  Tx only         : 3.4 Mpps
>  Rx only         : 2.0 Mpps
>  L2 FWD Loopback : 1.5 Mpps
> 
> In skb mode the same setup shows much lower performance, similar to
> the setup where pair of physical NICs is replaced with veth pair:
> 
> iperf3 result:
>   TCP stream      : 9 Gbps
> 
> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>   Tx only         : 1.2 Mpps
>   Rx only         : 1.0 Mpps
>   L2 FWD Loopback : 0.7 Mpps
> 
> Results in skb mode or over the veth are close to results of a tap
> backend with vhost=on and disabled segmentation offloading bridged
> with a NIC.


> diff --git a/tests/docker/dockerfiles/debian-amd64.docker b/tests/docker/dockerfiles/debian-amd64.docker
> index 02262bc..811a7fe 100644
> --- a/tests/docker/dockerfiles/debian-amd64.docker
> +++ b/tests/docker/dockerfiles/debian-amd64.docker
> @@ -98,6 +98,7 @@ RUN export DEBIAN_FRONTEND=noninteractive && \
>                        libvirglrenderer-dev \
>                        libvte-2.91-dev \
>                        libxen-dev \
> +                      libxdp-dev \
>                        libzstd-dev \
>                        llvm \
>                        locales \

As the comment at the top of the file states, this file is auto-generated
by lcitool and must not be hand-edited like this.

Check out docs/devel/testing.rst which has guidance on the process
for adding new package deps with lcitool/libvirt-ci.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
