Re: [PATCH] net: add initial support for AF_XDP network backend


From: Jason Wang
Subject: Re: [PATCH] net: add initial support for AF_XDP network backend
Date: Mon, 26 Jun 2023 14:32:59 +0800

On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >
> > AF_XDP is a network socket family that allows communication directly
> > with the network device driver in the kernel, bypassing most or all
> > of the kernel networking stack.  In essence, the technology is
> > pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
> > and works with any network interfaces without driver modifications.
> > Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
> > require access to character devices or unix sockets.  Only access to
> > the network interface itself is necessary.
> >
> > This patch implements a network backend that communicates with the
> > kernel by creating an AF_XDP socket.  A chunk of userspace memory
> > is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
> > Fill and Completion) are placed in that memory along with a pool of
> > memory buffers for the packet data.  Data transmission is done by
> > allocating one of the buffers, copying packet data into it and
> > placing the pointer into the Tx ring.  After transmission, the device
> > will return the buffer via the Completion ring.  On Rx, the device
> > will take a buffer from the pre-populated Fill ring, write the packet
> > data into it and place the buffer into the Rx ring.
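
For reference, the Tx half of that flow, written against the xsk_*
helpers from libbpf/libxdp, looks roughly like the sketch below.  This
only illustrates the ring protocol and is not the patch's code; the
function name and the way the free UMEM buffer address is obtained are
made up.

    /* Sketch: transmit one packet over an AF_XDP socket (copy mode).
     * Assumes 'buf_addr' is a free UMEM chunk handed out by some
     * allocator that recycles addresses from the Completion ring. */
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <xdp/xsk.h>             /* <bpf/xsk.h> with older libbpf */

    static int xsk_send_one(struct xsk_socket *xsk, struct xsk_ring_prod *tx,
                            void *umem_area, uint64_t buf_addr,
                            const void *pkt, uint32_t len)
    {
        uint32_t idx;

        if (xsk_ring_prod__reserve(tx, 1, &idx) != 1) {
            return -1;                            /* Tx ring is full */
        }

        /* Copy the packet into the shared UMEM buffer... */
        memcpy(xsk_umem__get_data(umem_area, buf_addr), pkt, len);

        /* ...and publish its address/length on the Tx ring. */
        struct xdp_desc *desc = xsk_ring_prod__tx_desc(tx, idx);
        desc->addr = buf_addr;
        desc->len  = len;
        xsk_ring_prod__submit(tx, 1);

        /* Kick the kernel; in non-zero-copy modes transmission happens
         * synchronously in this syscall.  The buffer comes back later
         * through the Completion ring. */
        sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
        return 0;
    }

The Rx direction mirrors this with xsk_ring_cons__peek() and
xsk_ring_cons__release() on the Rx ring plus refilling the Fill ring.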
> >
> > The AF_XDP network backend handles the communication with the host
> > kernel and the network interface and forwards packets to/from the
> > peer device in QEMU.
> >
> > Usage example:
> >
> >   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
> >   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
> >
> > An XDP program bridges the socket with a network interface.  It can be
> > attached to the interface in 2 different modes:
> >
> > 1. skb - this mode should work for any interface and doesn't require
> >          driver support, with the caveat of lower performance.
> >
> > 2. native - this mode requires driver support and allows bypassing
> >             skb allocation in the kernel and potentially using
> >             zero-copy while getting packets in/out of userspace.
> >
> > By default, QEMU will try to use native mode and fall back to skb.
> > The mode can be forced via the 'mode' option.  To force 'copy' even
> > in native mode, use the 'force-copy=on' option.  This might be useful
> > if there is some issue with the driver.
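
For context, with the libbpf/libxdp helpers these knobs typically end
up as flags in the socket configuration.  A rough sketch (the constants
are the kernel/libbpf names; this is not necessarily how the patch
wires it up):

    #include <linux/if_link.h>  /* XDP_FLAGS_DRV_MODE, XDP_FLAGS_SKB_MODE */
    #include <linux/if_xdp.h>   /* XDP_COPY, XDP_ZEROCOPY */
    #include <xdp/xsk.h>

    struct xsk_socket_config cfg = {
        .rx_size    = XSK_RING_CONS__DEFAULT_NUM_DESCS,
        .tx_size    = XSK_RING_PROD__DEFAULT_NUM_DESCS,
        /* 'mode=native' vs 'mode=skb': how the XDP program is attached. */
        .xdp_flags  = XDP_FLAGS_DRV_MODE,   /* or XDP_FLAGS_SKB_MODE */
        /* 'force-copy=on' corresponds to forcing copy in native mode. */
        .bind_flags = XDP_COPY,             /* or XDP_ZEROCOPY */
    };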
> >
> > The 'queues=N' option allows specifying how many device queues should
> > be opened.  Note that all the queues that are not open are still
> > functional and can receive traffic, but it will not be delivered to
> > QEMU.  So, the number of device queues should generally match the
> > QEMU configuration, unless the device is shared with something
> > else and the traffic redirection to the appropriate queues is correctly
> > configured at the device level (e.g. with ethtool -N).
> > The 'start-queue=M' option can be used to specify from which queue id
> > QEMU should start configuring 'N' queues.  It might also be necessary
> > to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
> > for examples.
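
As an illustration (the interface name, MAC and queue numbers are made
up, and ntuple-filter syntax/support varies by NIC), sharing the device
while steering the guest's traffic to the queue QEMU opened could look
like:

  ethtool -N ens6f1np1 flow-type ether dst 00:16:35:AF:AA:5C action 1
  -netdev af-xdp,ifname=ens6f1np1,id=guest1,queues=1,start-queue=1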
> >
> > In the general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
> > capabilities in order to load the default XSK/XDP programs to the
> > network interface and configure BTF maps.
>
> I think you mean "BPF" actually?
>
> >  It is possible, however,
> > to run only with CAP_NET_RAW.
>
> QEMU often runs without any privileges, so we need to address that.
>
> I think adding support for SCM_RIGHTS via the monitor would be the way to go.
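
For reference, passing a descriptor over a unix socket with SCM_RIGHTS
is the usual ancillary-data dance; a minimal, generic sketch (not a
proposal for the actual monitor command) is:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send one file descriptor over a connected AF_UNIX socket. */
    static int send_fd(int sock, int fd)
    {
        char dummy = 'x';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
            struct cmsghdr align;
            char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = {
            .msg_iov        = &iov,
            .msg_iovlen     = 1,
            .msg_control    = u.buf,
            .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type  = SCM_RIGHTS;
        cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
    }

QEMU's existing getfd/add-fd monitor commands already receive
descriptors this way, so the backend side would mostly need a way to
look an fd up instead of taking a raw number.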
>
>
> > For that to work, an external process
> > with admin capabilities will need to pre-load the default XSK program
> > and pass an open file descriptor for this program's 'xsks_map' to the
> > QEMU process on startup.  The network backend will need to be
> > configured with 'inhibit=on' to avoid loading the programs.  The file
> > descriptor for 'xsks_map' can be passed via the 'xsks-map-fd=N' option.
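
i.e. with the options named above, the unprivileged invocation would
presumably look something like (the fd number is made up):

  -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,inhibit=on,xsks-map-fd=21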
> >
> > There are a few performance challenges with the current network backends.
> >
> > First is that they do not support IO threads.
>
> The current networking code needs some major refactoring to support IO
> threads, which I'm not sure is worthwhile.
>
> > This means that the data
> > path is handled by the main thread in QEMU and may slow down other
> > work or be slowed down by other work.  This also means that
> > taking advantage of multi-queue is generally not possible today.
> >
> > Another issue is that the data path goes through the device emulation
> > code, which is not really optimized for performance.  The fastest
> > "frontend" device is virtio-net.  But it's not optimized for heavy
> > traffic either, because it expects such use-cases to be handled via
> > some implementation of vhost (user, kernel, vdpa).  In practice, we
> > have virtio notifications and rcu lock/unlock on a per-packet basis
> > and not very efficient accesses to guest memory.  Communication
> > channels between backend and frontend devices also do not allow
> > passing more than one packet at a time.
> >
> > Some of these challenges can be avoided in the future by adding better
> > batching into device emulation or by implementing vhost-af-xdp variant.
>
> It might require registering (pinning) the whole guest memory with XSK,
> or there would be a copy.  Both of them are sub-optimal.
>
> A really interesting project would be AF_XDP passthrough; then we
> wouldn't need to care about pinning and copying, and we would get ultra
> speed in the guest. (But again, it might need BPF support in virtio-net.)
>
> >
> > There are also a few kernel limitations.  AF_XDP sockets do not
> > support any kinds of checksum or segmentation offloading.  Buffers
> > are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
> > support is not implemented for AF_XDP today.  Also, transmission in
> > all non-zero-copy modes is synchronous, i.e. done in a syscall.
> > That doesn't allow high packet rates on virtual interfaces.
> >
> > However, keeping in mind all of these challenges, the current
> > implementation of the AF_XDP backend shows decent performance while
> > running on top of a physical NIC with zero-copy support.
> >
> > Test setup:
> >
> > 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
> > Network backend is configured to open the NIC directly in native mode.
> > The driver supports zero-copy.  NIC is configured to use 1 queue.
> >
> > Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
> > for PPS testing.
> >
> > iperf3 result:
> >  TCP stream      : 19.1 Gbps
> >
> > dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
> >  Tx only         : 3.4 Mpps
> >  Rx only         : 2.0 Mpps
> >  L2 FWD Loopback : 1.5 Mpps
>
> I don't object to merging this backend (considering we've already
> merged netmap) once the code is fine, but the numbers are not amazing,
> so I wonder what the use case for this backend is.

A more ambitious approach would be to reuse DPDK via dedicated threads;
then we could reuse any of its PMDs, such as AF_XDP.

Thanks

>
> Thanks



