qemu-devel

Re: [PATCH] net: add initial support for AF_XDP network backend


From: Stefan Hajnoczi
Subject: Re: [PATCH] net: add initial support for AF_XDP network backend
Date: Thu, 29 Jun 2023 14:35:49 +0200

On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> 
> > > wrote:
> > > >
> > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> 
> > > > > wrote:
> > > > > >
> > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> 
> > > > > > wrote:
> > > > > > >
> > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets 
> > > > > > > <i.maximets@ovn.org> wrote:
> > > > > > > >
> > > > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets 
> > > > > > > > > <i.maximets@ovn.org> wrote:
> > > > > > > > >>
> > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang 
> > > > > > > > >>> <jasowang@redhat.com> wrote:
> > > > > > > > >>>>
> > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets 
> > > > > > > > >>>> <i.maximets@ovn.org> wrote:
> > > > > > > > >> It is noticeably more performant than a tap with vhost=on
> > > > > > > > >> in terms of PPS. So, that might be one case. Taking into
> > > > > > > > >> account that just the RCU lock and unlock in virtio-net
> > > > > > > > >> code takes more time than a packet copy, some batching on
> > > > > > > > >> the QEMU side should improve performance significantly.
> > > > > > > > >> And it shouldn't be too hard to implement.
> > > > > > > > >>
> > > > > > > > >> Performance over virtual interfaces may potentially be
> > > > > > > > >> improved by creating a kernel thread for async Tx,
> > > > > > > > >> similarly to what io_uring allows. Currently, Tx on
> > > > > > > > >> non-zero-copy interfaces is synchronous, and that doesn't
> > > > > > > > >> scale well.
> > > > > > > > >
> > > > > > > > > Interestingly, there is actually a lot of "duplication"
> > > > > > > > > between io_uring and AF_XDP:
> > > > > > > > >
> > > > > > > > > 1) both have a similar memory model (user-registered memory)
> > > > > > > > > 2) both use ring for communication
> > > > > > > > >
> > > > > > > > > I wonder if we can let io_uring talk directly to AF_XDP.
> > > > > > > >
> > > > > > > > Well, if we submit poll() in the QEMU main loop via io_uring,
> > > > > > > > then we can avoid the cost of synchronous Tx for non-zero-copy
> > > > > > > > modes, i.e. for virtual interfaces. The io_uring thread in the
> > > > > > > > kernel will be able to perform the transmission for us.
> > > > > > >
> > > > > > > It would be nice if we could use an iothread/vhost rather than
> > > > > > > the main loop, even if io_uring can use kthreads. We could avoid
> > > > > > > the memory translation cost.
> > > > > >
> > > > > > The QEMU event loop (AioContext) has io_uring code
> > > > > > (util/fdmon-io_uring.c), but it's disabled at the moment. I'm
> > > > > > working on patches to re-enable it and will probably send them in
> > > > > > July. The patches also add an API to submit arbitrary io_uring
> > > > > > operations so that you can do stuff besides file descriptor
> > > > > > monitoring. Both the main loop and IOThreads will be able to use
> > > > > > io_uring on Linux hosts.
> > > > >
> > > > > Just to make sure I understand. If we still need a copy from the
> > > > > guest to an io_uring buffer, we still need to go via the memory API
> > > > > for GPA translation, which seems expensive.
> > > > >
> > > > > Vhost seems to be a shortcut for this.
> > > >
> > > > I'm not sure how exactly you're thinking of using io_uring.
> > > >
> > > > Simply using io_uring for the event loop (file descriptor monitoring)
> > > > doesn't involve an extra buffer, but the packet payload still needs to
> > > > reside in AF_XDP umem, so there is a copy between guest memory and
> > > > umem.
> > >
> > > So there would be a translation from GPA to HVA (unless io_uring
> > > supports two stages) which needs to go via the QEMU memory core. And
> > > this part seems to be very expensive according to my tests in the past.
> >
> > Yes, but in the current approach where AF_XDP is implemented as a QEMU
> > netdev, there is already QEMU device emulation (e.g. virtio-net)
> > happening. So the GPA to HVA translation will happen anyway in device
> > emulation.
>
> Just to make sure we're on the same page.
>
> I meant, AF_XDP can do more than e.g. 10 Mpps. So if we still use the
> QEMU netdev, it would be very hard to achieve that if we stick to
> using the QEMU memory core translations, which need to take care of
> too much extra stuff. That's why I suggest using vhost in iothreads,
> which only cares about RAM, so the translation could be very fast.

What does using "vhost in io threads" mean? Is that a vhost kernel
approach where userspace dedicates threads (the stuff that Mike
Christie has been working on)? I haven't looked at how Mike's recent
patches work, but I wouldn't call that approach QEMU IOThreads,
because the threads probably don't run the AioContext event loop and
instead execute vhost kernel code the entire time.

But despite these questions, I think I'm beginning to understand that
you're proposing a vhost_net.ko AF_XDP implementation instead of a
userspace QEMU AF_XDP netdev implementation. I wonder if any
optimizations can be made when the AF_XDP user is kernel code instead
of userspace code.

> >
> > Are you thinking about AF_XDP passthrough where the guest directly
> > interacts with AF_XDP?
>
> This could be another way to solve it, since it won't use QEMU's
> memory core to do the translation.
>
> >
> > > > If umem encompasses guest memory,
> > >
> > > It requires you to pin the whole guest memory and a GPA to HVA
> > > translation is still required.
> >
> > Ilya mentioned that umem uses relative offsets instead of absolute
> > memory addresses. In the AF_XDP passthrough case this means no address
> > translation needs to be added to AF_XDP.
>
> I don't see how it can avoid the translations, as it works at the level
> of HVAs. But what guests submit are PAs or even IOVAs.

In a passthrough scenario the guest is doing AF_XDP, so it writes
relative umem offsets, thereby eliminating address translation
concerns (the addresses are not PAs or IOVAs). However, this approach
probably won't work well with memory hotplug - or at least it will end
up needing a memory translation mechanism in order to support memory
hotplug.

>
> What's more, guest memory could be backed by different memory
> backends, which means a single umem may not even work.

Maybe. I don't know the nature of umem. If there can be multiple vmas
in the umem range, then there should be no issue mixing different
memory backends.

>
> >
> > Regarding pinning - I wonder if that's something that can be refined
> > in the kernel by adding an AF_XDP flag that enables on-demand pinning
> > of umem. That way only rx and tx buffers that are currently in use
> > will be pinned. The disadvantage is the runtime overhead to pin/unpin
> > pages. I'm not sure whether it's possible to implement this, I haven't
> > checked the kernel code.
>
> It requires the device to handle page faults, which is not commonly
> supported nowadays.

I don't understand this comment. AF_XDP processes each rx/tx
descriptor. At that point it can call get_user_pages() or similar in
order to pin the pages. When the memory is no longer needed, it can
put those pages. No fault mechanism is needed. What am I missing?

Stefan


