Re: [PATCH] net: add initial support for AF_XDP network backend


From: Jason Wang
Subject: Re: [PATCH] net: add initial support for AF_XDP network backend
Date: Thu, 29 Jun 2023 13:25:49 +0800

On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >
> > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> 
> > > > wrote:
> > > > >
> > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> 
> > > > > > wrote:
> > > > > > >
> > > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets 
> > > > > > > > <i.maximets@ovn.org> wrote:
> > > > > > > >>
> > > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang 
> > > > > > > >>> <jasowang@redhat.com> wrote:
> > > > > > > >>>>
> > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets 
> > > > > > > >>>> <i.maximets@ovn.org> wrote:
> > > > > > > >> It is noticeably more performant than a tap with vhost=on in
> > > > > > > >> terms of PPS.  So, that might be one case.  Taking into account
> > > > > > > >> that just the RCU lock and unlock in the virtio-net code takes
> > > > > > > >> more time than a packet copy, some batching on the QEMU side
> > > > > > > >> should improve performance significantly.  And it shouldn't be
> > > > > > > >> too hard to implement.
> > > > > > > >>
> > > > > > > >> Performance over virtual interfaces may potentially be improved
> > > > > > > >> by creating a kernel thread for async Tx, similar to what
> > > > > > > >> io_uring allows.  Currently Tx on non-zero-copy interfaces is
> > > > > > > >> synchronous, and that doesn't scale well.
> > > > > > > >
> > > > > > > > Interestingly, there is actually a lot of "duplication" between
> > > > > > > > io_uring and AF_XDP:
> > > > > > > >
> > > > > > > > 1) both have a similar memory model (user-registered memory)
> > > > > > > > 2) both use rings for communication
> > > > > > > >
> > > > > > > > I wonder if we can let io_uring talk directly to AF_XDP.
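Roughly, the overlap looks like this; a minimal setup-only sketch (not a
working integration), assuming the usual liburing and libxdp headers, with
error handling and the umem alignment/frame-size requirements omitted:

#include <liburing.h>
#include <xdp/xsk.h>        /* <bpf/xsk.h> with older libbpf */
#include <sys/uio.h>

static void register_area_both_ways(void *area, size_t size)
{
    /* io_uring: userspace registers the area as a fixed buffer, then
     * drives I/O through the shared SQ/CQ rings. */
    struct io_uring ring;
    struct iovec iov = { .iov_base = area, .iov_len = size };

    io_uring_queue_init(64, &ring, 0);
    io_uring_register_buffers(&ring, &iov, 1);

    /* AF_XDP: userspace registers a similar area as umem, then drives
     * I/O through the fill/completion (and RX/TX) rings. */
    struct xsk_umem *umem;
    struct xsk_ring_prod fq;
    struct xsk_ring_cons cq;

    xsk_umem__create(&umem, area, size, &fq, &cq, NULL);
}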
> > > > > > >
> > > > > > > Well, if we submit poll() in the QEMU main loop via io_uring, then
> > > > > > > we can avoid the cost of the synchronous Tx for non-zero-copy
> > > > > > > modes, i.e. for virtual interfaces.  The io_uring thread in the
> > > > > > > kernel will be able to perform the transmission for us.
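Something along these lines; a sketch only, assuming 'ring' and the AF_XDP
socket fd already exist (names here are illustrative, not QEMU code):

#include <liburing.h>
#include <poll.h>

static void arm_xsk_poll(struct io_uring *ring, int xsk_fd, void *ctx)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    if (!sqe) {
        return;                      /* SQ full, try again later */
    }
    io_uring_prep_poll_add(sqe, xsk_fd, POLLIN | POLLOUT);
    io_uring_sqe_set_data(sqe, ctx); /* locate our netdev state on completion */
    io_uring_submit(ring);

    /* The Tx "kick" (normally a synchronous sendto(xsk_fd, NULL, 0, ...))
     * could similarly be queued through io_uring so a kernel thread
     * performs the transmission, which is the async Tx idea above. */
}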
> > > > > >
> > > > > > It would be nice if we could use an iothread or vhost rather than
> > > > > > the main loop, even if io_uring can use kthreads.  That way we can
> > > > > > avoid the memory translation cost.
> > > > >
> > > > > The QEMU event loop (AioContext) has io_uring code
> > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > > > > on patches to re-enable it and will probably send them in July. The
> > > > > patches also add an API to submit arbitrary io_uring operations so
> > > > > that you can do stuff besides file descriptor monitoring. Both the
> > > > > main loop and IOThreads will be able to use io_uring on Linux hosts.
> > > >
> > > > Just to make sure I understand: if we still need a copy from the guest
> > > > to an io_uring buffer, we still need to go through the memory API for
> > > > GPA translation, which seems expensive.
> > > >
> > > > Vhost seems to be a shortcut for this.
> > >
> > > I'm not sure how exactly you're thinking of using io_uring.
> > >
> > > Simply using io_uring for the event loop (file descriptor monitoring)
> > > doesn't involve an extra buffer, but the packet payload still needs to
> > > reside in AF_XDP umem, so there is a copy between guest memory and
> > > umem.
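For reference, that copy path looks roughly like this; a sketch with
hypothetical names, assuming the standard xsk helpers from libxdp, where
'pkt_hva' is the packet after device emulation has already translated the
guest address:

#include <xdp/xsk.h>        /* <bpf/xsk.h> with older libbpf */
#include <string.h>

static int tx_one_copy(void *umem_area, struct xsk_ring_prod *tx,
                       __u64 frame_addr, const void *pkt_hva, __u32 len)
{
    __u32 idx;

    if (xsk_ring_prod__reserve(tx, 1, &idx) != 1) {
        return -1;                        /* TX ring full */
    }

    /* The copy in question: guest memory -> umem frame. */
    memcpy(xsk_umem__get_data(umem_area, frame_addr), pkt_hva, len);

    struct xdp_desc *desc = xsk_ring_prod__tx_desc(tx, idx);
    desc->addr = frame_addr;              /* umem-relative offset */
    desc->len = len;

    xsk_ring_prod__submit(tx, 1);
    /* A sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0) kick may still be
     * needed depending on the need-wakeup/busy-poll configuration. */
    return 0;
}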
> >
> > So there would be a translation from GPA to HVA (unless io_uring
> > supports two stages), which needs to go through the QEMU memory core.
> > And this part seems to be very expensive according to my past tests.
>
> Yes, but in the current approach where AF_XDP is implemented as a QEMU
> netdev, there is already QEMU device emulation (e.g. virtio-net)
> happening. So the GPA to HVA translation will happen anyway in device
> emulation.

Just to make sure we're on the same page.

I meant that AF_XDP can do more than, e.g., 10 Mpps. So if we still use
the QEMU netdev, it would be very hard to achieve that if we stick to
the QEMU memory core translations, which have to take care of too much
extra stuff. That's why I suggest using vhost in iothreads, which only
cares about RAM, so the translation could be very fast.
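For comparison, the reason a vhost-style translation can be cheap is that
the backend only sees a small, flat table of RAM regions, roughly like the
sketch below (illustrative names, not QEMU's):

#include <stdint.h>
#include <stddef.h>

struct ram_region {
    uint64_t gpa;       /* guest physical start */
    uint64_t size;
    void    *hva;       /* host virtual start of the mapping */
};

static void *gpa_to_hva(const struct ram_region *regions, size_t n,
                        uint64_t gpa, uint64_t len)
{
    for (size_t i = 0; i < n; i++) {
        const struct ram_region *r = &regions[i];

        if (gpa >= r->gpa && gpa - r->gpa + len <= r->size) {
            return (uint8_t *)r->hva + (gpa - r->gpa);
        }
    }
    return NULL;        /* not plain RAM: would need the slow path */
}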

>
> Are you thinking about AF_XDP passthrough where the guest directly
> interacts with AF_XDP?

This could be another way to solve it, since it wouldn't use QEMU's
memory core to do the translation.

>
> > > If umem encompasses guest memory,
> >
> > It requires pinning all of guest memory, and a GPA to HVA translation
> > is still required.
>
> Ilya mentioned that umem uses relative offsets instead of absolute
> memory addresses. In the AF_XDP passthrough case this means no address
> translation needs to be added to AF_XDP.

I don't see how it can avoid the translation, as AF_XDP works at the
level of HVAs, but what the guest submits are guest PAs or even IOVAs.

What's more, guest memory could be backed by different memory backends,
which means a single umem may not even work.
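To make that concrete: descriptors on the AF_XDP rings carry umem-relative
offsets, while virtio descriptors carry guest addresses, so something still
has to map one to the other. The helper below is hypothetical and only
works if umem covers a single contiguous guest RAM mapping:

#include <linux/if_xdp.h>   /* struct xdp_desc { __u64 addr; __u32 len; __u32 options; } */
#include <stdint.h>

/* Hypothetical: umem registered over one contiguous RAM mapping whose
 * guest physical base is 'ram_gpa_base'.  With several memory backends
 * there is no single base, which is the problem above. */
static inline uint64_t gpa_to_umem_off(uint64_t gpa, uint64_t ram_gpa_base)
{
    return gpa - ram_gpa_base;   /* xdp_desc.addr is relative to the umem area */
}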

>
> Regarding pinning - I wonder if that's something that could be refined
> in the kernel by adding an AF_XDP flag that enables on-demand pinning
> of umem. That way only the rx and tx buffers that are currently in use
> would be pinned. The disadvantage is the runtime overhead of pinning
> and unpinning pages. I'm not sure whether it's possible to implement
> this; I haven't checked the kernel code.

It requires the device to handle page faults, which is not commonly
supported by hardware nowadays.

Thanks

>
> Stefan
>



