Re: [PATCH] net: add initial support for AF_XDP network backend


From: Jason Wang
Subject: Re: [PATCH] net: add initial support for AF_XDP network backend
Date: Mon, 10 Jul 2023 11:51:35 +0800

On Fri, Jul 7, 2023 at 7:21 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 7/7/23 03:43, Jason Wang wrote:
> > On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>
> >> On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote:
> >>>
> >>> On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>
> >>>> On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>
> >>>>> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> 
> >>>>> wrote:
> >>>>>>
> >>>>>> On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>
> >>>>>>> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> 
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi 
> >>>>>>>>> <stefanha@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> 
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi 
> >>>>>>>>>>> <stefanha@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> 
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets 
> >>>>>>>>>>>>> <i.maximets@ovn.org> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 6/27/23 04:54, Jason Wang wrote:
> >>>>>>>>>>>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets 
> >>>>>>>>>>>>>>> <i.maximets@ovn.org> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 6/26/23 08:32, Jason Wang wrote:
> >>>>>>>>>>>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang 
> >>>>>>>>>>>>>>>>> <jasowang@redhat.com> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets 
> >>>>>>>>>>>>>>>>>> <i.maximets@ovn.org> wrote:
> >>>>>>>>>>>>>>>> It is noticeably more performant than a tap with vhost=on in 
> >>>>>>>>>>>>>>>> terms of PPS.
> >>>>>>>>>>>>>>>> So, that might be one case.  Taking into account that just the
> >>>>>>>>>>>>>>>> RCU lock and unlock in the virtio-net code takes more time than
> >>>>>>>>>>>>>>>> a packet copy, some batching on the QEMU side should improve
> >>>>>>>>>>>>>>>> performance significantly.  And it shouldn't be too hard to
> >>>>>>>>>>>>>>>> implement.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Performance over virtual interfaces may potentially be improved
> >>>>>>>>>>>>>>>> by creating a kernel thread for async Tx, similarly to what
> >>>>>>>>>>>>>>>> io_uring allows.  Currently Tx on non-zero-copy interfaces is
> >>>>>>>>>>>>>>>> synchronous, and that doesn't scale well.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Interestingly, there is actually a lot of "duplication" between
> >>>>>>>>>>>>>>> io_uring and AF_XDP:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 1) both have a similar memory model (user-registered memory)
> >>>>>>>>>>>>>>> 2) both use rings for communication
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I wonder if we can let io_uring talk directly to AF_XDP.
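
For context, the two setups really do look alike: userspace hands the kernel a
memory area up front and then talks to it over mmap'ed rings.  A minimal sketch
with liburing and libxdp (illustrative helper name; error handling and umem
size/alignment constraints omitted):

    #include <liburing.h>
    #include <xdp/xsk.h>   /* libxdp; older installs ship <bpf/xsk.h> */
    #include <sys/uio.h>

    /* Both APIs start from a user-allocated area that the kernel maps/pins,
     * and both expose producer/consumer rings shared with userspace. */
    static void register_area_both_ways(void *area, size_t size)
    {
        /* io_uring: SQ/CQ rings are mmap'ed and the area becomes a fixed buffer. */
        struct io_uring ring;
        struct iovec iov = { .iov_base = area, .iov_len = size };

        io_uring_queue_init(64, &ring, 0);
        io_uring_register_buffers(&ring, &iov, 1);

        /* AF_XDP: the same kind of area becomes umem, with fill/completion
         * rings likewise shared between kernel and userspace. */
        struct xsk_umem *umem;
        struct xsk_ring_prod fill;
        struct xsk_ring_cons comp;

        xsk_umem__create(&umem, area, size, &fill, &comp, NULL);

        /* Teardown (io_uring_queue_exit(), xsk_umem__delete()) omitted. */
    }
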
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Well, if we submit poll() in the QEMU main loop via io_uring, then
> >>>>>>>>>>>>>> we can avoid the cost of the synchronous Tx for non-zero-copy
> >>>>>>>>>>>>>> modes, i.e. for virtual interfaces.  The io_uring thread in the
> >>>>>>>>>>>>>> kernel will be able to perform the transmission for us.
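
One way to picture that: queue the AF_XDP TX kick on io_uring instead of issuing
it synchronously.  A hedged liburing sketch (hypothetical helper; whether the
kernel-side poll of an AF_XDP socket does enough of the Tx work to pay off would
need measuring):

    #include <liburing.h>
    #include <poll.h>

    /* Ask io_uring to poll the xsk fd for writability instead of calling
     * poll()/sendto() inline; the completion is produced by kernel context. */
    static int queue_async_tx_kick(struct io_uring *ring, int xsk_fd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
            return -1;                /* submission queue is full */
        io_uring_prep_poll_add(sqe, xsk_fd, POLLOUT);
        return io_uring_submit(ring); /* completion is reaped from the CQ later */
    }
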
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It would be nice if we could use an iothread/vhost rather than the
> >>>>>>>>>>>>> main loop, even if io_uring can use kthreads. That way we can avoid
> >>>>>>>>>>>>> the memory translation cost.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The QEMU event loop (AioContext) has io_uring code
> >>>>>>>>>>>> (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm 
> >>>>>>>>>>>> working
> >>>>>>>>>>>> on patches to re-enable it and will probably send them in July. 
> >>>>>>>>>>>> The
> >>>>>>>>>>>> patches also add an API to submit arbitrary io_uring operations 
> >>>>>>>>>>>> so
> >>>>>>>>>>>> that you can do stuff besides file descriptor monitoring. Both 
> >>>>>>>>>>>> the
> >>>>>>>>>>>> main loop and IOThreads will be able to use io_uring on Linux 
> >>>>>>>>>>>> hosts.
> >>>>>>>>>>>
> >>>>>>>>>>> Just to make sure I understand: if we still need a copy from the
> >>>>>>>>>>> guest to an io_uring buffer, we still need to go via the memory API
> >>>>>>>>>>> for the GPA translation, which seems expensive.
> >>>>>>>>>>>
> >>>>>>>>>>> Vhost seems to be a shortcut for this.
> >>>>>>>>>>
> >>>>>>>>>> I'm not sure how exactly you're thinking of using io_uring.
> >>>>>>>>>>
> >>>>>>>>>> Simply using io_uring for the event loop (file descriptor 
> >>>>>>>>>> monitoring)
> >>>>>>>>>> doesn't involve an extra buffer, but the packet payload still 
> >>>>>>>>>> needs to
> >>>>>>>>>> reside in AF_XDP umem, so there is a copy between guest memory and
> >>>>>>>>>> umem.
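
That copy is inherent to copy-mode TX: the descriptor only carries a umem
offset, so the payload has to be memcpy'd into a umem frame first.  A minimal
libxdp sketch (hypothetical helper; assumes an already-created socket and a free
frame at frame_addr):

    #include <string.h>
    #include <sys/socket.h>
    #include <xdp/xsk.h>

    static int af_xdp_tx_copy(struct xsk_ring_prod *tx, void *umem_area,
                              __u64 frame_addr, int xsk_fd,
                              const void *pkt, __u32 len)
    {
        __u32 idx;

        /* Reserve one TX descriptor; bail out if the ring is full. */
        if (xsk_ring_prod__reserve(tx, 1, &idx) != 1)
            return -1;

        /* The copy in question: the payload must live inside the registered
         * umem, so data coming from guest memory is copied into a frame. */
        memcpy(xsk_umem__get_data(umem_area, frame_addr), pkt, len);

        struct xdp_desc *desc = xsk_ring_prod__tx_desc(tx, idx);
        desc->addr = frame_addr;
        desc->len = len;

        xsk_ring_prod__submit(tx, 1);

        /* Kick the kernel; in copy mode this sendto() performs the actual
         * (synchronous) transmission discussed earlier in the thread. */
        return sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
    }
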
> >>>>>>>>>
> >>>>>>>>> So there would be a translation from GPA to HVA (unless io_uring
> >>>>>>>>> supports 2 stages) which needs to go via the QEMU memory core. And
> >>>>>>>>> this part seems to be very expensive according to my tests in the
> >>>>>>>>> past.
> >>>>>>>>
> >>>>>>>> Yes, but in the current approach where AF_XDP is implemented as a 
> >>>>>>>> QEMU
> >>>>>>>> netdev, there is already QEMU device emulation (e.g. virtio-net)
> >>>>>>>> happening. So the GPA to HVA translation will happen anyway in device
> >>>>>>>> emulation.
> >>>>>>>
> >>>>>>> Just to make sure we're on the same page.
> >>>>>>>
> >>>>>>> I meant, AF_XDP can do more than e.g. 10 Mpps. So if we still use the
> >>>>>>> QEMU netdev, it would be very hard to achieve that if we stick to
> >>>>>>> using the QEMU memory core translations, which need to take care of
> >>>>>>> too much extra stuff. That's why I suggest using vhost in IOThreads,
> >>>>>>> which only cares about RAM, so the translation can be very fast.
> >>>>>>
> >>>>>> What does using "vhost in io threads" mean?
> >>>>>
> >>>>> It means a vhost userspace dataplane that is implemented via io threads.
> >>>>
> >>>> AFAIK this does not exist today. QEMU's built-in devices that use
> >>>> IOThreads don't use vhost code. QEMU vhost code is for vhost kernel,
> >>>> vhost-user, or vDPA but not built-in devices that use IOThreads. The
> >>>> built-in devices implement VirtioDeviceClass callbacks directly and
> >>>> use AioContext APIs to run in IOThreads.
> >>>
> >>> Yes.
> >>>
> >>>>
> >>>> Do you have an idea for using vhost code for built-in devices? Maybe
> >>>> it's fastest if you explain your idea and its advantages instead of me
> >>>> guessing.
> >>>
> >>> It's something like what I'd proposed in [1]:
> >>>
> >>> 1) a vhost that is implemented via IOThreads
> >>> 2) memory translation is done via a vhost memory table/IOTLB
> >>>
> >>> The advantages are:
> >>>
> >>> 1) no third application like a DPDK application
> >>> 2) the attack surface is reduced
> >>> 3) better understanding of / interaction with the device model for things
> >>> like RSS and the IOMMU
> >>>
> >>> There could be some disadvantages, but they're not obvious to me :)
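
The performance argument behind (2) is that a vhost-style memory table is a
short, flat list of RAM-only regions, so GPA-to-HVA translation is a few
compares with no MMIO/ROM/P2P handling.  A minimal sketch of that idea
(hypothetical types modeled on vhost's memory table, not actual QEMU code):

    #include <stddef.h>
    #include <stdint.h>

    /* RAM-only region, in the spirit of struct vhost_memory_region. */
    struct mem_region {
        uint64_t gpa;    /* guest-physical start of the region */
        uint64_t size;
        void    *hva;    /* host-virtual address of the mapping */
    };

    struct mem_table {
        size_t nregions;
        struct mem_region regions[8];
    };

    /* Flat GPA -> HVA lookup: a handful of compares per translation, versus
     * walking the full memory topology in the QEMU memory core. */
    static void *gpa_to_hva(const struct mem_table *mt, uint64_t gpa,
                            uint64_t len)
    {
        for (size_t i = 0; i < mt->nregions; i++) {
            const struct mem_region *r = &mt->regions[i];

            if (gpa >= r->gpa && gpa - r->gpa + len <= r->size) {
                return (uint8_t *)r->hva + (gpa - r->gpa);
            }
        }
        return NULL; /* not RAM known to the table */
    }
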
> >>
> >> Why is QEMU's native device emulation API not the natural choice for
> >> writing built-in devices? I don't understand why the vhost interface
> >> is desirable for built-in devices.
> >
> > Unless the memory helpers (like the address translations) were fully
> > optimized to satisfy this 10M+ PPS.
> >
> > Not sure if this is too hard, but the last time I benchmarked, perf told
> > me most of the time was spent in the translation.
> >
> > Using vhost is a workaround since its memory model is much simpler, so it
> > can skip lots of memory sections like I/O and ROM, etc.
>
> So, we can have a thread running as part of the QEMU process that implements
> vhost functionality for a virtio-net device, and this thread has an
> optimized way to access memory.  What prevents the current virtio-net
> emulation code from accessing memory in the same optimized way?

The current emulation uses memory core accessors, which need to take care
of a lot of stuff like MMIO or even P2P. That kind of thing has never been
a concern for vhost since day 0. You can do an experiment on this, e.g.
just dropping packets after fetching them from the TX ring.
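
A rough sketch of that experiment as a benchmark-only TX handler, using QEMU's
generic virtqueue API (illustrative, not a patch): pop each element off the TX
virtqueue and complete it without touching any backend, so what remains is
essentially the translation cost inside virtqueue_pop().

    #include "qemu/osdep.h"
    #include "hw/virtio/virtio.h"

    /* Benchmark-only: drop everything the guest transmits. */
    static void virtio_net_tx_drop_all(VirtIODevice *vdev, VirtQueue *vq)
    {
        VirtQueueElement *elem;

        for (;;) {
            /* virtqueue_pop() is where the descriptor reads and GPA->HVA
             * mappings through the memory core happen. */
            elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
            if (!elem) {
                break;
            }
            /* No copy, no backend call: just complete the request. */
            virtqueue_push(vq, elem, 0);
            g_free(elem);
        }
        virtio_notify(vdev, vq);
    }
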

> i.e. we likely don't
> actually need to implement the whole vhost-virtio communication protocol
> in order to have faster memory access from the device emulation code.
> I mean, if vhost can access device memory faster, why can't the device itself?

I'm not saying it can't, but it would end up with something similar to
vhost. That's why I'm saying using vhost is a shortcut (at least
for a PoC).

Thanks

>
> With that we could probably split the "datapath" part of the virtio-net
> emulation into a separate thread driven by an iothread loop.
>
> Then add a batch API for communication with a network backend (af-xdp) to
> avoid per-packet calls.
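
One possible shape for such a batch API (purely hypothetical names; QEMU's net
layer currently exposes per-packet entry points such as qemu_send_packet()):

    #include <stddef.h>
    #include <sys/uio.h>

    /* Hypothetical burst handed from the virtio-net TX path to a backend
     * like af-xdp: one call per burst instead of one call per packet. */
    typedef struct NetPacketBatch {
        struct iovec *pkts;   /* one iovec per packet */
        size_t        count;
    } NetPacketBatch;

    /* A backend callback of this shape could reserve count TX descriptors
     * at once, copy each packet into umem, then submit and kick once. */
    typedef size_t (*NetSendBatchFn)(void *backend_opaque,
                                     const NetPacketBatch *batch);
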
>
> These are 3 more or less independent tasks that should allow performance
> similar to a full-fledged vhost control and dataplane implementation
> inside QEMU.
>
> Or am I missing something? (Probably)
>
> >
> > Thanks
> >
> >>
> >>>
> >>> It's something like linking SPDK/DPDK to QEMU.
> >>
> >> Sergio Lopez tried loading vhost-user devices as shared libraries that
> >> run in the QEMU process. It worked as an experiment but wasn't pursued
> >> further.
> >>
> >> I think that might make sense in specific cases where there is an
> >> existing vhost-user codebase that needs to run as part of QEMU.
> >>
> >> In this case the AF_XDP code is new, so it's not a case of moving
> >> existing code into QEMU.
> >>
> >>>
> >>>>
> >>>>>>>> Regarding pinning - I wonder if that's something that can be refined
> >>>>>>>> in the kernel by adding an AF_XDP flag that enables on-demand pinning
> >>>>>>>> of umem. That way only rx and tx buffers that are currently in use
> >>>>>>>> will be pinned. The disadvantage is the runtime overhead to pin/unpin
> >>>>>>>> pages. I'm not sure whether it's possible to implement this; I
> >>>>>>>> haven't checked the kernel code.
> >>>>>>>
> >>>>>>> It requires the device to do page faults, which is not commonly
> >>>>>>> supported nowadays.
> >>>>>>
> >>>>>> I don't understand this comment. AF_XDP processes each rx/tx
> >>>>>> descriptor. At that point it can getuserpages() or similar in order to
> >>>>>> pin the page. When the memory is no longer needed, it can put those
> >>>>>> pages. No fault mechanism is needed. What am I missing?
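
On the kernel side, that idea would look roughly like this (hypothetical helper
using today's pin_user_pages_fast()/unpin_user_page(); the per-descriptor GUP
cost is exactly the PPS concern raised in the reply below):

    #include <linux/mm.h>
    #include <linux/errno.h>

    /* Hypothetical on-demand pinning while processing one rx/tx descriptor:
     * pin the page backing the umem address only for the lifetime of the
     * transfer, then release it again. */
    static int xsk_process_desc_pinned(unsigned long umem_addr)
    {
        struct page *page;
        int ret;

        ret = pin_user_pages_fast(umem_addr & PAGE_MASK, 1, FOLL_WRITE, &page);
        if (ret != 1)
            return ret < 0 ? ret : -EFAULT;

        /* ... copy or DMA using this page ... */

        unpin_user_page(page); /* "put those pages" once the transfer is done */
        return 0;
    }
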
> >>>>>
> >>>>> Ok, I think I kind of get you: you mean doing the pinning while
> >>>>> processing rx/tx buffers? It's not easy since GUP itself is not very
> >>>>> fast; it will hurt PPS for sure.
> >>>>
> >>>> Yes. It's not as fast as permanently pinning rx/tx buffers, but it
> >>>> supports unpinned guest RAM.
> >>>
> >>> Right, it's a balance between pinning and PPS. PPS seems to be more
> >>> important in this case.
> >>>
> >>>>
> >>>> There are variations on this approach, like keeping a certain amount
> >>>> of pages pinned after they have been used so the cost of
> >>>> pinning/unpinning can be avoided when the same pages are reused in the
> >>>> future, but I don't know how effective that is in practice.
> >>>>
> >>>> Is there a more efficient approach without relying on hardware page
> >>>> fault support?
> >>>
> >>> I guess so; I've seen some slides saying device page faults are very slow.
> >>>
> >>>>
> >>>> My understanding is that hardware page fault support is not yet
> >>>> deployed. We'd be left with pinning guest RAM permanently or using a
> >>>> runtime pinning/unpinning approach like I've described.
> >>>
> >>> Probably.
> >>>
> >>> Thanks
> >>>
> >>>>
> >>>> Stefan
> >>>>
> >>>
> >>
> >
>



