Re: [PATCH] net: add initial support for AF_XDP network backend
From: Ilya Maximets
Subject: Re: [PATCH] net: add initial support for AF_XDP network backend
Date: Fri, 30 Jun 2023 17:01:21 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.10.0
On 6/30/23 09:44, Jason Wang wrote:
> On Wed, Jun 28, 2023 at 7:14 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 6/28/23 05:27, Jason Wang wrote:
>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>
>>>> On 6/27/23 04:54, Jason Wang wrote:
>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>
>>>>>> On 6/26/23 08:32, Jason Wang wrote:
>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>
>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> AF_XDP is a network socket family that allows communication directly
>>>>>>>>> with the network device driver in the kernel, bypassing most or all
>>>>>>>>> of the kernel networking stack. In essence, the technology is
>>>>>>>>> pretty similar to netmap. But, unlike netmap, AF_XDP is Linux-native
>>>>>>>>> and works with any network interfaces without driver modifications.
>>>>>>>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
>>>>>>>>> require access to character devices or unix sockets. Only access to
>>>>>>>>> the network interface itself is necessary.
>>>>>>>>>
>>>>>>>>> This patch implements a network backend that communicates with the
>>>>>>>>> kernel by creating an AF_XDP socket. A chunk of userspace memory
>>>>>>>>> is shared between QEMU and the host kernel. 4 ring buffers (Tx, Rx,
>>>>>>>>> Fill and Completion) are placed in that memory along with a pool of
>>>>>>>>> memory buffers for the packet data. Data transmission is done by
>>>>>>>>> allocating one of the buffers, copying packet data into it and
>>>>>>>>> placing the pointer into Tx ring. After transmission, device will
>>>>>>>>> return the buffer via Completion ring. On Rx, device will take
>>>>>>>>> a buffer from a pre-populated Fill ring, write the packet data into
>>>>>>>>> it and place the buffer into Rx ring.
>>>>>>>>>
>>>>>>>>> AF_XDP network backend takes on the communication with the host
>>>>>>>>> kernel and the network interface and forwards packets to/from the
>>>>>>>>> peer device in QEMU.
>>>>>>>>>
>>>>>>>>> Usage example:
>>>>>>>>>
>>>>>>>>> -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>>>>>>>>> -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>>>>>>>>>
>>>>>>>>> XDP program bridges the socket with a network interface. It can be
>>>>>>>>> attached to the interface in 2 different modes:
>>>>>>>>>
>>>>>>>>> 1. skb - this mode should work for any interface and doesn't require
>>>>>>>>> driver support. With a caveat of lower performance.
>>>>>>>>>
>>>>>>>>> 2. native - this does require support from the driver and allows to
>>>>>>>>> bypass skb allocation in the kernel and potentially use
>>>>>>>>> zero-copy while getting packets in/out userspace.
>>>>>>>>>
>>>>>>>>> By default, QEMU will try to use native mode and fall back to skb.
>>>>>>>>> Mode can be forced via 'mode' option. To force 'copy' even in native
>>>>>>>>> mode, use 'force-copy=on' option. This might be useful if there is
>>>>>>>>> some issue with the driver.
>>>>>>>>>
>>>>>>>>> Option 'queues=N' allows to specify how many device queues should
>>>>>>>>> be open. Note that all the queues that are not open are still
>>>>>>>>> functional and can receive traffic, but it will not be delivered to
>>>>>>>>> QEMU. So, the number of device queues should generally match the
>>>>>>>>> QEMU configuration, unless the device is shared with something
>>>>>>>>> else and the traffic re-direction to appropriate queues is correctly
>>>>>>>>> configured on a device level (e.g. with ethtool -N).
>>>>>>>>> 'start-queue=M' option can be used to specify from which queue id
>>>>>>>>> QEMU should start configuring 'N' queues. It might also be necessary
>>>>>>>>> to use this option with certain NICs, e.g. MLX5 NICs. See the docs
>>>>>>>>> for examples.
>>>>>>>>>
>>>>>>>>> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
>>>>>>>>> capabilities in order to load default XSK/XDP programs to the
>>>>>>>>> network interface and configure BTF maps.
>>>>>>>>
>>>>>>>> I think you mean "BPF" actually?
>>>>>>
>>>>>> "BPF Type Format maps" kind of makes some sense, but yes. :)
>>>>>>
>>>>>>>>
>>>>>>>>> It is possible, however,
>>>>>>>>> to run only with CAP_NET_RAW.
>>>>>>>>
>>>>>>>> Qemu often runs without any privileges, so we need to fix it.
>>>>>>>>
>>>>>>>> I think adding support for SCM_RIGHTS via monitor would be a way to go.
>>>>>>
>>>>>> I looked through the code and it seems like we can run completely
>>>>>> non-privileged as far as the kernel is concerned. We'll need an API
>>>>>> modification in libxdp though.
>>>>>>
>>>>>> The thing is, IIUC, the only syscall that requires CAP_NET_RAW is
>>>>>> a base socket creation. Binding and other configuration doesn't
>>>>>> require any privileges. So, we could create a socket externally
>>>>>> and pass it to QEMU.
>>>>>
>>>>> That's the way TAP works for example.
>>>>>
>>>>>> Should work, unless it's an oversight from
>>>>>> the kernel side that needs to be patched. :) libxdp doesn't have
>>>>>> a way to specify externally created socket today, so we'll need
>>>>>> to change that. Should be easy to do though. I can explore.
>>>>>
>>>>> Please do that.
>>>>
>>>> I have a prototype:
>>>>
>>>> https://github.com/igsilya/xdp-tools/commit/db73e90945e3aa5e451ac88c42c83cb9389642d3
>>>>
>>>> Need to test it out and then submit PR to xdp-tools project.
The change is now accepted:
https://github.com/xdp-project/xdp-tools/commit/740c839806a02517da5bce7bd0ccaba908b3f675
I can update the QEMU patch with support for passing socket fds. It may
look like this:
-netdev af-xdp,eth0,queues=2,inhibit=on,sock-fds=fd1,fd2
We'll need an fd per queue. And we may require these fds to be already
added to the xsks map, so QEMU doesn't need xsks-map-fd.
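For illustration, the fd handoff itself could look roughly like the sketch below: a privileged helper sends an already created socket over a unix socket using SCM_RIGHTS, and the non-privileged side picks it up. send_fd() and recv_fd() are hypothetical names, not QEMU or libxdp API, and nothing here is AF_XDP specific, since any file descriptor is passed the same way.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: send one fd over a connected unix socket.
 * Returns 0 on success, -1 on failure. */
static int send_fd(int chan, int fd)
{
    /* Linux stream sockets need at least one byte of ordinary data
     * to go along with the ancillary data. */
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;               /* forces correct alignment */
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(chan, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd sent by send_fd().
 * Returns the new fd, or -1 on failure. */
static int recv_fd(int chan)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(chan, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

With one fd per queue, the sender would call send_fd() once per queue, or pack several fds into a single control message.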
I'd say we'll need to compile support for that conditionally based on
availability of xsk_umem__create_with_fd() as it may not be available
in distributions for some time.
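A rough sketch of what the conditional support could look like on the C side. HAVE_XSK_UMEM_CREATE_WITH_FD is a hypothetical macro that the build system would define only when libxdp provides xsk_umem__create_with_fd(). Both function names below are made up for illustration and do not come from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* True only if the build found xsk_umem__create_with_fd() in libxdp.
 * HAVE_XSK_UMEM_CREATE_WITH_FD is an assumed, build-system-provided macro. */
static bool af_xdp_sock_fds_supported(void)
{
#ifdef HAVE_XSK_UMEM_CREATE_WITH_FD
    return true;
#else
    return false;
#endif
}

/* Hypothetical option check: returns NULL if the option can be used,
 * or a static error string if 'sock-fds' was requested on a build
 * without the required libxdp function. */
static const char *af_xdp_check_sock_fds(bool has_sock_fds)
{
    if (has_sock_fds && !af_xdp_sock_fds_supported()) {
        return "'sock-fds' requires libxdp with xsk_umem__create_with_fd()";
    }
    return NULL;
}
```

This way the option is rejected up front with a clear message on older libxdp, instead of failing later during socket setup.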
Alternative is to require libxdp >= 1.4.0, which is not released yet.
The last restriction will be that QEMU will need 32 MB of RLIMIT_MEMLOCK
per queue for umem registration, but that should not be a huge deal, right?
Alternative is to have CAP_IPC_LOCK.
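To make the RLIMIT_MEMLOCK arithmetic concrete, here is a small sketch that compares the soft limit with 32 MB per queue. The macro and function names are hypothetical. The check is deliberately simplistic: it ignores memory the process has already locked and does not account for CAP_IPC_LOCK.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/resource.h>

/* 32 MB per queue is the figure mentioned above; the name is made up. */
#define UMEM_LOCK_BYTES_PER_QUEUE (32ULL * 1024 * 1024)

/* Bytes of lockable memory needed to register umem for 'queues' queues. */
static uint64_t memlock_needed(unsigned int queues)
{
    return (uint64_t)queues * UMEM_LOCK_BYTES_PER_QUEUE;
}

/* Returns true if the soft RLIMIT_MEMLOCK is large enough for 'queues'
 * umem registrations, false if it is too small or cannot be read. */
static bool memlock_limit_sufficient(unsigned int queues)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        return false;
    }
    if (rl.rlim_cur == RLIM_INFINITY) {
        return true;
    }
    return (uint64_t)rl.rlim_cur >= memlock_needed(queues);
}
```

For queues=2 this means 64 MB of lockable memory, so a management layer would have to raise the limit accordingly before starting QEMU.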
And I'd keep the xsks-map-fd parameter for setups that do not have latest
libxdp and can allow CAP_NET_RAW. So, they could still do:
-netdev af-xdp,eth0,queues=2,inhibit=on,xsks-map-fd=fd
What do you think?
>>>>
>>>>>
>>>>>>
>>>>>> In case the bind syscall will actually need CAP_NET_RAW for some
>>>>>> reason, we could change the kernel and allow non-privileged bind
>>>>>> by utilizing, e.g. SO_BINDTODEVICE. i.e., let the privileged
>>>>>> process bind the socket to a particular device, so QEMU can't
>>>>>> bind it to a random one. Might be a good use case to allow even
>>>>>> if not strictly necessary.
>>>>>
>>>>> Yes.
>>>>
>>>> Will propose something for a kernel as well. We might want something
>>>> more granular though, e.g. bind to a queue instead of a device. In
>>>> case we want better control in the device sharing scenario.
>>>
>>> I may miss something but the bind is already done at dev plus queue
>>> right now, isn't it?
>>
>>
>> Yes, the bind() syscall will bind socket to the dev+queue. I was talking
>> about SO_BINDTODEVICE that only ties the socket to a particular device,
>> but not a queue.
>>
>> Assuming SO_BINDTODEVICE is implemented for AF_XDP sockets and
>> assuming a privileged process does:
>>
>> fd = socket(AF_XDP, ...);
>> setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, <device>);
>>
>> And sends fd to a non-privileged process. That non-privileged process
>> will be able to call:
>>
>> bind(fd, <device>, <random queue>);
>>
>> It will have to use the same device, but can choose any queue, if that
>> queue is not already busy with another socket.
>>
>> So, I was thinking maybe implementing something like XDP_BINDTOQID option.
>> This way the privileged process may call:
>>
>> setsockopt(fd, SOL_XDP, XDP_BINDTOQID, <device>, <queue>);
>>
>> And later kernel will be able to refuse bind() for any other queue for
>> this particular socket.
>
> Not sure, if file descriptor passing works, we probably don't need another
> way.
>
>>
>> Not sure if that is necessary though.
>> Since we're allocating the socket in the privileged process, that process
>> may add the socket to the BPF map on the correct queue id. This way the
>> non-privileged process will not be able to receive any packets from any
>> other queue on this socket, even if bound to it. And no other AF_XDP
>> socket will be able to be bound to that other queue as well.
>
> I think that's by design, or anything wrong with this model?
No, should be fine. I've posted a simple SO_BINDTODEVICE change to bpf-next
as an RFC for now since the tree is closed:
https://lore.kernel.org/netdev/20230630145831.2988845-1-i.maximets@ovn.org/
Will re-send a non-RFC once it is open (after 10th of July, IIRC).
>
>> So, the
>> rogue QEMU will be able to hog one extra queue, but it will not be able
>> to intercept any traffic from it, AFAICT. May not be a huge problem
>> after all.
>>
>> SO_BINDTODEVICE would still be nice to have. Especially for cases where
>> we give the whole device to one VM.
>
> Then we need to use AF_XDP in the guest which seems to be a different
> topic. Alibaba is working on the AF_XDP support for virtio-net.
>
> Thanks
>
>>
>> Best regards, Ilya Maximets.
>>
>