qemu-devel
Re: [RFC v4 PATCH 00/49] Initial support of multi-process qemu - status


From: Daniel P . Berrangé
Subject: Re: [RFC v4 PATCH 00/49] Initial support of multi-process qemu - status update
Date: Thu, 19 Dec 2019 12:55:04 +0000
User-agent: Mutt/1.12.1 (2019-06-15)

On Thu, Dec 19, 2019 at 12:33:15PM +0000, Felipe Franciosi wrote:
> Hello,
> 
> (I've added Jim and Ben from the SPDK team to the thread.)
> 
> > On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <address@hidden> wrote:
> > 
> > On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote:
> >>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <address@hidden> wrote:
> >>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
> >>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <address@hidden> wrote:
> >>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
> >>> Questions I've seen when discussing muser with people have been:
> >>> 
> >>> 1. Can unprivileged containers create muser devices?  If not, this is a
> >>>  blocker for use cases that want to avoid root privileges entirely.
> >> 
> >> Yes you can. Muser device creation follows the same process as general
> >> mdev device creation (ie. you write to a sysfs path). That creates an
> >> entry in /dev/vfio and the control plane can further drop privileges
> >> there (set selinux contexts, &c.)
> > 
> > In this case there is still a privileged step during setup.  What about
> > completely unprivileged scenarios like a regular user without root or a
> > rootless container?
> 
> Oh, I see what you are saying. I suppose we need to investigate
> adjusting the privileges of the sysfs path correctly beforehand to
> allow devices to be created by non-root users. The credentials used on
> creation should be reflected on the vfio endpoint (ie. /dev/vfio/<group>).
> 
> I need to look into that and get back to you.
> 
> > 
> >>> 2. Does muser need to be in the kernel (e.g. slower to develop/ship,
> >>>  security reasons)?  A similar library could be implemented in
> >>>  userspace along the lines of the vhost-user protocol.  Although VMMs
> >>>  would then need to use a new libmuser-client library instead of
> >>>  reusing their VFIO code to access the device.
> >> 
> >> Doing it in userspace was the flow we proposed back in last year's KVM
> >> Forum (Edinburgh), but it got turned down. That's why we pursued the
> >> kernel approach, which turned out to have some advantages:
> >> - No changes needed to Qemu
> >> - No Qemu needed at all for userspace drivers
> >> - Device emulation process restart is trivial
> >>  (it therefore makes device code upgrades much easier)
> >> 
> >> Having said that, nothing stops us from enhancing libmuser to talk
> >> directly to Qemu (for the Qemu case). I envision at least two ways of
> >> doing that:
> >> - Hooking up libmuser with Qemu directly (eg. over a unix socket)
> >> - Hooking Qemu with CUSE and implementing the muser.ko interface
> >> 
> >> For the latter, libmuser would talk to a character device just like it
> >> talks to the vfio character device. We "just" need to implement that
> >> backend in Qemu. :)
> > 
> > What about:
> > * libmuser's API stays mostly unchanged but the library speaks a
> >   VFIO-over-UNIX domain sockets protocol instead of talking to
> >   mdev/vfio in the host kernel.
> 
> As I said above, there are advantages to the kernel model. The key one
> is transparent device emulation restarts. Today, muser.ko keeps the
> "device memory" internally in a prefix tree. Upon restart, a new
> device emulator can recover state (eg. from a state file in /dev/shm
> or similar) and remap the same memory that is already configured to
> the guest via Qemu. We have a pending work item for muser.ko to also
> keep the eventfds so we can recover those, too. Another advantage is
> working with any userspace driver and not requiring a VMM at all.
> 
> If done entirely in userspace, the device emulator needs to allocate
> the device memory somewhere that remains accessible (eg. tmpfs), with
> the difference that now we may be talking about non-trivial amounts of
> memory. Also, that may not be the kind of content you want lingering
> around the filesystem (for the same reasons Qemu unlinks memory files
> from /dev/hugepages after mmap'ing it).
> 
> That's why I'd prefer to rephrase what you said to "in addition"
> instead of "instead".
> 
> > * VMMs can implement this protocol directly for POSIX-portable and
> >   unprivileged operation.
> > * A CUSE VFIO adapter simulates /dev/vfio so that VFIO-only VMMs can
> >   still take advantage of libmuser devices.
> 
> I'm happy with that.
> We need to think the credential aspect through to ensure nodes can
> be created in the right places with the right privileges.
> 
> > 
> > Assuming this is feasible, would you lose any important
> > features/advantages of the muser.ko approach?  I don't know enough about
> > VFIO to identify any blocker or obvious performance problems.
> 
> That's what I elaborated above. The fact that muser.ko can keep
> various metadata (and other resources) about the device in the kernel
> and grant it back to userspace as needed. There are ways around it,
> but it requires some orchestration with tmpfs and the VMM (only so
> much can be kept in tmpfs; the eventfds need to be retransmitted from
> the machine emulator on request).
> 
> Restarting is a critical aspect of this. One key use case for the
> project is to be able to emulate various devices from one process (for
> polling). That must be able to restart for upgrades or recovery.
> 
> > 
> > Regarding recovery, it seems straightforward to keep state in a tmpfs
> > file that can be reopened when the device is restarted.  I don't think
> > kernel code is necessary?
> 
> It adds a dependency, but isn't a show stopper. If we can work through
> permission issues, making sure the VMM can reconnect and retransmit
> eventfds and other state, then it should be ok.
> 
> To be clear: I'm very happy to have a userspace-only option for this,
> I just don't want to ditch the kernel module (yet, anyway). :)

If it doesn't create too large a burden to support both, then I think
it is very desirable. IIUC, this positions the kernel-based solution as
the optimized/optimal solution, and the userspace UNIX-socket-based option
as the generic "works everywhere" fallback solution.
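
For the UNIX socket option, the usual way to hand file descriptors (such as
eventfds) from one process to another is SCM_RIGHTS ancillary data. The
sketch below is only an illustration of that mechanism using Python's
socket.send_fds() / socket.recv_fds() (Python 3.9+). It does not follow any
existing muser or QEMU wire format. The names send_fd, recv_fd and the
"eventfd0" tag are hypothetical, and a pipe stands in for an eventfd to keep
the example portable.

```python
import os
import socket


def send_fd(sock, fd, tag=b"eventfd0"):
    """Send one descriptor plus a small tag as SCM_RIGHTS ancillary data."""
    socket.send_fds(sock, [tag], [fd])


def recv_fd(sock):
    """Receive the tag and a single descriptor from the peer."""
    msg, fds, _flags, _addr = socket.recv_fds(sock, 64, 1)
    return msg, fds[0]


def demo():
    # Stand-ins: "vmm" is the side that owns the descriptor,
    # "emulator" is the restarted device emulation process.
    vmm, emulator = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    r, w = os.pipe()  # assumption: a pipe plays the role of an eventfd
    try:
        send_fd(vmm, w)
        tag, w2 = recv_fd(emulator)
        try:
            os.write(w2, b"\x01")  # emulator signals through the received fd
            got = os.read(r, 1)    # owner observes the notification
        finally:
            os.close(w2)
        return tag, got
    finally:
        os.close(r)
        os.close(w)
        vmm.close()
        emulator.close()


if __name__ == "__main__":
    print(demo())
```

In a real deployment the two ends would be separate processes connected
through a named socket, and the receiver would get a fresh descriptor number
referring to the same underlying kernel object.
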



Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
