From: Elena Ufimtseva
Subject: Re: [RFC v4 PATCH 48/49] multi-process: add the concept description to docs/devel/qemu-multiprocess
Date: Fri, 25 Oct 2019 12:33:51 -0700
User-agent: Mutt/1.9.4 (2018-02-28)

On Thu, Oct 24, 2019 at 05:09:29AM -0400, Jagannathan Raman wrote:
> From: John G Johnson <address@hidden>
> 
> Signed-off-by: John G Johnson <address@hidden>
> Signed-off-by: Elena Ufimtseva <address@hidden>
> Signed-off-by: Jagannathan Raman <address@hidden>
> ---
>  v2 -> v3:
>    - Updated with latest design of this project
> 
>  v3 -> v4:
>   - Updated document to RST format
>

Hi,

The automated tests reported a warning against this patch because the
toctree entry for the multi-process document in docs/devel/index.rst is
incorrect:

"/tmp/qemu-test/src/docs/devel/index.rst:13:toctree contains reference to
nonexisting document 'multi-process'".

We have a corrected version of this patch. Should it be sent with the
next series, or can the corrected version be attached here?

Thank you!

Elena, Jag and JJ.  
>  docs/devel/index.rst             |    1 +
>  docs/devel/qemu-multiprocess.rst | 1102 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 1103 insertions(+)
>  create mode 100644 docs/devel/qemu-multiprocess.rst
> 
> diff --git a/docs/devel/index.rst b/docs/devel/index.rst
> index 1ec61fc..edd3fe3 100644
> --- a/docs/devel/index.rst
> +++ b/docs/devel/index.rst
> @@ -22,3 +22,4 @@ Contents:
>     decodetree
>     secure-coding-practices
>     tcg
> +   multi-process
> diff --git a/docs/devel/qemu-multiprocess.rst b/docs/devel/qemu-multiprocess.rst
> new file mode 100644
> index 0000000..2c42c6e
> --- /dev/null
> +++ b/docs/devel/qemu-multiprocess.rst
> @@ -0,0 +1,1102 @@
> +Disaggregating QEMU
> +===================
> +
> +QEMU is often used as the hypervisor for virtual machines running in the
> +Oracle cloud. Since one of the advantages of cloud computing is the
> +ability to run many VMs from different tenants in the same cloud
> +infrastructure, a guest that compromised its hypervisor could
> +potentially use the hypervisor's access privileges to access data it is
> +not authorized for.
> +
> +QEMU can be susceptible to security attacks because it is a large,
> +monolithic program that provides many features to the VMs it services.
> +Many of these features can be configured out of QEMU, but even a reduced
> +configuration of QEMU has a large amount of code a guest can potentially
> +attack in order to gain additional privileges.
> +
> +QEMU services
> +-------------
> +
> +QEMU can be broadly described as providing three main services. One is a
> +VM control point, where VMs can be created, migrated, re-configured, and
> +destroyed. A second is to emulate the CPU instructions within the VM,
> +often accelerated by HW virtualization features such as Intel's VT
> +extensions. Finally, it provides IO services to the VM by emulating HW
> +IO devices, such as disk and network devices.
> +
> +A disaggregated QEMU
> +~~~~~~~~~~~~~~~~~~~~
> +
> +A disaggregated QEMU involves separating QEMU services into separate
> +host processes. Each of these processes can be given only the privileges
> +it needs to provide its service, e.g., a disk service could be given
> +access only to the disk images it provides, and not be allowed to
> +access other files, or any network devices. An attacker who compromised
> +this service would not be able to use this exploit to access files or
> +devices beyond what the disk service was given access to.
> +
> +A QEMU control process would remain, but in disaggregated mode, it would
> +be a control point that executes the processes needed to support the VM
> +being created, but have no direct interfaces to the VM. During VM
> +execution, it would still provide the user interface to hot-plug devices
> +or live migrate the VM.
> +
> +A first step in creating a disaggregated QEMU is to separate IO services
> +from the main QEMU program, which would continue to provide CPU
> +emulation. i.e., the control process would also be the CPU emulation
> +process. In a later phase, CPU emulation could be separated from the
> +control process.
> +
> +Disaggregating IO services
> +--------------------------
> +
> +Disaggregating IO services is a good place to begin QEMU disaggregation
> +for a couple of reasons. One is that the sheer number of IO devices QEMU
> +can emulate provides a large surface of interfaces which could
> +potentially be exploited, and, indeed, has been a source of exploits in
> +the past. Another is that the modular nature of QEMU device emulation
> +code provides interface points where the QEMU functions that perform
> +device emulation can be separated from the QEMU functions that manage
> +the emulation of guest CPU instructions.
> +
> +QEMU device emulation
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +QEMU uses an object-oriented SW architecture for device emulation code.
> +Configured objects are all compiled into the QEMU binary, then objects
> +are instantiated by name when used by the guest VM. For example, the
> +code to emulate a device named "foo" is always present in QEMU, but its
> +instantiation code is only run when the device is included in the target
> +VM. (e.g., via the QEMU command line as *-device foo*)
> +
> +The object model is hierarchical, so device emulation code names its
> +parent object (such as "pci-device" for a PCI device) and QEMU will
> +instantiate a parent object before calling the device's instantiation
> +code.
> +
> +Current separation models
> +~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +In order to separate the device emulation code from the CPU emulation
> +code, the device object code must run in a different process. There are
> +a couple of existing QEMU features that can run emulation code
> +separately from the main QEMU process. These are examined below.
> +
> +vhost user model
> +^^^^^^^^^^^^^^^^
> +
> +Virtio guest device drivers can be connected to vhost user applications
> +in order to perform their IO operations. This model uses special virtio
> +device drivers in the guest and vhost user device objects in QEMU, but
> +once the QEMU vhost user code has configured the vhost user application,
> +mission-mode IO is performed by the application. The vhost user
> +application is a daemon process that can be contacted via a known UNIX
> +domain socket.
> +
> +vhost socket
> +''''''''''''
> +
> +As mentioned above, one of the tasks of the vhost device object within
> +QEMU is to contact the vhost application and send it configuration
> +information about this device instance. As part of the configuration
> +process, the application can also be sent other file descriptors over
> +the socket, which then can be used by the vhost user application in
> +various ways, some of which are described below.
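> +
> +As a rough illustration, file descriptors are passed over the UNIX
> +domain socket with ``sendmsg()`` and a *SCM_RIGHTS* control message.
> +The sketch below shows a single descriptor being sent; error handling
> +is omitted:
> +
> +::
> +
> +    #include <string.h>
> +    #include <sys/socket.h>
> +
> +    /* send one file descriptor 'fd' over the connected socket 'sock' */
> +    static int send_fd(int sock, int fd)
> +    {
> +        char data = 0;                     /* at least one byte of payload */
> +        struct iovec iov = { .iov_base = &data, .iov_len = 1 };
> +        union {
> +            struct cmsghdr hdr;
> +            char buf[CMSG_SPACE(sizeof(int))];
> +        } u;
> +        struct msghdr msg = {
> +            .msg_iov = &iov, .msg_iovlen = 1,
> +            .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
> +        };
> +        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
> +
> +        cmsg->cmsg_level = SOL_SOCKET;
> +        cmsg->cmsg_type = SCM_RIGHTS;
> +        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
> +        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
> +
> +        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
> +    }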
> +
> +vhost MMIO store acceleration
> +'''''''''''''''''''''''''''''
> +
> +VMs are often run using HW virtualization features via the KVM kernel
> +driver. This driver allows QEMU to accelerate the emulation of guest CPU
> +instructions by running the guest in a virtual HW mode. When the guest
> +executes instructions that cannot be executed by virtual HW mode,
> +execution returns to the KVM driver so it can inform QEMU to emulate the
> +instructions in SW.
> +
> +One of the events that can cause a return to QEMU is when a guest device
> +driver accesses an IO location. QEMU then dispatches the memory
> +operation to the corresponding QEMU device object. In the case of a
> +vhost user device, the memory operation would need to be sent over a
> +socket to the vhost application. This path is accelerated by the QEMU
> +virtio code by setting up an eventfd file descriptor from which the
> +vhost application can directly receive MMIO store notifications from
> +the KVM driver, instead of needing them to be sent to the QEMU process
> +first.
> +
> +vhost interrupt acceleration
> +''''''''''''''''''''''''''''
> +
> +Another optimization used by the vhost application is the ability to
> +directly inject interrupts into the VM via the KVM driver, again,
> +bypassing the need to send the interrupt back to the QEMU process first.
> +The QEMU virtio setup code configures the KVM driver with an eventfd
> +that triggers the device interrupt in the guest when the eventfd is
> +written. This irqfd file descriptor is then passed to the vhost user
> +application program.
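> +
> +At the KVM API level, the binding is roughly the following sketch: an
> +*eventfd* is created and handed to the *KVM_IRQFD* ioctl on the per-VM
> +descriptor, after which a write to the eventfd raises the given guest
> +interrupt (GSI). QEMU wraps these steps in its own helpers, so this raw
> +form is only an illustration of the mechanism:
> +
> +::
> +
> +    #include <sys/eventfd.h>
> +    #include <sys/ioctl.h>
> +    #include <linux/kvm.h>
> +
> +    /* 'vm_fd' is the per-VM KVM descriptor; 'gsi' is the guest interrupt */
> +    static int bind_irqfd(int vm_fd, unsigned gsi)
> +    {
> +        int efd = eventfd(0, EFD_CLOEXEC);
> +        struct kvm_irqfd irqfd = { .fd = efd, .gsi = gsi };
> +
> +        if (ioctl(vm_fd, KVM_IRQFD, &irqfd) < 0) {
> +            return -1;
> +        }
> +        return efd;    /* passed to the vhost user application */
> +    }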
> +
> +vhost access to guest memory
> +''''''''''''''''''''''''''''
> +
> +The vhost application is also allowed to directly access guest memory,
> +instead of needing to send the data as messages to QEMU. This is also
> +done with file descriptors sent to the vhost user application by QEMU.
> +These descriptors can be passed to ``mmap()`` by the vhost application
> +to map the guest address space into the vhost application.
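> +
> +The mapping itself is a plain ``mmap()`` of the received descriptors. A
> +minimal sketch, assuming each memory table entry carries a file
> +descriptor, a size and an offset (the field names here are
> +illustrative):
> +
> +::
> +
> +    #include <sys/mman.h>
> +
> +    /* map one guest memory region described by a received table entry */
> +    static void *map_guest_region(int fd, size_t size, off_t offset)
> +    {
> +        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
> +                       fd, offset);
> +        return p == MAP_FAILED ? NULL : p;
> +    }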
> +
> +IOMMUs introduce another level of complexity, since the address given to
> +the guest virtio device to DMA to or from is not a guest physical
> +address. This case is handled by having vhost code within QEMU register
> +as a listener for IOMMU mapping changes. The vhost application maintains
> +a cache of IOMMU translations: sending translation requests back to
> +QEMU on cache misses, and in turn receiving flush requests from QEMU
> +when mappings are purged.
> +
> +applicability to device separation
> +''''''''''''''''''''''''''''''''''
> +
> +Much of the vhost model can be re-used by separated device emulation. In
> +particular, the ideas of using a socket between QEMU and the device
> +emulation application, using a file descriptor to inject interrupts into
> +the VM via KVM, and allowing the application to ``mmap()`` the guest
> +should be re-used.
> +
> +There are, however, some notable differences between how a vhost
> +application works and the needs of separated device emulation. The most
> +basic is that vhost uses custom virtio device drivers which always
> +trigger IO with MMIO stores. A separated device emulation model must
> +work with existing IO device models and guest device drivers. MMIO loads
> +break vhost store acceleration since they are synchronous - guest
> +progress cannot continue until the load has been emulated. By contrast,
> +stores are asynchronous; the guest can continue after the store event
> +has been sent to the vhost application.
> +
> +Another difference is that in the vhost user model, a single daemon can
> +support multiple QEMU instances. This is contrary to the security regime
> +desired, in which the emulation application should only be allowed to
> +access the files or devices the VM it's running on behalf of can access.
> +
> +qemu-io model
> +^^^^^^^^^^^^^
> +
> +Qemu-io is a test harness used to test changes to the QEMU block backend
> +object code (e.g., the code that implements disk images for disk driver
> +emulation). Qemu-io is not a device emulation application per se, but it
> +does compile the QEMU block objects into a separate binary from the main
> +QEMU one. This could be useful for disk device emulation, since its
> +emulation applications will need to include the QEMU block objects.
> +
> +New separation model based on proxy objects
> +-------------------------------------------
> +
> +A different model based on proxy objects in the QEMU program
> +communicating with remote emulation programs could provide separation
> +while minimizing the changes needed to the device emulation code. The
> +rest of this section is a discussion of how a proxy object model would
> +work.
> +
> +Remote emulation processes
> +~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The remote emulation process will run the QEMU object hierarchy without
> +modification. The device emulation objects will also be based on the
> +QEMU code, because for anything but the simplest device, it would not
> +be tractable to re-implement both the object model and the many device
> +backends that QEMU has.
> +
> +The processes will communicate with the QEMU process over UNIX domain
> +sockets. The processes can be executed either as standalone processes,
> +or be executed by QEMU. In both cases, the host backends the emulation
> +processes will provide are specified on their command lines, as they would
> +be for QEMU. For example:
> +
> +::
> +
> +    disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0  \
> +    -blockdev driver=qcow2,node-name=drive0,file=file0
> +
> +would indicate process *disk-proc* uses a qcow2 emulated disk named
> +*drive0* (backed by the image file *disk-file0*) as its backend.
> +
> +Emulation processes may emulate more than one guest controller. A common
> +configuration might be to put all controllers of the same device class
> +(e.g., disk, network, etc.) in a single process, so that all backends of
> +the same type can be managed by a single QMP monitor.
> +
> +communication with QEMU
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Remote emulation processes will recognize a *-socket* argument that
> +specifies the path of a UNIX domain socket used to communicate with the
> +QEMU process. If no *-socket* argument is present, the process will use
> +file descriptor 0 to communicate with QEMU. For example,
> +
> +::
> +
> +    disk-proc -socket /tmp/disk0-sock <backend list>
> +
> +will communicate with QEMU using the socket path */tmp/disk0-sock*.
> +
> +remote process QMP monitor
> +^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Remote emulation processes can be monitored via QMP, similar to QEMU
> +itself. The QMP monitor socket is specified in the same way as for a QEMU
> +process:
> +
> +::
> +
> +    disk-proc -qmp unix:/tmp/disk-mon,server
> +
> +can be monitored over the UNIX socket path */tmp/disk-mon*.
> +
> +QEMU command line
> +~~~~~~~~~~~~~~~~~
> +
> +The QEMU command line options will need to be modified to indicate which
> +items are emulated by a separate program, and which remain emulated by
> +QEMU itself.
> +
> +identifying remote emulation processes
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Remote emulation processes will be identified to QEMU using a *-remote*
> +command line option. This option can either specify a command that QEMU
> +will execute, or can specify a UNIX domain socket that QEMU can use to
> +connect to an existing process. Both forms require an "id" option that
> +identifies the process to later *-device* options. The process version
> +is:
> +
> +::
> +
> +    -remote id=disk-proc,command="disk-proc <backend list>"
> +
> +And the socket version is:
> +
> +::
> +
> +    -remote id=disk-proc,socket="/tmp/disk0-sock"
> +
> +In the latter case, the remote process must be given the same socket on
> +its command line when it is executed:
> +
> +::
> +
> +    disk-proc -socket /tmp/disk0-sock <backend list>
> +
> +identifying devices emulated remotely
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Devices that are to be emulated in a separate process will identify the
> +remote process with a "remote" option on their *-device* command
> +line specification. e.g., an LSI SCSI controller and disk can be
> +specified as:
> +
> +::
> +
> +    -device lsi53c895a,id=scsi0
> +    -device scsi-hd,drive=drive0,bus=scsi0.0,scsi-id=0
> +
> +If these devices are emulated by remote process "disk-proc," as
> +described in the previous section, the QEMU command line would be:
> +
> +::
> +
> +    -device lsi53c895a,id=scsi0,remote=disk-proc
> +    -device scsi-hd,drive=drive0,bus=scsi0.0,scsi-id=0,remote=disk-proc
> +
> +Some devices are implicitly created by the machine object. e.g., the q35
> +machine object will create its PCI bus, and attach an ich9-ahci IDE
> +controller to it. In this case, options will need to be added to the
> +*-machine* command line. e.g.,
> +
> +::
> +
> +    -machine pc-q35,ide-remote=disk-proc
> +
> +will use the remote process with an "id" of "disk-proc" to emulate the
> +IDE controller and its disks.
> +
> +The disks themselves still need to be specified with the "remote"
> +device option, as in the example above. e.g.,
> +
> +::
> +
> +    -device ide-hd,drive=drive0,bus=ide.0,unit=0,remote=disk-proc
> +
> +QEMU management of remote processes
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Each *-remote* instance on the QEMU command line will create a remote
> +process proxy instance in QEMU. They will be held on a *QList* that can
> +be searched for by its "id" property. The remote process proxy will also
> +establish a communication channel between QEMU and the remote process.
> +This can be done in one of two ways: direct execution of the
> +process by QEMU with ``fork()`` and ``exec()`` system calls, or by
> +connecting to an existing process.
> +
> +direct execution
> +^^^^^^^^^^^^^^^^
> +
> +When the remote process is directly executed, the remote process proxy
> +will set up a communication channel between itself and the emulation
> +process. This channel will be created using ``socketpair()`` and the
> +remote process side of the pair will be given to the process as file
> +descriptor 0.
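> +
> +A sketch of the direct execution path, assuming the remote command line
> +has already been split into an argument vector and with error handling
> +mostly omitted:
> +
> +::
> +
> +    #include <sys/socket.h>
> +    #include <unistd.h>
> +
> +    /* returns the QEMU side of the channel, or -1 on error */
> +    static int launch_remote(char *const argv[])
> +    {
> +        int sv[2];
> +
> +        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
> +            return -1;
> +        }
> +        if (fork() == 0) {
> +            dup2(sv[1], 0);        /* remote side becomes file descriptor 0 */
> +            close(sv[0]);
> +            close(sv[1]);
> +            execvp(argv[0], argv);
> +            _exit(1);
> +        }
> +        close(sv[1]);
> +        return sv[0];
> +    }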
> +
> +connecting to an existing process
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Some environments wish to deny QEMU the ability to execute ``fork()``
> +and ``exec()``. In these cases, emulation processes will be started before
> +QEMU, and a UNIX domain socket will be given to each emulation process
> +to communicate with QEMU over. After communication is established, the
> +socket will be unlinked from the file system space by the QEMU process.
> +
> +communication with emulation process
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +primary socket
> +''''''''''''''
> +
> +Whether the process was executed by QEMU or externally, there will be a
> +primary socket for communication between QEMU and the remote process.
> +This channel will handle configuration commands from QEMU to the
> +process, either from the QEMU command line, or from QMP commands that
> +affect the devices being emulated by the process. This channel will only
> +allow one message to be pending at a time; if additional messages
> +arrive, they must wait for previous ones to be acknowledged from the
> +remote side.
> +
> +secondary sockets
> +'''''''''''''''''
> +
> +The primary socket can pass the file descriptors of secondary sockets
> +for operations that occur in parallel with commands on the primary
> +channel. These include MMIO operations generated by the guest, interrupt
> +notifications generated by the devices being emulated, or *vmstate* for
> +live migration. These secondary sockets will be created at the behest of
> +the device proxies that require them. A disk device proxy wouldn't need
> +any secondary sockets, but a disk controller device proxy may need both
> +an MMIO socket and an interrupt socket.
> +
> +emulation process attached via QMP command
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +There will be a new "attach-process" QMP command to facilitate device
> +hot-plug. This command's arguments will be the same as the *-remote*
> +command line when it's used to attach to a remote process. i.e., it will
> +need an "id" argument so that hot-plugged devices can later find it, and
> +a "socket" argument to identify the UNIX domain socket that will be used
> +to communicate with QEMU.
> +
> +QEMU device proxy objects
> +~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +QEMU has an object model based on sub-classes inherited from the
> +"object" super-class. The sub-classes that are of interest here are the
> +"device" and "bus" sub-classes whose child sub-classes make up the
> +device tree of a QEMU emulated system.
> +
> +The proxy object model will use device proxy objects to replace the
> +device emulation code within the QEMU process. These objects will live
> +in the same place in the object and bus hierarchies as the objects they
> +replace. i.e., the proxy object for an LSI SCSI controller will be a
> +sub-class of the "pci-device" class, and will have the same PCI bus
> +parent and the same SCSI bus child objects as the LSI controller object
> +it replaces.
> +
> +After the QEMU command line has been parsed, the remote devices will be
> +instantiated in the same manner as local devices are. (i.e.,
> +``qdev_device_add()``). In order to distinguish them from regular
> +*-device* device objects, their class name will be the name of the class
> +it replaces, with "-proxy" appended. e.g., the "lsi53c895a" proxy class
> +will be "lsi53c895a-proxy."
> +
> +device JSON description
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +The remote process needs a JSON representation of the command line
> +options used to create the object. This JSON representation is used to
> +create the corresponding object in the emulation process. e.g., for an
> +LSI SCSI controller invoked as:
> +
> +::
> +
> +     -device lsi53c895a,id=scsi0,remote=lsi-scsi
> +
> +the proxy object would create a
> +
> +::
> +
> +    { "driver" : "lsi53c895a", "id" : "scsi0" }
> +
> +JSON description. The "driver" option is assigned to the device name
> +when the command line is parsed, so the "-proxy" appended by the command
> +line parsing code is removed. The "remote" option isn't needed in the
> +JSON description since it only applies to the proxy object in the QEMU
> +process.
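> +
> +A minimal sketch of building that description with QEMU's QDict helpers;
> +the stripping of the "-proxy" suffix and the transport of the resulting
> +JSON string over the primary socket are left out:
> +
> +::
> +
> +    QDict *dev = qdict_new();
> +
> +    qdict_put_str(dev, "driver", "lsi53c895a");  /* class name minus "-proxy" */
> +    qdict_put_str(dev, "id", "scsi0");
> +    /* serialize the QDict to JSON and send it over the primary socket */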
> +
> +device object whitelist
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Some device objects may not need a proxy. These are devices with no
> +direct guest interfaces. (e.g., no MMIO, PIO, or interrupts). There will
> +be a whitelist of such devices, and any devices on this list will not be
> +instantiated in QEMU. Their JSON representation will still be sent to
> +the remote process, so the object can be created there.
> +
> +object initialization
> +^^^^^^^^^^^^^^^^^^^^^
> +
> +QEMU object initialization occurs in two phases. The first
> +initialization happens once per object class. (i.e., there can be many
> +SCSI disks in an emulated system, but the "scsi-hd" class has its
> +``class_init()`` function called only once). The second phase happens
> +when each object's ``instance_init()`` function is called to initialize
> +each instance of the object.
> +
> +All device objects are sub-classes of the "device" class, so they also
> +have a ``realize()`` function that is called after ``instance_init()``
> +is called and after the object's static properties have been
> +initialized. Many device objects don't even provide an ``instance_init()``
> +function, and do all their per-instance work in ``realize()``.
> +
> +class\_init
> +'''''''''''
> +
> +The ``class_init()`` method of a proxy object will, in general, behave
> +similarly to the object it replaces, including setting any static
> +properties and methods needed by the proxy.
> +
> +instance\_init / realize
> +''''''''''''''''''''''''
> +
> +The ``instance_init()`` and ``realize()`` functions would only need to
> +perform tasks related to being a proxy, such as registering its own
> +MMIO handlers, or creating a child bus that other proxy devices can be
> +attached to later.
> +
> +Other tasks are device-specific. For example, PCI device objects
> +will initialize the PCI config space in order to make a valid PCI device
> +tree within the QEMU process.
> +
> +address space registration
> +^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Most devices are driven by guest device driver accesses to IO addresses
> +or ports. The QEMU device emulation code uses QEMU's memory region
> +function calls (such as ``memory_region_init_io()``) to add callback
> +functions that QEMU will invoke when the guest accesses the device's
> +areas of the IO address space. When a guest driver does access the
> +device, the VM will exit HW virtualization mode and return to QEMU,
> +which will then look up and execute the corresponding callback function.
> +
> +A proxy object would need to mirror the memory region calls the actual
> +device emulator would perform in its initialization code, but with its
> +own callbacks. When invoked by QEMU as a result of a guest IO operation,
> +they will forward the operation to the device emulation process.
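> +
> +A sketch of what that mirroring could look like for a single memory BAR;
> +``proxy_send_mmio()``, the *dev* proxy structure and *bar_size* are
> +placeholders for the forwarding machinery:
> +
> +::
> +
> +    static uint64_t proxy_mmio_read(void *opaque, hwaddr addr, unsigned size)
> +    {
> +        /* forward the load to the emulation process and wait for the reply */
> +        return proxy_send_mmio(opaque, addr, 0, size, false);
> +    }
> +
> +    static void proxy_mmio_write(void *opaque, hwaddr addr, uint64_t val,
> +                                 unsigned size)
> +    {
> +        proxy_send_mmio(opaque, addr, val, size, true);
> +    }
> +
> +    static const MemoryRegionOps proxy_mmio_ops = {
> +        .read = proxy_mmio_read,
> +        .write = proxy_mmio_write,
> +        .endianness = DEVICE_NATIVE_ENDIAN,
> +    };
> +
> +    /* in the proxy's realize(): mirror the BAR the real device registers */
> +    memory_region_init_io(&dev->mmio, OBJECT(dev), &proxy_mmio_ops, dev,
> +                          "proxy-mmio", bar_size);
> +    pci_register_bar(&dev->pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &dev->mmio);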
> +
> +PCI config space
> +^^^^^^^^^^^^^^^^
> +
> +PCI devices also have a configuration space that can be accessed by the
> +guest driver. Guest accesses to this space are not handled by the device
> +emulation object, but by its PCI parent object. Much of this space is
> +read-only, but certain registers (especially BAR and MSI-related ones)
> +need to be propagated to the emulation process.
> +
> +PCI parent proxy
> +''''''''''''''''
> +
> +One way to propagate guest PCI config accesses is to create a
> +"pci-device-proxy" class that can serve as the parent of a PCI device
> +proxy object. This class's parent would be "pci-device" and it would
> +override the PCI parent's ``config_read()`` and ``config_write()``
> +methods with ones that forward these operations to the emulation
> +program.
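> +
> +In QEMU terms, that override could be sketched as below; the forwarding
> +helpers are placeholders:
> +
> +::
> +
> +    static uint32_t proxy_config_read(PCIDevice *d, uint32_t addr, int len)
> +    {
> +        /* forward the config space load to the emulation process */
> +        return proxy_forward_config_read(d, addr, len);
> +    }
> +
> +    static void proxy_config_write(PCIDevice *d, uint32_t addr, uint32_t val,
> +                                   int len)
> +    {
> +        pci_default_write_config(d, addr, val, len); /* keep QEMU's view */
> +        proxy_forward_config_write(d, addr, val, len);
> +    }
> +
> +    static void pci_proxy_dev_class_init(ObjectClass *klass, void *data)
> +    {
> +        PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
> +
> +        k->config_read  = proxy_config_read;
> +        k->config_write = proxy_config_write;
> +    }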
> +
> +interrupt receipt
> +^^^^^^^^^^^^^^^^^
> +
> +A proxy for a device that generates interrupts will need to create a
> +socket to receive interrupt indications from the emulation process. An
> +incoming interrupt indication would then be sent up to its bus parent to
> +be injected into the guest. For example, a PCI device object may use
> +``pci_set_irq()``.
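> +
> +For example, a handler run when the interrupt socket becomes readable
> +might look like this sketch; the *IntrMsg* wire format and the proxy
> +state structure are assumptions:
> +
> +::
> +
> +    static void proxy_intr_read(void *opaque)
> +    {
> +        PCIProxyDev *dev = opaque;      /* hypothetical proxy state */
> +        IntrMsg msg;                    /* hypothetical wire format */
> +
> +        if (recv(dev->intr_sock, &msg, sizeof(msg), 0) == sizeof(msg)) {
> +            /* reflect the indication up the PCI tree */
> +            pci_set_irq(&dev->pdev, msg.level);
> +        }
> +    }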
> +
> +live migration
> +^^^^^^^^^^^^^^
> +
> +The proxy will register to save and restore any *vmstate* it needs over
> +a live migration event. The device proxy does not need to manage the
> +remote device's *vmstate*; that will be handled by the remote process
> +proxy (see below).
> +
> +QEMU remote device operation
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Generic device operations, such as DMA, will be performed by the remote
> +process proxy by sending messages to the remote process.
> +
> +DMA operations
> +^^^^^^^^^^^^^^
> +
> +DMA operations would be handled much like vhost applications do. One of
> +the initial messages sent to the emulation process is a guest memory
> +table. Each entry in this table consists of a file descriptor and size
> +that the emulation process can ``mmap()`` to directly access guest
> +memory, similar to ``vhost_user_set_mem_table()``. Note that guest memory
> +must be backed by file descriptors, such as when QEMU is given the
> +*-mem-path* command line option.
> +
> +IOMMU operations
> +^^^^^^^^^^^^^^^^
> +
> +When the emulated system includes an IOMMU, the remote process proxy in
> +QEMU will need to create a socket for IOMMU requests from the emulation
> +process. It will handle those requests with an
> +``address_space_get_iotlb_entry()`` call. In order to handle IOMMU
> +unmaps, the remote process proxy will also register as a listener on the
> +device's DMA address space. When an IOMMU memory region is created
> +within the DMA address space, an IOMMU notifier for unmaps will be added
> +to the memory region that will forward unmaps to the emulation process
> +over the IOMMU socket.
> +
> +device hot-plug via QMP
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +An QMP "device\_add" command can add a device emulated by a remote
> +process. It needs to add a "remote" option to the command, just as the
> +*-device* command line option does. The remote process may either be one
> +started at QEMU startup, or be one added by the "add-process" QMP
> +command described above. In either case, the remote process proxy will
> +forward the new device's JSON description to the corresponding emulation
> +process.
> +
> +live migration
> +^^^^^^^^^^^^^^
> +
> +The remote process proxy will also register for live migration
> +notifications with ``vmstate_register()``. When called to save state,
> +the proxy will send the remote process a secondary socket file
> +descriptor to save the remote process's device *vmstate* over. The
> +incoming byte stream length and data will be saved as the proxy's
> +*vmstate*. When the proxy is resumed on its new host, this *vmstate*
> +will be extracted, and a secondary socket file descriptor will be sent
> +to the new remote process through which it receives the *vmstate* in
> +order to restore the devices there.
> +
> +device emulation in remote process
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The parts of QEMU that the emulation program will need include the
> +object model; the memory emulation objects; the device emulation objects
> +of the targeted device, and any dependent devices; and, the device's
> +backends. It will also need code to set up the machine environment,
> +handle requests from the QEMU process, and route machine-level requests
> +(such as interrupts or IOMMU mappings) back to the QEMU process.
> +
> +initialization
> +''''''''''''''
> +
> +The process initialization sequence will follow the same sequence
> +followed by QEMU. It will first initialize the backend objects, then
> +device emulation objects. The JSON descriptions sent by the QEMU process
> +will drive which objects need to be created.
> +
> +-  address spaces
> +
> +Before the device objects are created, the initial address spaces and
> +memory regions must be configured with ``memory_map_init()``. This
> +creates a RAM memory region object (*system\_memory*) and an IO memory
> +region object (*system\_io*).
> +
> +-  RAM
> +
> +RAM memory region creation will follow how ``pc_memory_init()`` creates
> +them, but must use ``memory_region_init_ram_from_fd()`` instead of
> +``memory_region_allocate_system_memory()``. The file descriptors needed
> +will be supplied by the guest memory table from above. Those RAM regions
> +would then be added to the *system\_memory* memory region with
> +``memory_region_add_subregion()``.
> +
> +-  PCI
> +
> +IO initialization will be driven by the JSON descriptions sent from the
> +QEMU process. For a PCI device, a PCI bus will need to be created with
> +``pci_root_bus_new()``, and a PCI memory region will need to be created
> +and added to the *system\_memory* memory region with
> +``memory_region_add_subregion_overlap()``. The overlap version is
> +required for architectures where PCI memory overlaps with RAM memory.
> +
> +MMIO handling
> +'''''''''''''
> +
> +The device emulation objects will use ``memory_region_init_io()`` to
> +install their MMIO handlers, and ``pci_register_bar()`` to associate
> +those handlers with a PCI BAR, as they do within QEMU currently.
> +
> +In order to use ``address_space_rw()`` in the emulation process to
> +handle MMIO requests from QEMU, the PCI physical addresses must be the
> +same in the QEMU process and the device emulation process. In order to
> +accomplish that, guest BAR programming must also be forwarded from QEMU
> +to the emulation process.
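> +
> +Handling a forwarded MMIO request in the emulation process could then be
> +sketched as follows; the *MMIOMsg* structure is an assumed wire format:
> +
> +::
> +
> +    static void process_mmio_request(MMIOMsg *msg)
> +    {
> +        address_space_rw(&address_space_memory, msg->addr,
> +                         MEMTXATTRS_UNSPECIFIED, (uint8_t *)&msg->data,
> +                         msg->len, msg->is_write);
> +        /* for a load, msg->data now holds the value to return to QEMU */
> +    }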
> +
> +interrupt injection
> +'''''''''''''''''''
> +
> +When device emulation wants to inject an interrupt into the VM, the
> +request climbs the device's bus object hierarchy until the point where a
> +bus object knows how to signal the interrupt to the guest. The details
> +depend on the type of interrupt being raised.
> +
> +-  PCI pin interrupts
> +
> +On x86 systems, there is an emulated IOAPIC object attached to the root
> +PCI bus object, and the root PCI object forwards interrupt requests to
> +it. The IOAPIC object, in turn, calls the KVM driver to inject the
> +corresponding interrupt into the VM. The simplest way to handle this in
> +an emulation process would be to setup the root PCI bus driver (via
> +``pci_bus_irqs()``) to send an interrupt request back to the QEMU
> +process, and have the device proxy object reflect it up the PCI tree
> +there.
> +
> +-  PCI MSI/X interrupts
> +
> +PCI MSI/X interrupts are implemented in HW as DMA writes to a
> +CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives
> +these DMA writes, then calls into the KVM driver to inject the interrupt
> +into the VM. A simple emulation process implementation would be to send
> +the MSI DMA address from QEMU as a message at initialization, then
> +install an address space handler at that address which forwards the MSI
> +message back to QEMU.
> +
> +DMA operations
> +''''''''''''''
> +
> +When an emulation object wants to DMA into or out of guest memory, it
> +first must use ``dma_memory_map()`` to convert the DMA address to a local
> +virtual address. The emulation process memory region objects setup above
> +will be used to translate the DMA address to a local virtual address the
> +device emulation code can access.
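> +
> +A sketch of a device DMA read inside the emulation process; the
> +*desc_addr* and *desc_len* values are assumed to come from a device
> +descriptor, the exact ``dma_memory_map()`` arguments vary slightly
> +across QEMU versions, and error handling is omitted:
> +
> +::
> +
> +    dma_addr_t len = desc_len;
> +    void *buf = dma_memory_map(&address_space_memory, desc_addr, &len,
> +                               DMA_DIRECTION_TO_DEVICE);
> +
> +    if (buf) {
> +        /* ... the device consumes 'len' bytes from 'buf' ... */
> +        dma_memory_unmap(&address_space_memory, buf, len,
> +                         DMA_DIRECTION_TO_DEVICE, len);
> +    }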
> +
> +IOMMU
> +'''''
> +
> +When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory
> +regions to translate the DMA address to a guest physical address before
> +that physical address can be translated to a local virtual address. The
> +emulation process will need similar functionality.
> +
> +-  IOTLB cache
> +
> +The emulation process will maintain a cache of recent IOMMU translations
> +(the IOTLB). When the ``translate()`` callback of an IOMMU memory region is
> +invoked, the IOTLB cache will be searched for an entry that will map the
> +DMA address to a guest PA. On a cache miss, a message will be sent back
> +to QEMU requesting the corresponding translation entry, which will both
> +be used to return a guest address and be added to the cache.
> +
> +-  IOTLB purge
> +
> +The IOMMU emulation will also need to act on unmap requests from QEMU.
> +These happen when the guest IOMMU driver purges an entry from the
> +guest's translation table.
> +
> +live migration
> +''''''''''''''
> +
> +When a remote process receives a live migration indication from QEMU, it
> +will set up a channel using the received file descriptor with
> +``qio_channel_socket_new_fd()``. This channel will be used to create a
> +*QEMUfile* that can be passed to ``qemu_save_device_state()`` to send
> +the process's device state back to QEMU. This method will be reversed on
> +restore - the channel will be passed to ``qemu_loadvm_state()`` to
> +restore the device state.
> +
> +Accelerating device emulation
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The messages that are required to be sent between QEMU and the emulation
> +process can add considerable latency to IO operations. The optimizations
> +described below attempt to ameliorate this effect by allowing the
> +emulation process to communicate directly with the kernel KVM driver.
> +The KVM file descriptors created would be passed to the emulation process
> +via initialization messages, much like the guest memory table is done.
> +
> +MMIO acceleration
> +^^^^^^^^^^^^^^^^^
> +
> +Vhost user applications can receive guest virtio driver stores directly
> +from KVM. The issue with the eventfd mechanism used by vhost user is
> +that it does not pass any data with the event indication, so it cannot
> +handle guest loads or guest stores that carry store data. This concept
> +could, however, be expanded to cover more cases.
> +
> +The expanded idea would require a new type of KVM device:
> +*KVM\_DEV\_TYPE\_USER*. This device has two file descriptors: a master
> +descriptor that QEMU can use for configuration, and a slave descriptor
> +that the emulation process can use to receive MMIO notifications. QEMU
> +would create both descriptors using the KVM driver, and pass the slave
> +descriptor to the emulation process via an initialization message.
> +
> +data structures
> +'''''''''''''''
> +
> +-  guest physical range
> +
> +The guest physical range structure describes the address range that a
> +device will respond to. It includes the base and length of the range, as
> +well as which bus the range resides on (e.g., on an x86 machine, it can
> +specify whether the range refers to memory or IO addresses).
> +
> +A device can have multiple physical address ranges it responds to (e.g.,
> +a PCI device can have multiple BARs), so the structure will also include
> +an enumerated identifier to specify which of the device's ranges is
> +being referred to.
> +
> ++--------+----------------------------+
> +| Name   | Description                |
> ++========+============================+
> +| addr   | range base address         |
> ++--------+----------------------------+
> +| len    | range length               |
> ++--------+----------------------------+
> +| bus    | addr type (memory or IO)   |
> ++--------+----------------------------+
> +| id     | range ID (e.g., PCI BAR)   |
> ++--------+----------------------------+
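> +
> +Expressed as a C structure, the range descriptor could look like the
> +sketch below; the field names and widths are illustrative, not an
> +existing kernel ABI:
> +
> +::
> +
> +    struct kvm_user_pa_range {
> +        __u64 addr;   /* range base address */
> +        __u64 len;    /* range length */
> +        __u32 bus;    /* address type: memory or IO */
> +        __u32 id;     /* range ID, e.g. the PCI BAR number */
> +    };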
> +
> +-  MMIO request structure
> +
> +This structure describes an MMIO operation. It includes which guest
> +physical range the MMIO was within, the offset within that range, the
> +MMIO type (e.g., load or store), and its length and data. It also
> +includes a sequence number that can be used to reply to the MMIO, and
> +the CPU that issued the MMIO.
> +
> ++----------+------------------------+
> +| Name     | Description            |
> ++==========+========================+
> +| rid      | range MMIO is within   |
> ++----------+------------------------+
> +| offset   | offset within *rid*    |
> ++----------+------------------------+
> +| type     | e.g., load or store    |
> ++----------+------------------------+
> +| len      | MMIO length            |
> ++----------+------------------------+
> +| data     | store data             |
> ++----------+------------------------+
> +| seq      | sequence ID            |
> ++----------+------------------------+
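> +
> +Again as an illustrative C structure (an assumption, not an existing
> +ABI):
> +
> +::
> +
> +    struct kvm_user_mmio {
> +        __u32 rid;      /* which registered range the MMIO hit */
> +        __u32 type;     /* load or store */
> +        __u64 offset;   /* offset within the range */
> +        __u32 len;      /* access length in bytes */
> +        __u32 seq;      /* sequence ID used to match the reply */
> +        __u64 data;     /* store data, or load result in the reply */
> +    };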
> +
> +-  MMIO request queues
> +
> +MMIO request queues are FIFO arrays of MMIO request structures. There
> +are two queues: the pending queue is for MMIOs that haven't been read by the
> +emulation program, and the sent queue is for MMIOs that haven't been
> +acknowledged. The main use of the second queue is to validate MMIO
> +replies from the emulation program.
> +
> +-  scoreboard
> +
> +Each CPU in the VM is emulated in QEMU by a separate thread, so multiple
> +MMIOs may be waiting to be consumed by an emulation program and multiple
> +threads may be waiting for MMIO replies. The scoreboard would contain a
> +wait queue and sequence number for the per-CPU threads, allowing them to
> +be individually woken when the MMIO reply is received from the emulation
> +program. It also tracks the number of posted MMIO stores to the device
> +that haven't been replied to, in order to satisfy the PCI constraint
> +that a load to a device will not complete until all previous stores to
> +that device have been completed.
> +
> +-  device shadow memory
> +
> +Some MMIO loads do not have device side-effects. These MMIOs can be
> +completed without sending a MMIO request to the emulation program if the
> +emulation program shares a shadow image of the device's memory image
> +with the KVM driver.
> +
> +The emulation program will ask the KVM driver to allocate memory for the
> +shadow image, and will then use ``mmap()`` to directly access it. The
> +emulation program can control KVM access to the shadow image by sending
> +KVM an access map telling it which areas of the image have no
> +side-effects (and can be completed immediately), and which require a
> +MMIO request to the emulation program. The access map can also inform
> +the KVM driver which size accesses are allowed to the image.
> +
> +master descriptor
> +'''''''''''''''''
> +
> +The master descriptor is used by QEMU to configure the new KVM device.
> +The descriptor would be returned by the KVM driver when QEMU issues a
> +*KVM\_CREATE\_DEVICE* ``ioctl()`` with a *KVM\_DEV\_TYPE\_USER* type.
> +
> +KVM\_DEV\_TYPE\_USER device ops
> +""""""""""""""""""""""""""""""""
> +
> +The *KVM\_DEV\_TYPE\_USER* operations vector will be registered by a
> +``kvm_register_device_ops()`` call when the KVM system is initialized by
> +``kvm_init()``. These device ops are called by the KVM driver when QEMU
> +executes certain ``ioctl()`` operations on its KVM file descriptor. They
> +include:
> +
> +-  create
> +
> +This routine is called when QEMU issues a *KVM\_CREATE\_DEVICE*
> +``ioctl()`` on its per-VM file descriptor. It will allocate and
> +initialize a KVM user device specific data structure, and assign the
> +*kvm\_device* private field to it.
> +
> +-  ioctl
> +
> +This routine is invoked when QEMU issues an ``ioctl()`` on the master
> +descriptor. The ``ioctl()`` commands supported are defined by the KVM
> +device type. *KVM\_DEV\_TYPE\_USER* ones will need several commands:
> +
> +*KVM\_DEV\_USER\_SLAVE\_FD* creates the slave file descriptor that will
> +be passed to the device emulation program. Only one slave can be created
> +by each master descriptor. The file operations performed by this
> +descriptor are described below.
> +
> +The *KVM\_DEV\_USER\_PA\_RANGE* command configures a guest physical
> +address range that the slave descriptor will receive MMIO notifications
> +for. The range is specified by a guest physical range structure
> +argument. For buses that assign addresses to devices dynamically, this
> +command can be executed while the guest is running, such as the case
> +when a guest changes a device's PCI BAR registers.
> +
> +*KVM\_DEV\_USER\_PA\_RANGE* will use ``kvm_io_bus_register_dev()`` to
> +register *kvm\_io\_device\_ops* callbacks to be invoked when the guest
> +performs a MMIO operation within the range. When a range is changed,
> +``kvm_io_bus_unregister_dev()`` is used to remove the previous
> +instantiation.
> +
> +*KVM\_DEV\_USER\_TIMEOUT* will configure a timeout value that specifies
> +how long KVM will wait for the emulation process to respond to a MMIO
> +indication.
> +
> +-  destroy
> +
> +This routine is called when the VM instance is destroyed. It will need
> +to destroy the slave descriptor; and free any memory allocated by the
> +driver, as well as the *kvm\_device* structure itself.
> +
> +slave descriptor
> +''''''''''''''''
> +
> +The slave descriptor will have its own file operations vector, which
> +responds to system calls on the descriptor performed by the device
> +emulation program.
> +
> +-  read
> +
> +A read returns any pending MMIO requests from the KVM driver as MMIO
> +request structures. Multiple structures can be returned if there are
> +multiple MMIO operations pending. The MMIO requests are moved from the
> +pending queue to the sent queue, and if there are threads waiting for
> +space in the pending queue to add new MMIO operations, they will be woken
> +here.
> +
> +-  write
> +
> +A write also consists of a set of MMIO requests. They are compared to
> +the MMIO requests in the sent queue. Matches are removed from the sent
> +queue, and any threads waiting for the reply are woken. If a store is
> +removed, then the number of posted stores in the per-CPU scoreboard is
> +decremented. When the number is zero, and a load with no side-effects was
> +waiting for posted stores to complete, the load is continued.
> +
> +-  ioctl
> +
> +There are several ioctl()s that can be performed on the slave
> +descriptor.
> +
> +A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to
> +allocate memory for the shadow image. This memory can later be
> +``mmap()``\ ed by the emulation process to share the emulation's view of
> +device memory with the KVM driver.
> +
> +A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the
> +shadow image. It will send the KVM driver a shadow control map, which
> +specifies which areas of the image can complete guest loads without
> +sending the load request to the emulation program. It will also specify
> +the size of load operations that are allowed.
> +
> +-  poll
> +
> +An emulation program will use the ``poll()`` call with a *POLLIN* flag
> +to determine if there are MMIO requests waiting to be read. It will
> +return when the pending MMIO request queue is not empty.
> +
> +-  mmap
> +
> +This call allows the emulation program to directly access the shadow
> +image allocated by the KVM driver. As device emulation updates device
> +memory, changes with no side-effects will be reflected in the shadow,
> +and the KVM driver can satisfy guest loads from the shadow image without
> +needing to wait for the emulation program.
> +
> +kvm\_io\_device ops
> +'''''''''''''''''''
> +
> +Each KVM per-CPU thread can handle MMIO operations on behalf of the guest
> +VM. KVM will use the MMIO's guest physical address to search for a
> +matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM
> +driver instead of exiting back to QEMU. If a match is found, the
> +corresponding callback will be invoked.
> +
> +-  read
> +
> +This callback is invoked when the guest performs a load to the device.
> +Loads with side-effects must be handled synchronously, with the KVM
> +driver putting the QEMU thread to sleep waiting for the emulation
> +process reply before re-starting the guest. Loads that do not have
> +side-effects may be optimized by satisfying them from the shadow image,
> +if there are no outstanding stores to the device by this CPU. PCI memory
> +ordering demands that a load cannot complete before all older stores to
> +the same device have been completed.
> +
> +-  write
> +
> +Stores can be handled asynchronously unless the pending MMIO request
> +queue is full. In this case, the QEMU thread must sleep waiting for
> +space in the queue. Stores will increment the number of posted stores in
> +the per-CPU scoreboard, in order to implement the PCI ordering
> +constraint above.
> +
> +interrupt acceleration
> +^^^^^^^^^^^^^^^^^^^^^^
> +
> +This performance optimization would work much like a vhost user
> +application does, where the QEMU process sets up *eventfds* that cause
> +the device's corresponding interrupt to be triggered by the KVM driver.
> +These irq file descriptors are sent to the emulation process at
> +initialization, and are used when the emulation code raises a device
> +interrupt.
> +
> +intx acceleration
> +'''''''''''''''''
> +
> +Traditional PCI pin interrupts are level based, so, in addition to an
> +irq file descriptor, a re-sampling file descriptor needs to be sent to
> +the emulation program. This second file descriptor allows multiple
> +devices sharing an irq to be notified when the interrupt has been
> +acknowledged by the guest, so they can re-trigger the interrupt if their
> +device has not de-asserted its interrupt.
> +
> +intx irq descriptor
> +"""""""""""""""""""
> +
> +The irq descriptors are created by the proxy object using
> +``event_notifier_init()`` to create the irq and re-sampling
> +*eventfds*, and ``kvm_vm_ioctl(KVM_IRQFD)`` to bind them to an interrupt.
> +The interrupt route can be found with
> +``pci_device_route_intx_to_irq()``.
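> +
> +A rough sketch of that setup, assuming a hypothetical proxy state
> +structure *dev* with the PCI device embedded as *pdev*:
> +
> +::
> +
> +    EventNotifier intr, resample;
> +    PCIINTxRoute route;
> +    struct kvm_irqfd irqfd = { .flags = KVM_IRQFD_FLAG_RESAMPLE };
> +
> +    event_notifier_init(&intr, 0);
> +    event_notifier_init(&resample, 0);
> +
> +    route = pci_device_route_intx_to_irq(&dev->pdev, pci_intx(&dev->pdev));
> +
> +    irqfd.fd = event_notifier_get_fd(&intr);
> +    irqfd.resamplefd = event_notifier_get_fd(&resample);
> +    irqfd.gsi = route.irq;
> +    kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
> +
> +    /* both eventfds are then sent to the emulation process */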
> +
> +intx routing changes
> +""""""""""""""""""""
> +
> +Intx routing can be changed when the guest programs the APIC the device
> +pin is connected to. The proxy object in QEMU will use
> +``pci_device_set_intx_routing_notifier()`` to be informed of any guest
> +changes to the route. This handler will broadly follow the VFIO
> +interrupt logic to change the route: de-assigning the existing irq
> +descriptor from its route, then assigning it the new route. (see
> +``vfio_intx_update()``)
> +
> +MSI/X acceleration
> +''''''''''''''''''
> +
> +MSI/X interrupts are sent as DMA transactions to the host. The interrupt
> +data contains a vector that is programmed by the guest. A device may have
> +multiple MSI interrupts associated with it, so multiple irq descriptors
> +may need to be sent to the emulation program.
> +
> +MSI/X irq descriptor
> +""""""""""""""""""""
> +
> +This case will also follow the VFIO example. For each MSI/X interrupt,
> +an *eventfd* is created, a virtual interrupt is allocated by
> +``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to
> +the eventfd with ``kvm_irqchip_add_irqfd_notifier()``.
> +
> +MSI/X config space changes
> +""""""""""""""""""""""""""
> +
> +The guest may dynamically update several MSI-related tables in the
> +device's PCI config space. These include per-MSI interrupt enables and
> +vector data. Additionally, MSI/X tables exist in device memory space, not
> +config space. Much like the BAR case above, the proxy object must look
> +at guest config space programming to keep the MSI interrupt state
> +consistent between QEMU and the emulation program.
> +
> +--------------
> +
> +Disaggregated CPU emulation
> +---------------------------
> +
> +After IO services have been disaggregated, a second phase would be to
> +separate a process to handle CPU instruction emulation from the main
> +QEMU control function. There are no object separation points for this
> +code, so the first task would be to create one.
> +
> +Host access controls
> +--------------------
> +
> +Separating QEMU relies on the host OS's access restriction mechanisms to
> +enforce that the differing processes can only access the objects they
> +are entitled to. There are a couple of types of mechanisms usually
> +provided by general-purpose OSs.
> +
> +Discretionary access control
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Discretionary access control allows each user to control who can access
> +their files. In Linux, this type of control is usually too coarse for
> +QEMU separation, since it only provides three separate access controls:
> +one for the same user ID, the second for user IDs with the same group
> +ID, and the third for all other user IDs. Each device instance would
> +need a separate user ID to provide access control, which is likely to be
> +unwieldy for dynamically created VMs.
> +
> +Mandatory access control
> +~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Mandatory access control allows the OS to add an additional set of
> +controls, on top of discretionary access, that the OS controls. It also
> +adds other attributes to processes and files such as types, roles, and
> +categories, and can establish rules for how processes and files can
> +interact.
> +
> +Type enforcement
> +^^^^^^^^^^^^^^^^
> +
> +Type enforcement assigns a *type* attribute to processes and files, and
> +allows rules to be written on what operations a process with a given
> +type can perform on a file with a given type. QEMU separation could take
> +advantage of type enforcement by running the emulation processes with
> +different types, both from the main QEMU process, and from the emulation
> +processes of different classes of devices.
> +
> +For example, guest disk images and disk emulation processes could have
> +types separate from the main QEMU process and non-disk emulation
> +processes, and the type rules could prevent processes other than disk
> +emulation ones from accessing guest disk images. Similarly, network
> +emulation processes can have a type separate from the main QEMU process
> +and non-network emulation processes, and only that type can access the
> +host tun/tap device used to provide guest networking.
> +
> +Category enforcement
> +^^^^^^^^^^^^^^^^^^^^
> +
> +Category enforcement assigns a set of numbers within a given range to
> +the process or file. The process is granted access to the file if the
> +process's set is a superset of the file's set. This enforcement can be
> +used to separate multiple instances of devices in the same class.
> +
> +For example, if there are multiple disk devices provided to a guest,
> +each device emulation process could be provisioned with a separate
> +category. The different device emulation processes would not be able to
> +access each other's backing disk images.
> +
> +Alternatively, categories could be used in lieu of the type enforcement
> +scheme described above. In this scenario, different categories would be
> +used to prevent device emulation processes in different classes from
> +accessing resources assigned to other classes.
> -- 
> 1.8.3.1
> 


