
From: Stefan Hajnoczi
Subject: Re: Open qcow2 on multiple hosts simultaneously.
Date: Wed, 21 Jun 2023 11:15:20 +0200

On Tue, Jun 20, 2023 at 04:32:13PM -0400, Vivek Goyal wrote:
> On Mon, Jun 19, 2023 at 07:20:34PM +0200, kvaps wrote:
> > Hi Kevin and the community,
> 
> [ CC Alberto, Alice, Stefan ]
> > 
> > I am designing a CSI driver for Kubernetes that allows efficient
> > utilization of a SAN (Storage Area Network) and supports thin
> > provisioning, snapshots, and ReadWriteMany mode for block devices.
> 
> Hi Andrei,
> 
> Good to hear that. Alberto has also been working on a CSI driver
> which makes use of qemu-storage-daemon and qcow2 files, either with
> local storage or shared storage like NFS. At this point it focuses
> on filesystem backends, as that's where it is easiest to manage
> qcow2 files. But I think that could be extended to support block
> device backends (e.g. LVM) too.
> 
> https://gitlab.com/subprovisioner/subprovisioner
> 
> This is still a work in progress, but I think there might be some
> overlap between your work and the subprovisioner project.
> 
> > 
> > To implement this, I have explored several technologies such as
> > traditional LVM, LVMThin (which does not support shared mode), and
> > QCOW2 on top of block devices. This is the same approach that oVirt
> > uses for thin provisioning over a shared LUN:
> > 
> > https://github.com/oVirt/vdsm/blob/08a656c/doc/thin-provisioning.md
> > 
> > Based on the benchmark results, I found that the performance degradation
> > of block-backed QCOW2 is much lower than that of LVM and LVMThin when
> > creating snapshots.
> > 
> > https://docs.google.com/spreadsheets/d/1mppSKhEevGl5ntBhZT3ccU5t07LwxXjQz1HM2uvBIuo/edit#gid=2020746352
> > 
> > Therefore, I have decided to use the same approach for Kubernetes.
> 
> Hmm..., I will need to spend more time going through the numbers and setup.
> This result is a little surprising to me, though. If you are using
> vduse, nbd, ublk kind of exports, that means all I/O will go to the kernel
> first, then to userspace (qsd), and then back into the kernel. But with a
> pure LVM-based approach, the I/O path is much shorter (user space to
> kernel). Given that, it's a little surprising that qcow2 is still
> faster than LVM.
> 
> If you somehow managed to use a vhost-user-blk export instead, then the I/O
> path would be shorter for qcow2 as well, and that might perform well.
> 
> > 
> > But in Kubernetes, the storage system needs to be self-sufficient and
> > not dependent on the workload that uses it. Thus, unlike oVirt, we have
> > no option to use the libvirt interface of the running VM to invoke
> > live migration. Instead, we should provide a pure block device in
> > ReadWriteMany mode, where the block device can be writable on multiple
> > hosts simultaneously.
> > 
> > To achieve this, I decided to use the qemu-storage-daemon with the
> > VDUSE backend.
> > 
> > Other technologies, such as NBD and UBLK, were also considered, and
> > their benchmark results can be seen in the same document on a
> > different sheet:
> > 
> > https://docs.google.com/spreadsheets/d/1mppSKhEevGl5ntBhZT3ccU5t07LwxXjQz1HM2uvBIuo/edit#gid=416958126
> > 
> > Taking into account the performance, stability, and versatility, I
> > concluded that VDUSE is the optimal choice. To connect the device in
> > Kubernetes, the virtio-vdpa interface would be used, and the entire
> > scheme could look like this:
> 
> NBD will be slow. I am curious how UBLK and VDUSE block exports
> compare. Technically there does not seem to be any reason why a VDUSE
> virtio-vdpa device would be faster than ublk. But I could
> be wrong.
> 
> What about the vhost-user-blk export? Have you considered that? That
> would probably be the fastest.
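> 
> In case it is useful, a minimal sketch of what that export option could
> look like (the node name and socket path below are placeholders, not
> something I have tested against your setup):
> 
> # Hypothetical vhost-user-blk export option for qemu-storage-daemon.
> # Unlike vduse/nbd/ublk, a vhost-user frontend (typically a VM) has to
> # connect to the socket, so this does not show up as a host block
> # device by itself.
> VHOST_USER_BLK_EXPORT = (
>     "--export",
>     "type=vhost-user-blk,id=vub0,node-name=disk0,"
>     "addr.type=unix,addr.path=/run/qsd/vub0.sock,writable=on",
> )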
> 
> > 
> > 
> > +---------------------+  +---------------------+
> > | node1               |  | node2               |
> > |                     |  |                     |
> > |    +-----------+    |  |    +-----------+    |
> > |    | /dev/vda  |    |  |    | /dev/vda  |    |
> > |    +-----+-----+    |  |    +-----+-----+    |
> > |          |          |  |          |          |
> > |     virtio-vdpa     |  |     virtio-vdpa     |
> > |          |          |  |          |          |
> > |        vduse        |  |        vduse        |
> > |          |          |  |          |          |
> > | qemu-storage-daemon |  | qemu-storage-daemon |
> > |          |          |  |          |          |
> > | +------- | -------+ |  | +------- | -------+ |
> > | | LUN    |        | |  | | LUN    |        | |
> > | |  +-----+-----+  | |  | |  +-----+-----+  | |
> > | |  | LV (qcow2)|  | |  | |  | LV (qcow2)|  | |
> > | |  +-----------+  | |  | |  +-----------+  | |
> > | +--------+--------+ |  | +--------+--------+ |
> > |          |          |  |          |          |
> > |          |          |  |          |          |
> > +--------- | ---------+  +--------- | ---------+
> >            |                        |
> >            |         +-----+        |
> >            +---------| SAN |--------+
> >                      +-----+
> > 
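> > For illustration, this is roughly how each node launches
> > qemu-storage-daemon in this scheme (a sketch only; the device path,
> > node names, and VDUSE device name are placeholders):
> > 
> > import subprocess
> > 
> > # Open the LV carved out of the shared LUN, put the qcow2 format
> > # layer on top of it, and expose the qcow2 node as a VDUSE device;
> > # after binding it with the vdpa tool it appears via virtio-vdpa as
> > # a /dev/vdX block device on the node.
> > subprocess.run([
> >     "qemu-storage-daemon",
> >     "--blockdev", "driver=host_device,node-name=lun0,"
> >                   "filename=/dev/vg0/pvc-1,cache.direct=on",
> >     "--blockdev", "driver=qcow2,node-name=disk0,file=lun0",
> >     "--export", "type=vduse-blk,id=vduse0,node-name=disk0,"
> >                 "name=pvc-1,num-queues=4,writable=on",
> > ], check=True)  # runs in the foreground until the daemon is stopped
> > 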
> > Despite two independent instances of qemu-storage-daemon running
> > successfully on different hosts for the same qcow2 disk, I have concerns
> > about their proper functioning. Similar to live migration, I think
> > they should share state with each other.
> 
> Is it the same LV on both nodes? How are you activating the same LV on
> two nodes? IIUC, LVM does not allow that.
> 
> > 
> > The question is how to make qemu-storage-daemon share state
> > between multiple nodes, or is the qcow2 format inherently stateless and
> > does not require this?
> 
> That's a good question. For simplicity, we could think of NFS-backed
> storage with a qcow2 file providing the storage. Can two QSD instances
> actually work with the same qcow2 file?
> 
> I am not sure this can be made to work with writable storage. Read-only
> storage, probably yes.
> 
> For example, even if QSD could handle that, there will be some
> local filesystem visible to the client on this block device (say
> ext4/xfs/btrfs). These are built for a single user and do not expect
> any other client to be changing the blocks at the same time.
> 
> So I am not sure how one can export ReadWriteMany volumes using qcow2,
> or LVM for that matter. We probably need a shared filesystem for that
> (NFS, GFS, etc.).
> 
> Am I missing something?

QEMU's qcow2 implementation only supports one writer. Live migration is
a slight exception: it takes care that only one host writes at any given
time and also drops caches to avoid stale metadata during migration.

QEMU's "raw" block driver has offset/size parameters, so a ReadWriteMany
LUN can be divided into smaller block devices and each host can run its
own qcow2 inside. The problem with this approach is fragmentation, and
ultimately there must be a component in the system that is aware of
where each host has allocated its slice of the LUN.
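
As a rough sketch of what I mean (all offsets, sizes, paths, and the
export below are made up), each host would open only its own slice of
the LUN with something like this:

import subprocess

# Sketch: open a 100 GiB slice of the shared LUN starting at 1 TiB and
# run a private qcow2 image inside it. The raw driver's offset/size
# options do the slicing; nothing here coordinates which host owns
# which slice.
GiB = 1024 ** 3
slice_offset = 1024 * GiB   # start of this host's slice (made up)
slice_size = 100 * GiB      # length of this host's slice (made up)

subprocess.run([
    "qemu-storage-daemon",
    "--blockdev", "driver=host_device,node-name=lun,"
                  "filename=/dev/mapper/shared-lun,cache.direct=on",
    "--blockdev", f"driver=raw,node-name=slice,file=lun,"
                  f"offset={slice_offset},size={slice_size}",
    "--blockdev", "driver=qcow2,node-name=disk,file=slice",
    "--export", "type=vduse-blk,id=vduse0,node-name=disk,"
                "name=pvc-1,num-queues=4,writable=on",
], check=True)  # runs in the foreground until the daemon is stopped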

So there is definitely something missing if you want to coordinate
between hosts sharing the same ReadWriteMany LUN. I'm inclined to
investigate existing solutions like pNFS or cLVM instead of reinventing
this for QEMU, because I suspect it seems simple at first but actually
involves a lot of work.

If you implement something from scratch, then it might be possible to
take advantage of Kubernetes. For example, a Custom Resource Definition
(CRD) could describe the extents allocated from a shared LUN. Space
allocation could be persisted at the k8s cluster level via this CRD so
that you don't need to reimplement clustering yourself. The CSI plugin
would construct qemu-storage-daemon command lines from the extent
information. That way,
each node in the cluster can run qemu-storage-daemon on the
ReadWriteMany LUN and coordinate which extents are allocated for a given
slice.
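
To make the idea a bit more concrete, here is a rough sketch (the extent
record, field names, and values are hypothetical, not an existing CRD or
API):

# Hypothetical extent record as it might be stored in a CRD: which byte
# range of the shared LUN belongs to which volume. Allocation and
# conflict resolution would happen through the k8s API server, not in
# this sketch.
extent = {
    "lun": "/dev/mapper/shared-lun",
    "volume": "pvc-1",
    "offset": 1099511627776,   # byte offset of the slice in the LUN
    "size": 107374182400,      # length of the slice in bytes
}

def qsd_args(extent, vduse_name):
    """Build qemu-storage-daemon arguments for one allocated extent."""
    return [
        "qemu-storage-daemon",
        "--blockdev", "driver=host_device,node-name=lun,"
                      f"filename={extent['lun']},cache.direct=on",
        "--blockdev", "driver=raw,node-name=slice,file=lun,"
                      f"offset={extent['offset']},size={extent['size']}",
        "--blockdev", "driver=qcow2,node-name=disk,file=slice",
        "--export", "type=vduse-blk,id=vduse0,node-name=disk,"
                    f"name={vduse_name},num-queues=4,writable=on",
    ]

print(" ".join(qsd_args(extent, "pvc-1")))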

Anyway, this is very interesting work and close to what Alberto Faria
has been exploring with the subprovisioner CSI plugin.

Stefan
