Re: Open qcow2 on multiple hosts simultaneously.

From: Vivek Goyal
Subject: Re: Open qcow2 on multiple hosts simultaneously.
Date: Tue, 20 Jun 2023 16:32:13 -0400

On Mon, Jun 19, 2023 at 07:20:34PM +0200, kvaps wrote:
> Hi Kevin and the community,

[ CC Alberto, Alice, Stefan ]
> 
> I am designing a CSI driver for Kubernetes that allows efficient
> utilization of SAN (Storage Area Network) and supports thin
> provisioning, snapshots, and ReadWriteMany mode for block devices.

Hi Andrei,

Good to hear that. Alberto has also been working on a CSI driver
which makes use of qemu-storage-daemon and qcow2 files, either on
local storage or on shared storage like NFS. At this point it
focuses on filesystem backends, as that's where it is easiest to
manage qcow2 files, but I think it could be extended to support
block device backends (e.g. LVM) too.

https://gitlab.com/subprovisioner/subprovisioner

This is still a work in progress, but I think there might be some
overlap between your work and the subprovisioner project.
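
Just to illustrate the difference between the two backend types
(purely a sketch, paths and names are made up): with a filesystem
backend the qcow2 image is a plain file, while with a block device
backend the qcow2 format goes directly onto the LV, similar to what
oVirt does:

  # filesystem backend (local dir or NFS mount)
  qemu-img create -f qcow2 /mnt/nfs/volumes/pvc-1234.qcow2 10G

  # block device backend: qcow2 formatted directly onto an LV
  lvcreate -L 10G -n pvc-1234 vg0
  qemu-img create -f qcow2 /dev/vg0/pvc-1234 10G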

> 
> To implement this, I have explored several technologies such as
> traditional LVM, LVMThin (which does not support shared mode), and
> QCOW2 on top of block devices. This is the same approach that oVirt
> uses for thin provisioning over a shared LUN:
> 
> https://github.com/oVirt/vdsm/blob/08a656c/doc/thin-provisioning.md
> 
> Based on benchmark results, I found that the performance degradation
> of block-backed QCOW2 is much lower compared to LVM and LVMThin while
> creating snapshots.
> 
> https://docs.google.com/spreadsheets/d/1mppSKhEevGl5ntBhZT3ccU5t07LwxXjQz1HM2uvBIuo/edit#gid=2020746352
> 
> Therefore, I have decided to use the same approach for Kubernetes.

Hmm..., I will need to spend more time going through the numbers and
setup. This result is a little surprising to me, though. If you are
using vduse, nbd, or ublk style exports, all I/O will go to the kernel
first, then to userspace (qsd), and then back into the kernel. With a
pure LVM-based approach, the I/O path is much shorter (user space to
kernel). Given that, it's a little surprising that qcow2 is still
faster compared to LVM.

If you somehow managed to use a vhost-user-blk export instead, the I/O
path would be shorter for qcow2 as well, and that might perform better.
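
For reference, a vhost-user-blk export from qemu-storage-daemon would
look roughly like this (just a sketch; the node names, socket path and
image path are placeholders):

  qemu-storage-daemon \
    --blockdev driver=file,node-name=file0,filename=/dev/vg0/vol1 \
    --blockdev driver=qcow2,node-name=fmt0,file=file0 \
    --export type=vhost-user-blk,id=exp0,node-name=fmt0,writable=on,addr.type=unix,addr.path=/run/qsd/vhost-user-blk.sock

The consumer then has to speak vhost-user over that socket (e.g. a VM
with a vhost-user-blk device), which is why the I/O path stays in
userspace.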

> 
> But in Kubernetes, the storage system needs to be self-sufficient and
> not dependent on the workload that uses it. Thus, unlike oVirt, we
> have no option to use the libvirt interface of the running VM to
> invoke live migration. Instead, we should provide a pure block device
> in ReadWriteMany mode, where the block device can be writable on
> multiple hosts simultaneously.
> 
> To achieve this, I decided to use the qemu-storage-daemon with the
> VDUSE backend.
> 
> Other technologies, such as NBD and UBLK, were also considered, and
> their benchmark results can be seen in the same document on a
> different sheet:
> 
> https://docs.google.com/spreadsheets/d/1mppSKhEevGl5ntBhZT3ccU5t07LwxXjQz1HM2uvBIuo/edit#gid=416958126
> 
> Taking into account the performance, stability, and versatility, I
> concluded that VDUSE is the optimal choice. To connect the device in
> Kubernetes, the virtio-vdpa interface would be used, and the entire
> scheme could look like this:

NBD will be slow. I am curious to know how UBLK and VDUSE
compare. Technically there does not seem to be any reason why a VDUSE
virtio-vdpa device would be faster than ublk, but I could be wrong.
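
For what it's worth, the VDUSE path in your diagram would be set up
roughly like this (a sketch; the export/device names and the LV path
are placeholders):

  qemu-storage-daemon \
    --blockdev driver=file,node-name=file0,filename=/dev/vg0/vol1 \
    --blockdev driver=qcow2,node-name=fmt0,file=file0 \
    --export type=vduse-blk,id=exp0,node-name=fmt0,name=vduse0,num-queues=4,writable=on

  # attach the VDUSE device to the vdpa bus and bind it to virtio-vdpa
  modprobe virtio-vdpa
  vdpa dev add name vduse0 mgmtdev vduse
  # the volume should then show up as a virtio-blk disk (e.g. /dev/vda) on the node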

What about a vhost-user-blk export? Have you considered that? That
would probably be the fastest.

> 
> 
> +---------------------+  +---------------------+
> | node1               |  | node2               |
> |                     |  |                     |
> |    +-----------+    |  |    +-----------+    |
> |    | /dev/vda  |    |  |    | /dev/vda  |    |
> |    +-----+-----+    |  |    +-----+-----+    |
> |          |          |  |          |          |
> |     virtio-vdpa     |  |     virtio-vdpa     |
> |          |          |  |          |          |
> |        vduse        |  |        vduse        |
> |          |          |  |          |          |
> | qemu-storage-daemon |  | qemu-storage-daemon |
> |          |          |  |          |          |
> | +------- | -------+ |  | +------- | -------+ |
> | | LUN    |        | |  | | LUN    |        | |
> | |  +-----+-----+  | |  | |  +-----+-----+  | |
> | |  | LV (qcow2)|  | |  | |  | LV (qcow2)|  | |
> | |  +-----------+  | |  | |  +-----------+  | |
> | +--------+--------+ |  | +--------+--------+ |
> |          |          |  |          |          |
> |          |          |  |          |          |
> +--------- | ---------+  +--------- | ---------+
>            |                        |
>            |         +-----+        |
>            +---------| SAN |--------+
>                      +-----+
> 
> Although two independent instances of qemu-storage-daemon for the
> same qcow2 disk run successfully on different hosts, I have concerns
> about their proper functioning. Similar to live migration, I think
> they should share state with each other.

Is it the same LV on both nodes? How are you activating the same LV
on two nodes? IIUC, LVM does not allow that.
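
Unless you are using lvmlockd with a shared VG; as far as I remember,
shared activation looks roughly like this (a sketch from memory,
vg0/vol1 are placeholders):

  # VG created with: vgcreate --shared vg0 /dev/mapper/san-lun
  vgchange --lockstart vg0          # start the lock manager for the VG
  lvchange --activate sy vg0/vol1   # activate the LV in shared mode on each node

But that only arbitrates LV activation and LVM metadata; it does not
make the qcow2 data itself safe for concurrent writers.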

> 
> The question is how to make qemu-storage-daemon share state between
> multiple nodes, or is the qcow2 format inherently stateless, so it
> does not require this?

That's a good question. For simplicity, we could think of NFS-backed
storage with a qcow2 file providing the volume. Can two QSD instances
actually work with the same qcow2 file?

I am not sure this can be made to work with writable storage. Read-only
storage, probably yes.
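
FWIW, for file-backed images QEMU's image locking should refuse a
second writable open of the same qcow2 anyway (and I don't think those
locks help across hosts on a shared LUN). A read-only second opener
would look roughly like this (sketch, names are placeholders):

  qemu-storage-daemon \
    --blockdev driver=file,node-name=file0,filename=/mnt/nfs/vol1.qcow2,read-only=on \
    --blockdev driver=qcow2,node-name=fmt0,file=file0,read-only=on \
    --export type=vduse-blk,id=exp0,node-name=fmt0,name=vduse0,writable=off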

For example, even if QSD could handle that, there will be some local
filesystem visible to the client on this block device (say
ext4/xfs/btrfs). These are built for a single user and don't expect
any other client to be changing blocks underneath them at the same
time.

So I am not sure how one can export ReadWriteMany volumes using qcow2,
or LVM for that matter. We probably need a shared filesystem for that
(NFS, GFS, etc.).

Am I missing something?

Thanks
Vivek



