qemu-block

Re: RBD images and exclusive locking


From: Peter Lieven
Subject: Re: RBD images and exclusive locking
Date: Mon, 28 Mar 2022 12:36:25 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.7.0

On 25.03.22 at 18:10, Ilya Dryomov wrote:
On Fri, Mar 25, 2022 at 5:30 PM Peter Lieven <pl@kamp.de> wrote:


On 25.03.2022 at 16:29, Ilya Dryomov <idryomov@gmail.com> wrote:

On Fri, Mar 25, 2022 at 12:35 PM Peter Lieven <pl@kamp.de> wrote:
On 24.03.22 at 15:53, Peter Lieven wrote:
On 24.03.22 at 12:51, Peter Lieven wrote:
Hi,


Following the thread on the ceph ML about rbd and exclusive locks, I was wondering whether we should rewrite the rbd driver to try to acquire an exclusive lock on open, and not magically on the first write. That way we could terminate qemu directly if the rbd image is already in use, as has been done for ordinary image files on a filesystem for some time now.

Digging a bit into the code and testing, it seems that the exclusive-lock feature on an rbd image is not that exclusive. We can easily start several qemu instances accessing the same image. Watching the locks on the image, the owner of the one exclusive lock toggles between the instances. This has potential for severe corruption.

I am not thinking of the case where a qemu instance gets killed or the host dies. I am thinking of network issues where the disconnected old instance of a VM suddenly kicks back in.


However, if I manually acquire an exclusive lock in qemu_rbd_open_image, all other callers get EROFS.

So what's the difference between the exclusive lock that a write request obtains and an exclusive lock created by rbd_lock_acquire? From the rbd lock ls perspective they are identical.
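
As an aside, the owner of that managed lock can also be inspected programmatically. Below is a minimal standalone librbd sketch (not QEMU code; the pool name "rbd" and image name "test" are placeholders) that prints the current lock owner, roughly what rbd lock ls reports, which makes it easy to watch the owner toggle between competing qemu instances:

/* probe_lock.c: print the owner(s) of the managed lock on an RBD image.
 * Build roughly with: cc probe_lock.c -lrados -lrbd */
#include <stdio.h>
#include <rados/librados.h>
#include <rbd/librbd.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t ioctx;
    rbd_image_t image;
    rbd_lock_mode_t mode;
    char *owners[16];
    size_t num_owners = 16;
    int r;

    if (rados_create(&cluster, NULL) < 0 ||
        rados_conf_read_file(cluster, NULL) < 0 ||
        rados_connect(cluster) < 0) {
        return 1;
    }
    if (rados_ioctx_create(cluster, "rbd", &ioctx) < 0) {
        rados_shutdown(cluster);
        return 1;
    }
    if (rbd_open(ioctx, "test", &image, NULL) < 0) {
        rados_ioctx_destroy(ioctx);
        rados_shutdown(cluster);
        return 1;
    }

    /* Same information as "rbd lock ls": lock mode plus owner id(s). */
    r = rbd_lock_get_owners(image, &mode, owners, &num_owners);
    if (r == 0) {
        for (size_t i = 0; i < num_owners; i++) {
            printf("lock owner: %s (mode %d)\n", owners[i], (int)mode);
        }
        rbd_lock_get_owners_cleanup(owners, num_owners);
    } else {
        printf("no lock owner reported (r=%d)\n", r);
    }

    rbd_close(image);
    rados_ioctx_destroy(ioctx);
    rados_shutdown(cluster);
    return 0;
}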


Best

Peter

To whom it may concern, this is my proof-of-concept code to add safe locking.

Tested with live migration, incl. migration from an old version without the locking support.

rbd_lock_acquire can take over the lock that was installed on the first write.


Considerations:

1) Add a switch to enable/disable locking in case someone mounts the image multiple times on purpose (default on?).

2) We might need to check whether we already hold the lock in bdrv_invalidate_cache (see the sketch after this list).

3) I am aware of bdrv_{check,set,abort}_perm. If these are the better hooks, how should they be used with only a single lock?

4) rbd_lock_release is automatically invoked on image close.
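
For 2), a minimal sketch of such a check (the helper name is hypothetical; it assumes the BDRVRBDState/s->image context of block/rbd.c, and rbd_is_exclusive_lock_owner() is the librbd call that reports whether this client currently owns the lock):

/* Hypothetical helper for block/rbd.c: make sure we hold the exclusive
 * lock, e.g. from the invalidate-cache path.  Returns 0 if the lock is
 * already held or was acquired, a negative errno otherwise. */
static int qemu_rbd_ensure_exclusive_lock(BDRVRBDState *s, Error **errp)
{
    int owned = 0;
    int r = rbd_is_exclusive_lock_owner(s->image, &owned);

    if (r < 0) {
        error_setg_errno(errp, -r, "error querying exclusive lock ownership");
        return r;
    }
    if (owned) {
        return 0;   /* nothing to do, we already own the lock */
    }
    r = rbd_lock_acquire(s->image, RBD_LOCK_MODE_EXCLUSIVE);
    if (r < 0) {
        error_setg_errno(errp, -r, "could not acquire exclusive lock");
    }
    return r;
}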


Happy weekend

Peter


---8<---


diff --git a/block/rbd.c b/block/rbd.c
index 240b267c27..13a597e526 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1047,6 +1047,24 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
         goto failed_open;
     }

+    if (!(bdrv_get_flags(bs) & BDRV_O_INACTIVE)) {
+        error_report("qemu_rbd_open: rbd_lock_acquire");
+        r = rbd_lock_acquire(s->image, RBD_LOCK_MODE_EXCLUSIVE);
This introduces a dependency on the exclusive-lock image feature.  For an image without it, this call would always fail with EINVAL.
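
For illustration, the EINVAL could be avoided by gating the acquisition on the image features, roughly like this (a sketch meant for the same spot in qemu_rbd_open, not part of the posted patch; RBD_FEATURE_EXCLUSIVE_LOCK comes from the librbd headers):

/* Sketch: only try to take the lock when the image actually has the
 * exclusive-lock feature; otherwise skip locking instead of failing
 * the open. */
uint64_t features = 0;

r = rbd_get_features(s->image, &features);
if (r == 0 && (features & RBD_FEATURE_EXCLUSIVE_LOCK)) {
    r = rbd_lock_acquire(s->image, RBD_LOCK_MODE_EXCLUSIVE);
} else {
    r = 0;  /* feature absent or unknown: fall back to not locking */
}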

Going beyond that, I'm not sure this patch is needed.  The --exclusive mapping option of krbd and rbd-nbd (which is essentially what is being replicated here) is off by default and hasn't seen much interest among users, at least as far as I'm aware.

Thanks,

                Ilya
I, at least, would urgently need it. I was not aware that the exclusive-lock feature does not guarantee that the image is not mounted twice. There is an old thread from Jason who said that it's not what one would expect and that he hoped it would get implemented as soon as the appropriate hooks were there (I think this was back in 2016 or so).

The reason is that we try to auto-recover from hardware failures. If our system fails to detect such a failure properly for whatever reason and recovers in error, that would cause data corruption.

So I would like to have that feature and would vote for off by default. If someone (like me) enables it, it's okay if the locking fails (with an appropriate error message, of course).
A new "exclusive" option (not a very good name for what it actually
does but I guess we are stuck with it), off by default, is fine with
me -- it would be consistent with the rest of the RBD ecosystem.
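
For illustration, the QEMU side of such an opt-in flag could look roughly like this (a sketch; an "exclusive" boolean member of BlockdevOptionsRbd is an assumption here, not an existing option, and would also need a schema addition in qapi/block-core.json):

/* Sketch: inside qemu_rbd_open(), after the image has been opened.
 * opts->exclusive is a hypothetical new, optional QAPI member that
 * defaults to false, so existing configurations keep today's behaviour. */
if (opts->has_exclusive && opts->exclusive &&
    !(bdrv_get_flags(bs) & BDRV_O_INACTIVE)) {
    r = rbd_lock_acquire(s->image, RBD_LOCK_MODE_EXCLUSIVE);
    if (r < 0) {
        error_setg_errno(errp, -r,
                         "image is exclusively locked by another client");
        goto failed_open;
    }
}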

I have to admit to not understanding the qemu_rbd_co_invalidate_cache change and also consideration 3).  It would be great if you could elaborate on those here or in the repost.


bdrv_invalidate_cache is called when a deactivated block device is reactivated.

Afaik there are only two callers of this function:

a) on live migration at the destination, when the destination VM takes over. In this case the image has been opened with BDRV_O_INACTIVE.

b) if you call the "cont" command on HMP or QMP. The use case is that you want to continue the source VM after a migration has failed after hand-over to the destination VM.


bdrv_inactivate is called during live migration shortly before switch-over to the destination VM.
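
To illustrate the hand-over, a sketch of how the lock could follow that life cycle (hypothetical helpers, not the posted patch; only rbd_lock_release()/rbd_lock_acquire() are real librbd calls):

/* Candidate .bdrv_inactivate implementation: give the lock up shortly
 * before the destination VM takes over. */
static int qemu_rbd_inactivate(BlockDriverState *bs)
{
    BDRVRBDState *s = bs->opaque;

    return rbd_lock_release(s->image);
}

/* Lock-related part that could be added to the existing
 * qemu_rbd_co_invalidate_cache(): re-acquire the lock when the device is
 * reactivated, either on the destination or on a resumed source ("cont"). */
static void qemu_rbd_reacquire_lock(BlockDriverState *bs, Error **errp)
{
    BDRVRBDState *s = bs->opaque;
    int r = rbd_lock_acquire(s->image, RBD_LOCK_MODE_EXCLUSIVE);

    if (r < 0) {
        error_setg_errno(errp, -r, "cannot re-acquire exclusive lock");
    }
}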


For file-based locking there exist the functions mentioned under 3), but they require more fine-grained locking than just a single lock.

You can find the definition here:

https://github.com/qemu/qemu/blob/b49872aa8fc0f3f5a3036cc37aa2cb5c92866f33/include/block/block-common.h#L263
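
As a concrete illustration of the mismatch: the BLK_PERM_* bits defined at the link above are fine-grained, per-user permissions, while an RBD image offers exactly one exclusive lock. A hypothetical .bdrv_check_perm hook would therefore have to collapse several permission bits onto that single lock, roughly like this (a sketch loosely modelled on raw_check_perm() in block/file-posix.c; the function name and the mapping are assumptions):

/* Sketch: map the fine-grained permission bits onto the one exclusive
 * lock an RBD image provides.  Anything that implies writing wants the
 * lock; read-only users are left alone. */
static int qemu_rbd_check_perm(BlockDriverState *bs, uint64_t perm,
                               uint64_t shared, Error **errp)
{
    BDRVRBDState *s = bs->opaque;
    int r;

    if (!(perm & (BLK_PERM_WRITE | BLK_PERM_RESIZE))) {
        return 0;
    }
    r = rbd_lock_acquire(s->image, RBD_LOCK_MODE_EXCLUSIVE);
    if (r < 0) {
        error_setg_errno(errp, -r, "image is locked by another client");
    }
    return r;
}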


Peter





