qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC PATCH 3/8] RFC: block: use transactions as a replacement of ->{


From: Hanna Reitz
Subject: Re: [RFC PATCH 3/8] RFC: block: use transactions as a replacement of ->{can_}set_aio_context()
Date: Thu, 14 Jul 2022 18:45:33 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.10.0

On 12.07.22 23:19, Emanuele Giuseppe Esposito wrote:
---------
RFC because I am not sure about the AioContext locks.
- Do we need to take the new AioContext lock? what does it protect?
- Taking the old AioContext lock is required now, because of
   bdrv_drained_begin calling AIO_WAIT_WHILE that unlocks the
   aiocontext. If we replace it with AIO_WAIT_WHILE_UNLOCKED,
   could we get rid of taking every time the old AioContext?
   drain would be enough to protect the graph modification.

It’s been a while, but as far as I remember (which may be wrong), the reason for how the locks are supposed to be taken was mostly that we need some defined state so that we know how to invoke bdrv_drained_begin() and bdrv_drained_end() (i.e. call the first one as-is, and switch the locks around for the latter one).

The idea of using _UNLOCKED sounds interesting, almost too obvious. I can’t see why that wouldn’t work, actually.

----------

Simplify the way the aiocontext can be changed in a BDS graph.
There are currently two problems in bdrv_try_set_aio_context:
- There is a confusion of AioContext locks taken and released, because
   we assume that old aiocontext is always taken and new one is
   taken inside.

Yep, and that assumption is just broken in some cases, which is the main pain point I’m feeling with it right now.

For example, look at bdrv_attach_child_common(): Here, we attach a child to a parent, so we need to get them into a common AioContext. So first we try to put the child into the parent’s context, and if that fails, we’ll try the other way, putting the parent into the child’s context.

The problem is clear: The bdrv_try_set_aio_context() call requires us to hold the child’s current context but not the parent’s, and the child_class->set_aio_ctx() call requires the exact opposite.  But we never switch the context we have acquired, so this can’t both work.  Better yet, nowhere is it defined what context a caller to bdrv_attach_child_common() will hold.

In practice, what happens here most of the time is that something will be moved from the main context to some other context, and since we’re in the main context already, that’ll just work.  But you can construct cases where something is attempted to be moved from an I/O thread into a different thread and then you’ll get a crash.

I’d be happy if we could do away with the requirement of having to hold any lock for changing a node’s AioContext.

- It doesn't look very safe to call bdrv_drained_begin while some
   nodes have already switched to the new aiocontext and others haven't.
   This could be especially dangerous because bdrv_drained_begin polls, so
   something else could be executed while graph is in an inconsistent
   state.

Additional minor nitpick: can_set and set_ callbacks both traverse the
graph, both using the ignored list of visited nodes in a different way.

Therefore, get rid of all of this and introduce a new callback,
change_aio_context, that uses transactions to efficiently, cleanly
and most importantly safely change the aiocontext of a graph.

This new callback is a "merge" of the two previous ones:
- Just like can_set_aio_context, recursively traverses the graph.
   Marks all nodes that are visited using a GList, and checks if
   they *could* change the aio_context.
- For each node that passes the above check, add a new transaction
   that implements a callback that effectively changes the aiocontext.
- If a node is a BDS, add two transactions: one taking care of draining
   the node at the beginning of the list (so that will be executed first)
   and one at the end taking care of changing the AioContext.
- Once done, the recursive function returns if *all* nodes can change
   the AioContext. If so, commit the above transactions. Otherwise don't
   do anything.
- The transaction list contains first all "drain" transactions, so
   we are sure we are draining all nodes in the same context, and then
   all the other switch the AioContext. In this way we make sure that
   bdrv_drained_begin() is always called under the old AioContext, and
   bdrv_drained_end() under the new one.
- Because of the above, we don't need to release and re-acquire the
   old AioContext every time, as everything is done once (and not
   per-node drain and aiocontext change).

Note that the "change" API is not yet invoked anywhere.

So the idea is that we introduce a completely new transaction-based API to change BDSs’ AioContext, and then drop the old one, right?

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
  block.c                            | 197 +++++++++++++++++++++++++++++
  include/block/block-global-state.h |   9 ++
  include/block/block_int-common.h   |   3 +
  3 files changed, 209 insertions(+)

diff --git a/block.c b/block.c
index 267a39c0de..bda4e1bcef 100644
--- a/block.c
+++ b/block.c
@@ -7437,6 +7437,51 @@ static bool bdrv_parent_can_set_aio_context(BdrvChild 
*c, AioContext *ctx,
      return true;
  }
+typedef struct BdrvStateSetAioContext {
+    AioContext *new_ctx;
+    BlockDriverState *bs;
+} BdrvStateSetAioContext;
+
+/*
+ * Changes the AioContext used for fd handlers, timers, and BHs by this
+ * BlockDriverState and all its children and parents.
+ *
+ * Must be called from the main AioContext.
+ *
+ * The caller must own the AioContext lock for the old AioContext of bs, but it
+ * must not own the AioContext lock for new_context (unless new_context is the
+ * same as the current context of bs).

:(  Too bad.

+ *
+ * @visited will accumulate all visited BdrvChild object. The caller is

*objects

+ * responsible for freeing the list afterwards.
+ */
+static bool bdrv_parent_change_aio_context(BdrvChild *c, AioContext *ctx,

The comment says “this BlockDriverState”, but then this function is called “_parent_” and takes a BdrvChild object.  Not quite clear what is meant.

You could argue it doesn’t matter because the function acts on all parents and children recursively anyway, so it’s doesn’t matter whether this is about c->bs or c’s parent, which is true, but still, the wording doesn’t seem quite as clear to me as it could be.

I also wonder why this function is so syntactically focused on BdrvChild object when I would have thought that those objects aren’t really associated with any AioContext, only the BDS is.  A BdrvChild just gives you a way to change a BDS’s parents’ AioContext.  Is it because of bdrv_child_try_change_aio_context()’s ignore_child parameter?

+                                           GSList **visited, Transaction *tran,
+                                           Error **errp)
+{
+    GLOBAL_STATE_CODE();
+    if (g_slist_find(*visited, c)) {
+        return true;
+    }
+    *visited = g_slist_prepend(*visited, c);
+
+    /*
+     * A BdrvChildClass that doesn't handle AioContext changes cannot
+     * tolerate any AioContext changes
+     */
+    if (!c->klass->change_aio_ctx) {
+        char *user = bdrv_child_user_desc(c);
+        error_setg(errp, "Changing iothreads is not supported by %s", user);
+        g_free(user);
+        return false;
+    }
+    if (!c->klass->change_aio_ctx(c, ctx, visited, tran, errp)) {
+        assert(!errp || *errp);
+        return false;
+    }

This makes me wonder how this is supposed to change the children’s AioContext, because traditionally, BdrvChildClass contains only callbacks that affect the parent.

+    return true;
+}
+
  bool bdrv_child_can_set_aio_context(BdrvChild *c, AioContext *ctx,
                                      GSList **ignore, Error **errp)
  {
@@ -7448,6 +7493,18 @@ bool bdrv_child_can_set_aio_context(BdrvChild *c, 
AioContext *ctx,
      return bdrv_can_set_aio_context(c->bs, ctx, ignore, errp);
  }
+bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
+                                   GSList **visited, Transaction *tran,
+                                   Error **errp)
+{
+    GLOBAL_STATE_CODE();
+    if (g_slist_find(*visited, c)) {

Not the first time that we’re using linear lookup for such a thing, but it really irks me badly.  I think in any language that would provide collections by default, we’d be using a hash map here.

+        return true;
+    }
+    *visited = g_slist_prepend(*visited, c);
+    return bdrv_change_aio_context(c->bs, ctx, visited, tran, errp);
+}
+

OK, so this looks like the function that will change the children’s AioContext, and now I guess we’ll have a function that invokes both bdrv_{child,parent}_change_aio_context()...  And yes, we do, that is bdrv_change_aio_context().  So now it looks to me like the comment that’s above bdrv_parent_change_aio_context() actually belongs to bdrv_change_aio_context().

  /* @ignore will accumulate all visited BdrvChild object. The caller is
   * responsible for freeing the list afterwards. */
  bool bdrv_can_set_aio_context(BlockDriverState *bs, AioContext *ctx,
@@ -7475,6 +7532,99 @@ bool bdrv_can_set_aio_context(BlockDriverState *bs, 
AioContext *ctx,
      return true;
  }
+static void bdrv_drained_begin_commit(void *opaque)
+{
+    BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
+
+    /* Paired with bdrv_drained_end in bdrv_drained_end_clean() */
+    bdrv_drained_begin(state->bs);
+}
+
+static void bdrv_drained_end_clean(void *opaque)
+{
+    BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
+    BlockDriverState *bs = (BlockDriverState *) state->bs;
+    AioContext *new_context = state->new_ctx;
+
+    /*
+     * drain only if bdrv_drained_begin_commit() and
+     * bdrv_set_aio_context_commit() executed
+     */
+    if (bdrv_get_aio_context(bs) == new_context) {
+        /* Paired with bdrv_drained_begin in bdrv_drained_begin_commit() */
+        bdrv_drained_end(bs);
+    }

So as far as I understood, in bdrv_set_aio_context_ignore(), we switch around the currently held context because this will poll bs’s context, and so because bs’s context changed, we now need to hold the new context.  I don’t know off the top of my head whether there’s a problem with holding both contexts simultaneously.

+
+    /* state is freed by set_aio_context->clean() */
+}
+
+static void bdrv_set_aio_context_commit(void *opaque)
+{
+    BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
+    BlockDriverState *bs = (BlockDriverState *) state->bs;
+    AioContext *new_context = state->new_ctx;
+    assert_bdrv_graph_writable(bs);
+
+    bdrv_detach_aio_context(bs);
+    bdrv_attach_aio_context(bs, new_context);
+}
+
+static TransactionActionDrv set_aio_context = {
+    .commit = bdrv_set_aio_context_commit,
+    .clean = g_free,
+};
+
+static TransactionActionDrv drained_begin_end = {
+    .commit = bdrv_drained_begin_commit,
+    .clean = bdrv_drained_end_clean,
+};
+
+/*
+ * @visited will accumulate all visited BdrvChild object. The caller is
+ * responsible for freeing the list afterwards.
+ */
+bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
+                             GSList **visited, Transaction *tran,
+                             Error **errp)
+{
+    BdrvChild *c;
+    BdrvStateSetAioContext *state;
+
+    GLOBAL_STATE_CODE();
+
+    if (bdrv_get_aio_context(bs) == ctx) {
+        return true;
+    }
+
+    QLIST_FOREACH(c, &bs->parents, next_parent) {
+        if (!bdrv_parent_change_aio_context(c, ctx, visited, tran, errp)) {
+            return false;
+        }
+    }
+
+    QLIST_FOREACH(c, &bs->children, next) {
+        if (!bdrv_child_change_aio_context(c, ctx, visited, tran, errp)) {
+            return false;
+        }
+    }
+
+    state = g_new(BdrvStateSetAioContext, 1);
+    *state = (BdrvStateSetAioContext) {
+        .new_ctx = ctx,
+        .bs = bs,
+    };
+
+    /* First half of transactions are drains */
+    tran_add(tran, &drained_begin_end, state);
+    /*
+     * Second half of transactions takes care of setting the
+     * AioContext and free the state
+     */
+    tran_add_tail(tran, &set_aio_context, state);

Completely subjective and I struggle to explain my feelings:

I don’t quite like the “first half, second half” explanation.  This makes it sound like these should be separate phases in the transaction, and if so, we should have an interface that allows you to put transaction handlers into an arbitrary number of phases.

We don’t have that interface, though, all we have are functions that carry the connotation of “execute as soon as possible” and “execute as late as possible”, so that if you use both in conjunction like here, you can be certain that drained_begin_end is executed before the respective set_aio_context.

The comments also struggle to convey how much black magic this is.

(Note that this is just my POV; also note that I’m not judging, black magic is cool, just difficult to grasp)

So what’s going on is (AFAIU) that we have this black magic drained_begin_end transaction which is added such that its commit handler will be run before the set_aio_context’s commit handler, but its clean handler will be run after set_aio_context’s commit handler.  So that’s why I call it black magic, because drained_begin_end can do both, it’ll first begin the drain, and then end it, all around some normal other transaction, such that this other one can operate on a completely drained graph.

I don’t hate this, it’s just that the comments don’t convey it.  I think there should be a big comment above drained_begin_end describing how it’s supposed to be used (i.e. you can wrap some other transaction in it, if you take care to add that other transaction with tran_add_tail(), and then its commit handlers will run on a drained graph); and then here we just describe that we’re doing exactly that, adding both transactions such that set_aio_context’s commit handler will run on a drained graph.


What I don’t like is that a BdrvChildClass.change_aio_ctx implementation must know this.  If it adds handlers to the transaction, it must be aware that if it does so with tran_add(), the commit handler won’t necessarily run on a fully drained graph; it must use tran_add_tail() for this, which isn’t obvious (and might be called a layering violation).

Perhaps that doesn’t matter because none of those .change_aio_ctx commit handlers will care, but, well.

The point is that the drained_begin_end transaction kind of abuses the system and adds new non-obvious semantics.

I can’t think of something better off the top of my head, though, and as long as this will remain the only place to call tran_add_tail(), I probably shouldn’t care.

...Ah, I can see patch 4 and 6 add exactly what I had feared.  Well, that isn’t great. :/

+
+    return true;
+}
+
  int bdrv_child_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
                                     BdrvChild *ignore_child, Error **errp)
  {
@@ -7498,6 +7648,53 @@ int bdrv_child_try_set_aio_context(BlockDriverState *bs, 
AioContext *ctx,
      return 0;
  }
+/*
+ * First, go recursively through all nodes in the graph, and see if they
+ * can change their AioContext.
+ * If so, add for each node a new transaction with a callback to change the
+ * AioContext with the new one.
+ * Once recursion is completed, if all nodes are allowed to change their
+ * AioContext then commit the transaction, running all callbacks in order.
+ * Otherwise don't do anything.

Nice explanation, but I’d start with something more simple to describe the external interface, like “Change bs's and recursively all of its parents' and children's AioContext to the given new context, returning an error if that isn't possible.  If ignore_child is not NULL, that child (and its subgraph) will not be touched.”

+ *
+ * This function still requires the caller to take the bs current
+ * AioContext lock, otherwise draining could fail since AIO_WAIT_WHILE

Well, let’s just say “will” instead of “could”.  Callers don’t need to know they can get lucky when they ignore the rules.

+ * assumes the lock is always held if bs is in another AioContext.
+ */
+int bdrv_child_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
+                                      BdrvChild *ignore_child, Error **errp)

Do the other functions need to be public?  Outside of this file, this one seems sufficient to me, but I may well be overlooking something.

Also, why is this function called bdrv_child_*, and not just bdrv_try_change_aio_context()?  (Or maybe have a bdrv_try_change_aio_context() variant without the ignore_child parameter, just like bdrv_try_set_aio_context()?)

Actually, wouldn’t just bdrv_change_aio_context() be sufficient?  I mean, it isn’t called bdrv_try_co_pwritev().  Most functions try to do something and return errors if they can’t.  Not sure if we need to make that clear in the name.

(I may be wrong, but I suspect bdrv_try_set_aio_context() is only named such that the compiler will check that the bdrv_set_aio_context() interface is completely unused; 42a65f02f9b380bd8074882d5844d4ea033389cc’s commit message does seem to confirm that.)

+{
+    Transaction *tran;
+    GSList *visited;
+    int ret;
+    GLOBAL_STATE_CODE();
+
+    tran = tran_new();
+    visited = ignore_child ? g_slist_prepend(NULL, ignore_child) : NULL;
+    ret = bdrv_change_aio_context(bs, ctx, &visited, tran, errp);
+    g_slist_free(visited);
+
+    if (!ret) {
+        /* just run clean() callbacks */
+        tran_abort(tran);
+        return -EPERM;
+    }
+
+    /* Acquire the new context, if necessary */
+    /* TODO: why do we need to take this AioContext lock? */

As I’ve said somewhere above, I think it’s because we need it for bdrv_drained_end().  I don’t know if there’s a problem with holding both contexts simultaneously.  (I would’ve just assumed that there must be a reason for bdrv_set_aio_context_ignore() to release old_context, but maybe there is actually no reason to it?)

+    if (qemu_get_aio_context() != ctx) {
+        aio_context_acquire(ctx);
+    }
+
+    tran_commit(tran);
+
+    if (qemu_get_aio_context() != ctx) {
+        aio_context_release(ctx);
+    }
+
+    return 0;
+}

Might be nice to have a version that has @tran in its interface so callers can incorporate it; Vladimir wanted transactionable context-setting.  Then again, doesn’t seem to hard to introduce that when needed.

+
  int bdrv_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
                               Error **errp)
  {

I like the idea of reimplementing it all on top of transactions!  I think it needs some tuning, but it definitely looks doable.

Hanna




reply via email to

[Prev in Thread] Current Thread [Next in Thread]