qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC PATCH 3/8] RFC: block: use transactions as a replacement of ->{


From: Emanuele Giuseppe Esposito
Subject: Re: [RFC PATCH 3/8] RFC: block: use transactions as a replacement of ->{can_}set_aio_context()
Date: Mon, 18 Jul 2022 10:45:47 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.2.0


Am 14/07/2022 um 18:45 schrieb Hanna Reitz:
> On 12.07.22 23:19, Emanuele Giuseppe Esposito wrote:
>> ---------
>> RFC because I am not sure about the AioContext locks.
>> - Do we need to take the new AioContext lock? what does it protect?
>> - Taking the old AioContext lock is required now, because of
>>    bdrv_drained_begin calling AIO_WAIT_WHILE that unlocks the
>>    aiocontext. If we replace it with AIO_WAIT_WHILE_UNLOCKED,
>>    could we get rid of taking every time the old AioContext?
>>    drain would be enough to protect the graph modification.
> 
> It’s been a while, but as far as I remember (which may be wrong), the
> reason for how the locks are supposed to be taken was mostly that we
> need some defined state so that we know how to invoke
> bdrv_drained_begin() and bdrv_drained_end() (i.e. call the first one
> as-is, and switch the locks around for the latter one).

Right, so bdrv_drained_begin must be always under old aiocontext lock,
bdrv_drained_end always under the new one.

If so, then I think I can make the whole thing even cleaner:

1. bdrv_drained_begin is taken in bdrv_change_aio_context, and not in a
transaction. This was the initial idea, but somehow it didn't work,
something was failing. However, I must have changed something now
because all tests are passing :)
Then old aiocontext lock is assumed to be taken by the caller.

2. old aiocontext lock is released

3. set_aio_context (actual detach + attach operation) are kept in a
commit transaciton, as it is now. They don't need any aiocontext lock,
as far as I understand. bdrv_drained_end is implemented in a .clean
transaction

4. new aiocontext lock is taken

5. either tran_abort or tran_commit, depending on the result of the
recursion.

Disadvantage: the unlock/lock mess is still there, but at least we know
now why we need that.
Advantage: nothing should break (because it is similar as it was
before), no double lock held, and I can always add an additional patch
where I use _UNLOCKED functions and see that happens. In addition, no
tran_add_tail needed.

> 
> The idea of using _UNLOCKED sounds interesting, almost too obvious. I
> can’t see why that wouldn’t work, actually.
> 
>> ----------
>>
>> Simplify the way the aiocontext can be changed in a BDS graph.
>> There are currently two problems in bdrv_try_set_aio_context:
>> - There is a confusion of AioContext locks taken and released, because
>>    we assume that old aiocontext is always taken and new one is
>>    taken inside.
> 
> Yep, and that assumption is just broken in some cases, which is the main
> pain point I’m feeling with it right now.
> 
> For example, look at bdrv_attach_child_common(): Here, we attach a child
> to a parent, so we need to get them into a common AioContext. So first
> we try to put the child into the parent’s context, and if that fails,
> we’ll try the other way, putting the parent into the child’s context.
> 
> The problem is clear: The bdrv_try_set_aio_context() call requires us to
> hold the child’s current context but not the parent’s, and the
> child_class->set_aio_ctx() call requires the exact opposite.  But we
> never switch the context we have acquired, so this can’t both work. 
> Better yet, nowhere is it defined what context a caller to
> bdrv_attach_child_common() will hold.
> 
> In practice, what happens here most of the time is that something will
> be moved from the main context to some other context, and since we’re in
> the main context already, that’ll just work.  But you can construct
> cases where something is attempted to be moved from an I/O thread into a
> different thread and then you’ll get a crash.
> 
> I’d be happy if we could do away with the requirement of having to hold
> any lock for changing a node’s AioContext.
> 
>> - It doesn't look very safe to call bdrv_drained_begin while some
>>    nodes have already switched to the new aiocontext and others haven't.
>>    This could be especially dangerous because bdrv_drained_begin
>> polls, so
>>    something else could be executed while graph is in an inconsistent
>>    state.
>>
>> Additional minor nitpick: can_set and set_ callbacks both traverse the
>> graph, both using the ignored list of visited nodes in a different way.
>>
>> Therefore, get rid of all of this and introduce a new callback,
>> change_aio_context, that uses transactions to efficiently, cleanly
>> and most importantly safely change the aiocontext of a graph.
>>
>> This new callback is a "merge" of the two previous ones:
>> - Just like can_set_aio_context, recursively traverses the graph.
>>    Marks all nodes that are visited using a GList, and checks if
>>    they *could* change the aio_context.
>> - For each node that passes the above check, add a new transaction
>>    that implements a callback that effectively changes the aiocontext.
>> - If a node is a BDS, add two transactions: one taking care of draining
>>    the node at the beginning of the list (so that will be executed first)
>>    and one at the end taking care of changing the AioContext.
>> - Once done, the recursive function returns if *all* nodes can change
>>    the AioContext. If so, commit the above transactions. Otherwise don't
>>    do anything.
>> - The transaction list contains first all "drain" transactions, so
>>    we are sure we are draining all nodes in the same context, and then
>>    all the other switch the AioContext. In this way we make sure that
>>    bdrv_drained_begin() is always called under the old AioContext, and
>>    bdrv_drained_end() under the new one.
>> - Because of the above, we don't need to release and re-acquire the
>>    old AioContext every time, as everything is done once (and not
>>    per-node drain and aiocontext change).
>>
>> Note that the "change" API is not yet invoked anywhere.
> 
> So the idea is that we introduce a completely new transaction-based API
> to change BDSs’ AioContext, and then drop the old one, right?
> 
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> ---
>>   block.c                            | 197 +++++++++++++++++++++++++++++
>>   include/block/block-global-state.h |   9 ++
>>   include/block/block_int-common.h   |   3 +
>>   3 files changed, 209 insertions(+)
>>
>> diff --git a/block.c b/block.c
>> index 267a39c0de..bda4e1bcef 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -7437,6 +7437,51 @@ static bool
>> bdrv_parent_can_set_aio_context(BdrvChild *c, AioContext *ctx,
>>       return true;
>>   }
>>   +typedef struct BdrvStateSetAioContext {
>> +    AioContext *new_ctx;
>> +    BlockDriverState *bs;
>> +} BdrvStateSetAioContext;
>> +
>> +/*
>> + * Changes the AioContext used for fd handlers, timers, and BHs by this
>> + * BlockDriverState and all its children and parents.
>> + *
>> + * Must be called from the main AioContext.
>> + *
>> + * The caller must own the AioContext lock for the old AioContext of
>> bs, but it
>> + * must not own the AioContext lock for new_context (unless
>> new_context is the
>> + * same as the current context of bs).
> 
> :(  Too bad.
> 
>> + *
>> + * @visited will accumulate all visited BdrvChild object. The caller is
> 
> *objects
> 
>> + * responsible for freeing the list afterwards.
>> + */
>> +static bool bdrv_parent_change_aio_context(BdrvChild *c, AioContext
>> *ctx,
> 
> The comment says “this BlockDriverState”, but then this function is
> called “_parent_” and takes a BdrvChild object.  Not quite clear what is
> meant.
> 
> You could argue it doesn’t matter because the function acts on all
> parents and children recursively anyway, so it’s doesn’t matter whether
> this is about c->bs or c’s parent, which is true, but still, the wording
> doesn’t seem quite as clear to me as it could be.
> 
> I also wonder why this function is so syntactically focused on BdrvChild
> object when I would have thought that those objects aren’t really
> associated with any AioContext, only the BDS is.  A BdrvChild just gives
> you a way to change a BDS’s parents’ AioContext.  Is it because of
> bdrv_child_try_change_aio_context()’s ignore_child parameter?
> 
>> +                                           GSList **visited,
>> Transaction *tran,
>> +                                           Error **errp)
>> +{
>> +    GLOBAL_STATE_CODE();
>> +    if (g_slist_find(*visited, c)) {
>> +        return true;
>> +    }
>> +    *visited = g_slist_prepend(*visited, c);
>> +
>> +    /*
>> +     * A BdrvChildClass that doesn't handle AioContext changes cannot
>> +     * tolerate any AioContext changes
>> +     */
>> +    if (!c->klass->change_aio_ctx) {
>> +        char *user = bdrv_child_user_desc(c);
>> +        error_setg(errp, "Changing iothreads is not supported by %s",
>> user);
>> +        g_free(user);
>> +        return false;
>> +    }
>> +    if (!c->klass->change_aio_ctx(c, ctx, visited, tran, errp)) {
>> +        assert(!errp || *errp);
>> +        return false;
>> +    }
> 
> This makes me wonder how this is supposed to change the children’s
> AioContext, because traditionally, BdrvChildClass contains only
> callbacks that affect the parent.
> 
>> +    return true;
>> +}
>> +
>>   bool bdrv_child_can_set_aio_context(BdrvChild *c, AioContext *ctx,
>>                                       GSList **ignore, Error **errp)
>>   {
>> @@ -7448,6 +7493,18 @@ bool bdrv_child_can_set_aio_context(BdrvChild
>> *c, AioContext *ctx,
>>       return bdrv_can_set_aio_context(c->bs, ctx, ignore, errp);
>>   }
>>   +bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
>> +                                   GSList **visited, Transaction *tran,
>> +                                   Error **errp)
>> +{
>> +    GLOBAL_STATE_CODE();
>> +    if (g_slist_find(*visited, c)) {
> 
> Not the first time that we’re using linear lookup for such a thing, but
> it really irks me badly.  I think in any language that would provide
> collections by default, we’d be using a hash map here.
> 
>> +        return true;
>> +    }
>> +    *visited = g_slist_prepend(*visited, c);
>> +    return bdrv_change_aio_context(c->bs, ctx, visited, tran, errp);
>> +}
>> +
> 
> OK, so this looks like the function that will change the children’s
> AioContext, and now I guess we’ll have a function that invokes both
> bdrv_{child,parent}_change_aio_context()...  And yes, we do, that is
> bdrv_change_aio_context().  So now it looks to me like the comment
> that’s above bdrv_parent_change_aio_context() actually belongs to
> bdrv_change_aio_context().
> 
>>   /* @ignore will accumulate all visited BdrvChild object. The caller is
>>    * responsible for freeing the list afterwards. */
>>   bool bdrv_can_set_aio_context(BlockDriverState *bs, AioContext *ctx,
>> @@ -7475,6 +7532,99 @@ bool bdrv_can_set_aio_context(BlockDriverState
>> *bs, AioContext *ctx,
>>       return true;
>>   }
>>   +static void bdrv_drained_begin_commit(void *opaque)
>> +{
>> +    BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
>> +
>> +    /* Paired with bdrv_drained_end in bdrv_drained_end_clean() */
>> +    bdrv_drained_begin(state->bs);
>> +}
>> +
>> +static void bdrv_drained_end_clean(void *opaque)
>> +{
>> +    BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
>> +    BlockDriverState *bs = (BlockDriverState *) state->bs;
>> +    AioContext *new_context = state->new_ctx;
>> +
>> +    /*
>> +     * drain only if bdrv_drained_begin_commit() and
>> +     * bdrv_set_aio_context_commit() executed
>> +     */
>> +    if (bdrv_get_aio_context(bs) == new_context) {
>> +        /* Paired with bdrv_drained_begin in
>> bdrv_drained_begin_commit() */
>> +        bdrv_drained_end(bs);
>> +    }
> 
> So as far as I understood, in bdrv_set_aio_context_ignore(), we switch
> around the currently held context because this will poll bs’s context,
> and so because bs’s context changed, we now need to hold the new
> context.  I don’t know off the top of my head whether there’s a problem
> with holding both contexts simultaneously.
> 
>> +
>> +    /* state is freed by set_aio_context->clean() */
>> +}
>> +
>> +static void bdrv_set_aio_context_commit(void *opaque)
>> +{
>> +    BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
>> +    BlockDriverState *bs = (BlockDriverState *) state->bs;
>> +    AioContext *new_context = state->new_ctx;
>> +    assert_bdrv_graph_writable(bs);
>> +
>> +    bdrv_detach_aio_context(bs);
>> +    bdrv_attach_aio_context(bs, new_context);
>> +}
>> +
>> +static TransactionActionDrv set_aio_context = {
>> +    .commit = bdrv_set_aio_context_commit,
>> +    .clean = g_free,
>> +};
>> +
>> +static TransactionActionDrv drained_begin_end = {
>> +    .commit = bdrv_drained_begin_commit,
>> +    .clean = bdrv_drained_end_clean,
>> +};
>> +
>> +/*
>> + * @visited will accumulate all visited BdrvChild object. The caller is
>> + * responsible for freeing the list afterwards.
>> + */
>> +bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
>> +                             GSList **visited, Transaction *tran,
>> +                             Error **errp)
>> +{
>> +    BdrvChild *c;
>> +    BdrvStateSetAioContext *state;
>> +
>> +    GLOBAL_STATE_CODE();
>> +
>> +    if (bdrv_get_aio_context(bs) == ctx) {
>> +        return true;
>> +    }
>> +
>> +    QLIST_FOREACH(c, &bs->parents, next_parent) {
>> +        if (!bdrv_parent_change_aio_context(c, ctx, visited, tran,
>> errp)) {
>> +            return false;
>> +        }
>> +    }
>> +
>> +    QLIST_FOREACH(c, &bs->children, next) {
>> +        if (!bdrv_child_change_aio_context(c, ctx, visited, tran,
>> errp)) {
>> +            return false;
>> +        }
>> +    }
>> +
>> +    state = g_new(BdrvStateSetAioContext, 1);
>> +    *state = (BdrvStateSetAioContext) {
>> +        .new_ctx = ctx,
>> +        .bs = bs,
>> +    };
>> +
>> +    /* First half of transactions are drains */
>> +    tran_add(tran, &drained_begin_end, state);
>> +    /*
>> +     * Second half of transactions takes care of setting the
>> +     * AioContext and free the state
>> +     */
>> +    tran_add_tail(tran, &set_aio_context, state);
> 
> Completely subjective and I struggle to explain my feelings:
> 
> I don’t quite like the “first half, second half” explanation.  This
> makes it sound like these should be separate phases in the transaction,
> and if so, we should have an interface that allows you to put
> transaction handlers into an arbitrary number of phases.
> 
> We don’t have that interface, though, all we have are functions that
> carry the connotation of “execute as soon as possible” and “execute as
> late as possible”, so that if you use both in conjunction like here, you
> can be certain that drained_begin_end is executed before the respective
> set_aio_context.
> 
> The comments also struggle to convey how much black magic this is.
> 
> (Note that this is just my POV; also note that I’m not judging, black
> magic is cool, just difficult to grasp)
> 
> So what’s going on is (AFAIU) that we have this black magic
> drained_begin_end transaction which is added such that its commit
> handler will be run before the set_aio_context’s commit handler, but its
> clean handler will be run after set_aio_context’s commit handler.  So
> that’s why I call it black magic, because drained_begin_end can do both,
> it’ll first begin the drain, and then end it, all around some normal
> other transaction, such that this other one can operate on a completely
> drained graph.
> 
> I don’t hate this, it’s just that the comments don’t convey it.  I think
> there should be a big comment above drained_begin_end describing how
> it’s supposed to be used (i.e. you can wrap some other transaction in
> it, if you take care to add that other transaction with tran_add_tail(),
> and then its commit handlers will run on a drained graph); and then here
> we just describe that we’re doing exactly that, adding both transactions
> such that set_aio_context’s commit handler will run on a drained graph.
> 
> 
> What I don’t like is that a BdrvChildClass.change_aio_ctx implementation
> must know this.  If it adds handlers to the transaction, it must be
> aware that if it does so with tran_add(), the commit handler won’t
> necessarily run on a fully drained graph; it must use tran_add_tail()
> for this, which isn’t obvious (and might be called a layering violation).
> 
> Perhaps that doesn’t matter because none of those .change_aio_ctx commit
> handlers will care, but, well.
> 
> The point is that the drained_begin_end transaction kind of abuses the
> system and adds new non-obvious semantics.
> 
> I can’t think of something better off the top of my head, though, and as
> long as this will remain the only place to call tran_add_tail(), I
> probably shouldn’t care.
> 
> ...Ah, I can see patch 4 and 6 add exactly what I had feared.  Well,
> that isn’t great. :/
> 
>> +
>> +    return true;
>> +}
>> +
>>   int bdrv_child_try_set_aio_context(BlockDriverState *bs, AioContext
>> *ctx,
>>                                      BdrvChild *ignore_child, Error
>> **errp)
>>   {
>> @@ -7498,6 +7648,53 @@ int
>> bdrv_child_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
>>       return 0;
>>   }
>>   +/*
>> + * First, go recursively through all nodes in the graph, and see if they
>> + * can change their AioContext.
>> + * If so, add for each node a new transaction with a callback to
>> change the
>> + * AioContext with the new one.
>> + * Once recursion is completed, if all nodes are allowed to change their
>> + * AioContext then commit the transaction, running all callbacks in
>> order.
>> + * Otherwise don't do anything.
> 
> Nice explanation, but I’d start with something more simple to describe
> the external interface, like “Change bs's and recursively all of its
> parents' and children's AioContext to the given new context, returning
> an error if that isn't possible.  If ignore_child is not NULL, that
> child (and its subgraph) will not be touched.”
> 
>> + *
>> + * This function still requires the caller to take the bs current
>> + * AioContext lock, otherwise draining could fail since AIO_WAIT_WHILE
> 
> Well, let’s just say “will” instead of “could”.  Callers don’t need to
> know they can get lucky when they ignore the rules.
> 
>> + * assumes the lock is always held if bs is in another AioContext.
>> + */
>> +int bdrv_child_try_change_aio_context(BlockDriverState *bs,
>> AioContext *ctx,
>> +                                      BdrvChild *ignore_child, Error
>> **errp)
> 
> Do the other functions need to be public?  Outside of this file, this
> one seems sufficient to me, but I may well be overlooking something.
> 
> Also, why is this function called bdrv_child_*, and not just
> bdrv_try_change_aio_context()?  (Or maybe have a
> bdrv_try_change_aio_context() variant without the ignore_child
> parameter, just like bdrv_try_set_aio_context()?)
> 
> Actually, wouldn’t just bdrv_change_aio_context() be sufficient?  I
> mean, it isn’t called bdrv_try_co_pwritev().  Most functions try to do
> something and return errors if they can’t.  Not sure if we need to make
> that clear in the name.
> 
> (I may be wrong, but I suspect bdrv_try_set_aio_context() is only named
> such that the compiler will check that the bdrv_set_aio_context()
> interface is completely unused;
> 42a65f02f9b380bd8074882d5844d4ea033389cc’s commit message does seem to
> confirm that.)
> 
>> +{
>> +    Transaction *tran;
>> +    GSList *visited;
>> +    int ret;
>> +    GLOBAL_STATE_CODE();
>> +
>> +    tran = tran_new();
>> +    visited = ignore_child ? g_slist_prepend(NULL, ignore_child) : NULL;
>> +    ret = bdrv_change_aio_context(bs, ctx, &visited, tran, errp);
>> +    g_slist_free(visited);
>> +
>> +    if (!ret) {
>> +        /* just run clean() callbacks */
>> +        tran_abort(tran);
>> +        return -EPERM;
>> +    }
>> +
>> +    /* Acquire the new context, if necessary */
>> +    /* TODO: why do we need to take this AioContext lock? */
> 
> As I’ve said somewhere above, I think it’s because we need it for
> bdrv_drained_end().  I don’t know if there’s a problem with holding both
> contexts simultaneously.  (I would’ve just assumed that there must be a
> reason for bdrv_set_aio_context_ignore() to release old_context, but
> maybe there is actually no reason to it?)
> 
>> +    if (qemu_get_aio_context() != ctx) {
>> +        aio_context_acquire(ctx);
>> +    }
>> +
>> +    tran_commit(tran);
>> +
>> +    if (qemu_get_aio_context() != ctx) {
>> +        aio_context_release(ctx);
>> +    }
>> +
>> +    return 0;
>> +}
> 
> Might be nice to have a version that has @tran in its interface so
> callers can incorporate it; Vladimir wanted transactionable
> context-setting.  Then again, doesn’t seem to hard to introduce that
> when needed.
> 
>> +
>>   int bdrv_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
>>                                Error **errp)
>>   {
> 
> I like the idea of reimplementing it all on top of transactions!  I
> think it needs some tuning, but it definitely looks doable.
> 
> Hanna
> 




reply via email to

[Prev in Thread] Current Thread [Next in Thread]