qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH v6 35/42] block: Fix check_to_replace_node()


From: Max Reitz
Subject: Re: [Qemu-devel] [PATCH v6 35/42] block: Fix check_to_replace_node()
Date: Fri, 16 Aug 2019 15:30:11 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.8.0

On 16.08.19 13:01, Vladimir Sementsov-Ogievskiy wrote:
> 15.08.2019 20:01, Max Reitz wrote:
>> On 15.08.19 17:21, Vladimir Sementsov-Ogievskiy wrote:
>>> 09.08.2019 19:14, Max Reitz wrote:
>>>> Currently, check_to_replace_node() only allows mirror to replace a node
>>>> in the chain of the source node, and only if it is the first non-filter
>>>> node below the source.  Well, technically, the idea is that you can
>>>> exactly replace a quorum child by mirroring from quorum.
>>>>
>>>> This has (probably) two reasons:
>>>> (1) We do not want to create loops.
>>>> (2) @replaces and @device should have exactly the same content so
>>>>       replacing them does not cause visible data to change.
>>>>
>>>> This has two issues:
>>>> (1) It is overly restrictive.  It is completely fine for @replaces to be
>>>>       a filter.
>>>> (2) It is not restrictive enough.  You can create loops with this as
>>>>       follows:
>>>>
>>>> $ qemu-img create -f qcow2 /tmp/source.qcow2 64M
>>>> $ qemu-system-x86_64 -qmp stdio
>>>> {"execute": "qmp_capabilities"}
>>>> {"execute": "object-add",
>>>>    "arguments": {"qom-type": "throttle-group", "id": "tg0"}}
>>>> {"execute": "blockdev-add",
>>>>    "arguments": {
>>>>        "node-name": "source",
>>>>        "driver": "throttle",
>>>>        "throttle-group": "tg0",
>>>>        "file": {
>>>>            "node-name": "filtered",
>>>>            "driver": "qcow2",
>>>>            "file": {
>>>>                "driver": "file",
>>>>                "filename": "/tmp/source.qcow2"
>>>>            } } } }
>>>> {"execute": "drive-mirror",
>>>>    "arguments": {
>>>>        "job-id": "mirror",
>>>>        "device": "source",
>>>>        "target": "/tmp/target.qcow2",
>>>>        "format": "qcow2",
>>>>        "node-name": "target",
>>>>        "sync" :"none",
>>>>        "replaces": "filtered"
>>>>    } }
>>>> {"execute": "block-job-complete", "arguments": {"device": "mirror"}}
>>>>
>>>> And qemu crashes because of a stack overflow due to the loop being
>>>> created (target's backing file is source, so when it replaces filtered,
>>>> it points to itself through source).
>>>>
>>>> (blockdev-mirror can be broken similarly.)
>>>>
>>>> So let us make the checks for the two conditions above explicit, which
>>>> makes the whole function exactly as restrictive as it needs to be.
>>>>
>>>> Signed-off-by: Max Reitz <address@hidden>
>>>> ---
>>>>    include/block/block.h |  1 +
>>>>    block.c               | 83 +++++++++++++++++++++++++++++++++++++++----
>>>>    blockdev.c            | 34 ++++++++++++++++--
>>>>    3 files changed, 110 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/include/block/block.h b/include/block/block.h
>>>> index 6ba853fb90..8da706cd89 100644
>>>> --- a/include/block/block.h
>>>> +++ b/include/block/block.h
>>>> @@ -404,6 +404,7 @@ bool bdrv_is_first_non_filter(BlockDriverState 
>>>> *candidate);
>>>>    
>>>>    /* check if a named node can be replaced when doing drive-mirror */
>>>>    BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
>>>> +                                        BlockDriverState *backing_bs,
>>>>                                            const char *node_name, Error 
>>>> **errp);
>>>>    
>>>>    /* async block I/O */
>>>> diff --git a/block.c b/block.c
>>>> index 915b80153c..4858d3e718 100644
>>>> --- a/block.c
>>>> +++ b/block.c
>>>> @@ -6290,7 +6290,59 @@ bool bdrv_is_first_non_filter(BlockDriverState 
>>>> *candidate)
>>>>        return false;
>>>>    }
>>>>    
>>>> +static bool is_child_of(BlockDriverState *child, BlockDriverState *parent)
>>>> +{
>>>> +    BdrvChild *c;
>>>> +
>>>> +    if (!parent) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    QLIST_FOREACH(c, &parent->children, next) {
>>>> +        if (c->bs == child || is_child_of(child, c->bs)) {
>>>> +            return true;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return false;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Return true if there are only filters in [@top, @base).  Note that
>>>> + * this may include quorum (which bdrv_chain_contains() cannot
>>>> + * handle).
>>>
>>> More presizely: return true if exists chain of filters from top to base or 
>>> if
>>> top == base.
>>>
>>> I keep in mind backup-top filter:
>>>
>>> [backup-top]
>>> |          \target
>>
>> backup-top can’t be a filter if it has two children with different
>> contents, though.
> 
> Why? target is special child, unrelated to what is read/written over 
> backup-top.
> It's an own business of backup-top.
> 
>>
>> (commit-top and mirror-top aren’t filters either.)
> 
> Ahm, I missed something. They have is_filter = true and their children 
> considered
> to be filtered-rw children in your series? And than, who they are? Format 
> nodes?
> And how they appears in backing chains than?

Er, right, I remember, I made them filters in patch 1 of this series. m( :-)

But the chain would still be unique, in a sense, because backup-top only
has one filtered child, so you could go down the chain with
bdrv_filtered_rw_child().

This function doesn’t do that because of Quorum, which is actually a
better example.  All of its children are filtered, so we must consider
all of them.

But backup-top is actually a reason why this function is wrong as it is;
the target is not a filtered child, so it shouldn’t return true there.

Hmmmm.

Actually, bdrv_recurse_is_first_non_filter() does nearly what we want.
(Which is why it was used here.)  The only problem is that it expects
@candidate to be a non-filter (as the name implies).  But we don’t care
about that, actually.

I suppose I can just turn bdrv_recurse_is_first_non_filter() into
bdrv_is_child_of(); it has only two callers, one is here, the other is
bdrv_is_first_non_filter().  In the latter, we can just check whether
@candidate is a filter and return false if it isn’t.

>> That’s why there must be a unique chain [@top, @base).
>>
>> I should probably not that it will return true if top == base, though, yes.
>>
>>> |backing    -------->[target]
>>> V                    /
>>> [source]  <---------/backing
>>>
>>>> + */
>>>> +static bool is_filtered_child(BlockDriverState *top, BlockDriverState 
>>>> *base)
>>>> +{
>>>> +    BdrvChild *c;
>>>> +
>>>> +    if (!top) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    if (top == base) {
>>>> +        return true;
>>>> +    }
>>>> +
>>>> +    if (!top->drv->is_filter) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    QLIST_FOREACH(c, &top->children, next) {
>>>> +        if (is_filtered_child(c->bs, base)) {
>>>> +            return true;
>>>> +        }
>>>> +    }
>>>
>>> interesting, how much is it better to somehow reuse DFS search written in 
>>> should_update_child()..
>>> [just note, don't do it in these series please]
>>>
>>>> +
>>>> +    return false;
>>>> +}
>>>> +
>>>> +/*
>>>> + * @parent_bs is mirror's source BDS, @backing_bs is the BDS which
>>>> + * will be attached to the target when mirror completes.
>>>> + */
>>>>    BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
>>>> +                                        BlockDriverState *backing_bs,
>>>>                                            const char *node_name, Error 
>>>> **errp)
>>>>    {
>>>>        BlockDriverState *to_replace_bs = bdrv_find_node(node_name);
>>>> @@ -6309,13 +6361,32 @@ BlockDriverState 
>>>> *check_to_replace_node(BlockDriverState *parent_bs,
>>>>            goto out;
>>>>        }
>>>>    
>>>> -    /* We don't want arbitrary node of the BDS chain to be replaced only 
>>>> the top
>>>> -     * most non filter in order to prevent data corruption.
>>>> -     * Another benefit is that this tests exclude backing files which are
>>>> -     * blocked by the backing blockers.
>>>> +    /*
>>>> +     * If to_replace_bs is (recursively) a child of backing_bs,
>>>> +     * replacing it may create a loop.  We cannot allow that.
>>>>         */
>>>> -    if (!bdrv_recurse_is_first_non_filter(parent_bs, to_replace_bs)) {
>>>> -        error_setg(errp, "Only top most non filter can be replaced");
>>>> +    if (to_replace_bs == backing_bs || is_child_of(to_replace_bs, 
>>>> backing_bs)) {
>>>
>>> first condition is covered by second, so first may be omitted.
>>
>> It is not.  is_child_of() does not return true if child == parent.
>>
>>>> +        error_setg(errp, "Replacing this node would result in a loop");
>>>> +        to_replace_bs = NULL;
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * Mirror is designed in such a way that when it completes, the
>>>> +     * source BDS is seamlessly replaced.
>>>
>>> Not source but to_replace_bs is replaced?
>>
>> It has originally been designed to replace the source.  If it could
>> replace any arbitrary BDS, all of this would be moot.
> 
> quorum child, you saying about in commit message?

Which is not any arbitrary BDS, but one that looks exactly like the source.

My point is, mirror has been *designed* to replace the source
seamlessly.  It can do more things today, but that was its original point.

That means that the target must be exactly the same as the source.  And
then we come to this:

>>>> It is therefore not allowed
>>>> +     * to replace a BDS where this condition would be violated, as that
>>>> +     * would defeat the purpose of mirror and could lead to data
>>>> +     * corruption.
>>>> +     * Therefore, between parent_bs and to_replace_bs there may be
>>>> +     * only filters (and the one on top must be a filter, too), so
>>>> +     * their data always stays in sync and mirror can complete and
>>>> +     * replace to_replace_bs without any possible corruptions.

So replacing a node that’s connected to the source only through filters
is fine because that means the replaced node will also have the same
content as the source.


How about I replace the first paragraph with:

At the end of the mirror job, the target exhibits exactly the same
content as the source, so it can replace the source node seamlessly.  It
cannot replace a BDS that differs in content, as that could lead to data
corruption.

?

>>>> +     */
>>>> +    if (!is_filtered_child(parent_bs, to_replace_bs) &&
>>>> +        !is_filtered_child(to_replace_bs, parent_bs))
>>>> +    {
>>>> +        error_setg(errp, "The node to be replaced must be connected to 
>>>> the "
>>>> +                   "source through filter nodes only");
>>>
>>> "and the one on top must be a filter, too" not mentioned in the error..
>>
>> Well, unless the source node is the node to be replaced.  Hm...  This
>> gets very hard to express.  I think I’d prefer to keep this as it is,
>> even though it is not quite correct, unless you have a better suggestion
>> of what to report. :-/
> 
> I can't imaging something better than just add "(and the one on top must be a 
> filter, too)"

The problem is that “the one on top” wouldn’t sound very clear to me as
a user.

Maybe include the explanation à la “The node to be replaced must be
connected to the source through filter nodes only, so its data is the
exact same at all times”?  Maybe then users can guess what this
“connected” means exactly.

>>>>            to_replace_bs = NULL;
>>>>            goto out;
>>>>        }
>>>> diff --git a/blockdev.c b/blockdev.c
>>>> index 4e72f6f701..758e0b5431 100644
>>>> --- a/blockdev.c
>>>> +++ b/blockdev.c
>>>> @@ -3887,7 +3887,7 @@ static void blockdev_mirror_common(const char 
>>>> *job_id, BlockDriverState *bs,
>>>>        }
>>>>    
>>>>        if (has_replaces) {
>>>> -        BlockDriverState *to_replace_bs;
>>>> +        BlockDriverState *to_replace_bs, *backing_bs;
>>>>            AioContext *replace_aio_context;
>>>>            int64_t bs_size, replace_size;
>>>>    
>>>> @@ -3897,7 +3897,37 @@ static void blockdev_mirror_common(const char 
>>>> *job_id, BlockDriverState *bs,
>>>>                return;
>>>>            }
>>>>    
>>>> -        to_replace_bs = check_to_replace_node(bs, replaces, errp);
>>>> +        if (backing_mode == MIRROR_SOURCE_BACKING_CHAIN ||
>>>> +            backing_mode == MIRROR_OPEN_BACKING_CHAIN)
>>>> +        {
>>>> +            /*
>>>> +             * While we do not quite know what OPEN_BACKING_CHAIN
>>>> +             * (used for mode=existing) will yield, it is probably
>>>> +             * best to restrict it exactly like SOURCE_BACKING_CHAIN,
>>>> +             * because that is our best guess.
>>>> +             */
>>>> +            switch (sync) {
>>>> +            case MIRROR_SYNC_MODE_FULL:
>>>> +                backing_bs = NULL;
>>>> +                break;
>>>> +
>>>> +            case MIRROR_SYNC_MODE_TOP:
>>>> +                backing_bs = 
>>>> bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs));
>>>
>>> why not  bdrv_backing_chain_next(bs) like in mirror_start?
>>
>> Good question.  I suppose it should be
>> bdrv_filtered_cow_bs(bdrv_backing_chain_next(bs)) in mirror_start()?
> 
> You mean bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs)), I hope)

Er, yes, sure.

>> Because with sync=top, we just want to remove the topmost COW node (and
>> filters on top), but keep filters behind it.
>>
> 
> Agreed.

OK.

Max

Attachment: signature.asc
Description: OpenPGP digital signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]