From: Vladimir Sementsov-Ogievskiy
Subject: Re: [Qemu-block] [PATCH v5 11/11] block/backup: use backup-top instead of write notifiers
Date: Mon, 28 Jan 2019 16:44:07 +0000

28.01.2019 18:59, Max Reitz wrote:
> On 28.01.19 12:29, Vladimir Sementsov-Ogievskiy wrote:
>> 18.01.2019 17:56, Max Reitz wrote:
>>> On 29.12.18 13:20, Vladimir Sementsov-Ogievskiy wrote:
>>>> Drop write notifiers and use filter node instead. Changes:
>>>>
>>>> 1. copy-before-write is now handled by the filter node, so drop all
>>>>      is_write_notifier arguments.
>>>>
>>>> 2. we no longer have intersecting requests, so their handling is
>>>> dropped. Instead, synchronization works as follows: when backup or
>>>> backup-top starts copying some area, it first clears the corresponding
>>>> copy-bitmap bits, and nobody touches areas not marked dirty in the
>>>> copy-bitmap, so there is no intersection. Also, backup job copy
>>>> operations are surrounded by a bdrv region lock, which actually
>>>> serializes requests, so we neither interfere with guest writes nor
>>>> read changed data from the source (before reading we clear the
>>>> corresponding bit in the copy-bitmap, so the area is no longer handled
>>>> by backup-top).
>>>>
>>>> 3. To sync with in-flight requests we now just drain the hook node; we
>>>> don't need the rw-lock.
>>>>
>>>> 4. After the whole backup loop (top, full, incremental modes), we need
>>>> to check for clusters that were not copied: clusters that were under a
>>>> backup-top operation and were skipped, but where the backup-top
>>>> operation failed, the error was returned to the guest, and the dirty
>>>> bits were set back.
>>>>
>>>> 5. Don't create additional blk, use backup-top children for copy
>>>> operations.
>>>>
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <address@hidden>
>>>> ---
>>>>    block/backup.c | 285 ++++++++++++++++++++++++++-----------------------
>>>>    1 file changed, 149 insertions(+), 136 deletions(-)
>>>>
>>>> diff --git a/block/backup.c b/block/backup.c
>>>> index 88c0242b4e..e332909fb7 100644
>>>> --- a/block/backup.c
>>>> +++ b/block/backup.c
>>>
>>> [...]
>>>
>>>> @@ -300,21 +231,23 @@ static void backup_abort(Job *job)
>>>>    static void backup_clean(Job *job)
>>>>    {
>>>>        BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
>>>> -    assert(s->target);
>>>> -    blk_unref(s->target);
>>>> +
>>>> +    /* We must clean it to not crash in backup_drain. */
>>>>        s->target = NULL;
>>>
>>> Why not set s->source to NULL along with it?  It makes sense if you're
>>> going to drop the backup-top node because both of these are its children.
>>
>> agree.
>>
>>>
>>>>    
>>>>        if (s->copy_bitmap) {
>>>>            hbitmap_free(s->copy_bitmap);
>>>>            s->copy_bitmap = NULL;
>>>>        }
>>>> +
>>>> +    bdrv_backup_top_drop(s->backup_top);
>>>>    }
>>>
>>> [...]
>>>
>>>> @@ -386,21 +319,45 @@ static int coroutine_fn 
>>>> backup_run_incremental(BackupBlockJob *job)
>>>>        bool error_is_read;
>>>>        int64_t offset;
>>>>        HBitmapIter hbi;
>>>> +    void *lock = NULL;
>>>>    
>>>>        hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
>>>> -    while ((offset = hbitmap_iter_next(&hbi)) != -1) {
>>>> +    while (hbitmap_count(job->copy_bitmap)) {
>>>> +        offset = hbitmap_iter_next(&hbi);
>>>> +        if (offset == -1) {
>>>> +            /*
>>>> +             * we may have skipped some clusters, which were handled by
>>>> +             * backup-top, but failed and finished by returning error to
>>>> +             * the guest and set dirty bit back.
>>>> +             */
>>>> +            hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
>>>> +            offset = hbitmap_iter_next(&hbi);
>>>> +            assert(offset);
>>>
>>> I think you want to assert "offset >= 0".
>>>
>>>> +        }
>>>> +
>>>> +        lock = bdrv_co_try_lock(job->source, offset, job->cluster_size);
>>>> +        /*
>>>> +         * Dirty bit is set, which means that there are no in-flight
>>>> +         * write requests on this area. We must succeed.
>>>> +         */
>>>> +        assert(lock);
>>>
>>> I'm not sure that is true right now, but more on that below in backup_run().
>>>
>>>> +
>>>>            do {
>>>>                if (yield_and_check(job)) {
>>>> +                bdrv_co_unlock(lock);
>>>>                    return 0;
>>>>                }
>>>> -            ret = backup_do_cow(job, offset,
>>>> -                                job->cluster_size, &error_is_read, false);
>>>> +            ret = backup_do_cow(job, offset, job->cluster_size, 
>>>> &error_is_read);
>>>>                if (ret < 0 && backup_error_action(job, error_is_read, 
>>>> -ret) ==
>>>>                               BLOCK_ERROR_ACTION_REPORT)
>>>>                {
>>>> +                bdrv_co_unlock(lock);
>>>>                    return ret;
>>>>                }
>>>>            } while (ret < 0);
>>>> +
>>>> +        bdrv_co_unlock(lock);
>>>> +        lock = NULL;
>>>
>>> This statement seems unnecessary here.
>>>
>>>>        }
>>>>    
>>>>        return 0;
>>>
>>> [...]
>>>
>>>> @@ -447,26 +402,39 @@ static int coroutine_fn backup_run(Job *job, Error 
>>>> **errp)
>>>>            hbitmap_set(s->copy_bitmap, 0, s->len);
>>>>        }
>>>>    
>>>> -    s->before_write.notify = backup_before_write_notify;
>>>> -    bdrv_add_before_write_notifier(bs, &s->before_write);
>>>> -
>>>>        if (s->sync_mode == MIRROR_SYNC_MODE_NONE) {
>>>>            /* All bits are set in copy_bitmap to allow any cluster to be 
>>>> copied.
>>>>             * This does not actually require them to be copied. */
>>>>            while (!job_is_cancelled(job)) {
>>>> -            /* Yield until the job is cancelled.  We just let our 
>>>> before_write
>>>> -             * notify callback service CoW requests. */
>>>> +            /*
>>>> +             * Yield until the job is cancelled.  We just let our 
>>>> backup-top
>>>> +             * fileter driver service CbW requests.
>>>
>>> *filter
>>>
>>>> +             */
>>>>                job_yield(job);
>>>>            }
>>>>        } else if (s->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
>>>>            ret = backup_run_incremental(s);
>>>>        } else {
>>>
>>> [...]
>>>
>>>> @@ -505,8 +474,20 @@ static int coroutine_fn backup_run(Job *job, Error 
>>>> **errp)
>>>>                if (alloced < 0) {
>>>>                    ret = alloced;
>>>>                } else {
>>>> +                if (!hbitmap_get(s->copy_bitmap, offset)) {
>>>> +                    trace_backup_do_cow_skip(job, offset);
>>>> +                    continue; /* already copied */
>>>> +                }
>>>> +                if (!lock) {
>>>> +                    lock = bdrv_co_try_lock(s->source, offset, 
>>>> s->cluster_size);
>>>> +                    /*
>>>> +                     * Dirty bit is set, which means that there are no 
>>>> in-flight
>>>> +                     * write requests on this area. We must succeed.
>>>> +                     */
>>>> +                    assert(lock);
>>>
>>> What if I have a different parent node for the source that issues
>>> concurrent writes?  This used to work fine because the before_write
>>> notifier would still work.  After this patch, that would be broken
>>> because those writes would not cause a CbW.
>>
>> But how could you have this different parent node? After appending the
>> filter, there should not be such nodes.
> 
> Unless you append them afterwards:
> 
>> And I think that during backup it should be forbidden to append new
>> parents to the source that bypass the filter, as that definitely breaks
>> what the filter does.
> 
> Agreed, but then this needs to be implemented.
> 
>> And it applies to other block jobs with their filters. If we appended a
>> filter, we don't want someone else to write around our filter.
>>
>>>
>>> That's not so bad because we just have to make sure that all writes go
>>> through the backup-top node.  That would make this assertion valid
>>> again, too.  But that means we cannot share PERM_WRITE; see [1].
>>
>> But we don't share PERM_WRITE on source in backup_top, only on target.
> 
> Are you sure?  The job itself shares it, and the filter shares it, too,
> as far as I can see.  It uses bdrv_filter_default_perms(), and that does
> seem to share PERM_WRITE.

And in bdrv_filter_default_perms it does "*nshared = *nshared | BLK_PERM_WRITE"
only for child_file, which is the target. The source is child_backing.

> 
>>>
>>>> +                }
>>>>                    ret = backup_do_cow(s, offset, s->cluster_size,
>>>> -                                    &error_is_read, false);
>>>> +                                    &error_is_read);
>>>>                }
>>>>                if (ret < 0) {
>>>>                    /* Depending on error action, fail now or retry cluster 
>>>> */
>>>> @@ -516,17 +497,34 @@ static int coroutine_fn backup_run(Job *job, Error 
>>>> **errp)
>>>>                        break;
>>>>                    } else {
>>>>                        offset -= s->cluster_size;
>>>> +                    retry = true;
>>>>                        continue;
>>>>                    }
>>>>                }
>>>>            }
>>>> +        if (lock) {
>>>> +            bdrv_co_unlock(lock);
>>>> +            lock = NULL;
>>>> +        }
>>>> +        if (ret == 0 && !job_is_cancelled(job) &&
>>>> +            hbitmap_count(s->copy_bitmap))
>>>> +        {
>>>> +            /*
>>>> +             * we may have skipped some clusters, which were handled by
>>>> +             * backup-top, but failed and finished by returning error to
>>>> +             * the guest and set dirty bit back.
>>>
>>> So it's a matter of a race?
>>>
>>>> +             */
>>>> +            goto iteration;
>>>> +        }
>>>
>>> Why not wrap everything in a do {} while (ret == 0 && !job_is...)
>>> instead?  Because it would mean reindenting everything?
>>
>> Don't remember, but I assume so, yes. And this code is anyway "to be
>> refactored": I want FULL/TOP/INCREMENTAL to all go through (mostly) the
>> same code path.
> 
> Hm, well, if you want to refactor it later anyway...  But I don't like
> gotos that go backwards very much, unless there is a good reason to have
> them (and there isn't here).


Ok, not a real problem. Let's go on with do-while.


> 
>>>>        }
>>>>    
>>>> -    notifier_with_return_remove(&s->before_write);
>>>> +    /* wait pending CBW operations in backup-top */
>>>> +    bdrv_drain(s->backup_top);
>>>>    
>>>> -    /* wait until pending backup_do_cow() calls have completed */
>>>> -    qemu_co_rwlock_wrlock(&s->flush_rwlock);
>>>> -    qemu_co_rwlock_unlock(&s->flush_rwlock);
>>>> +    backup_top_progress = bdrv_backup_top_progress(s->backup_top);
>>>> +    job_progress_update(job, ret + backup_top_progress -
>>>
>>> Why the "ret"?
>>
>> oops, it looks like a copy-paste bug ("ret" is reasonable in backup_do_cow())
>>
>>>
>>>> +                        s->backup_top_progress);
>>>> +    s->backup_top_progress = backup_top_progress;
>>>
>>> So the backup-top progress is ignored during basically all of the block
>>> job until it suddenly jumps to 100 % completion?  That doesn't sound ideal.
>>>
>>> Or did you mean to put this into the for () loop of MODE_TOP/MODE_FULL?
>>>    (And the while() loop of MODE_NONE)
>>
>>
>> It is done in backup_do_cow(), so FULL and TOP are covered.
>>
>> But you are right that MODE_NONE seems to have a problem about it.. And just 
>> updating it
>> in a while loop would not work, as I doubt that job_yield will return until 
>> job finish
>> or user interaction like pause/continue/cancel..
>>
>> So now, it looks better to call job_progress_update() from backup_top 
>> directly, and drop
>> this hack.
> 
> Hmmm...  I don't think job_*() calls belong in backup_top.  How about
> adding a callback to bdrv_backup_top_append()?

Ok for me at least as a temporary step.

> 
>>>>    
>>>>        return ret;
>>>>    }
>>>> @@ -563,6 +561,9 @@ BlockJob *backup_job_create(const char *job_id, 
>>>> BlockDriverState *bs,
>>>>        int ret;
>>>>        int64_t cluster_size;
>>>>        HBitmap *copy_bitmap = NULL;
>>>> +    BlockDriverState *backup_top = NULL;
>>>> +    uint64_t all_except_resize = BLK_PERM_CONSISTENT_READ | 
>>>> BLK_PERM_WRITE |
>>>> +                                 BLK_PERM_WRITE_UNCHANGED | 
>>>> BLK_PERM_GRAPH_MOD;
>>>
>>> BLK_PERM_ALL & ~BLK_PERM_RESIZE?
>>>
>>> [1] But we probably do not want to share either PERM_WRITE or
>>> PERM_WRITE_UNCHANGED because during the duration of the backup,
>>> everything should go through the backup-top filter (not sure about
>>> PERM_WRITE_UNCHANGED right now).  Or is that something that the filter
>>> node should enforce in backup_top_child_perm()?
>>
>> It's not the shared perm of backup_top, it's the shared perm of the
>> block job's common.blk, which is used only for "job->len is fixed, so we
>> can't allow resize", so this part is not changed.
>>
>> So yes, the problem you mean by [1] is about backup_top_child_perm(),
>> where we share PERM_WRITE. And as the comment there describes, we must
>> share this write perm, otherwise we break guest writes.
> 
> For the target, yes, but the problem is sharing it on the source.
> 
>> We share PERM_WRITE in backup_top to force its target child to share
>> PERM_WRITE on its backing, as the backing of the target is the source.
>>
>> But again, we share PERM_WRITE only on the target, and it is shared in
>> the current code too.
> 
> I'm not so sure whether PERM_WRITE is shared only on the target.

Only on target, as child_file is target.

> 
>>>
>>>>    
>>>>        assert(bs);
>>>>        assert(target);
>>>> @@ -655,25 +656,31 @@ BlockJob *backup_job_create(const char *job_id, 
>>>> BlockDriverState *bs,
>>>>    
>>>>        copy_bitmap = hbitmap_alloc(len, ctz32(cluster_size));
>>>>    
>>>> -    /* job->len is fixed, so we can't allow resize */
>>>> -    job = block_job_create(job_id, &backup_job_driver, txn, bs,
>>>> -                           BLK_PERM_CONSISTENT_READ,
>>>> -                           BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>>>> -                           BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD,
>>>> -                           speed, creation_flags, cb, opaque, errp);
>>>> -    if (!job) {
>>>> +    /*
>>>> +     * bdrv_get_device_name will not help to find device name starting 
>>>> from
>>>> +     * @bs after backup-top append,
>>>
>>> Why not?  Since backup-top is appended, shouldn't all parents of @bs be
>>> parents of @backup_top then?  (Making bdrv_get_parent_name() return the
>>> same result)
>>
>> bdrv_get_device_name finally goes through role->get_name, and only the
>> root role has this handler. After the append we'll have the backing role
>> instead of the root role.
> 
> Ah, I see, I asked the wrong question.
> 
> Why is block_job_create() called on bs and not on backup_top?  mirror
> calls it on mirror_top_bs.

Good question. I don't exactly remember; maybe there were more troubles with
permissions or something. So I'll have to try it again.

What is more beneficial?

My current approach is that the job and the filter are two sibling users of
the source node; they do the copying, and they are synchronized. In this way
it is better to read from the source directly, so as not to create extra
intersections between the job and the filter.

On the other hand, if we read through the filter, we should probably do the
whole copy operation through the filter.

What is the difference between a guest read and a backup-job read, from the
filter's point of view? I think:

For a guest read, the filter MUST read (as we must handle the guest
request), and then, if we don't have too many in-flight requests, the
ram-cache is not full, etc., we can handle the already-read data in terms of
backup, i.e. copy it somewhere. Or we can drop it, if we can't handle it at
the moment.

For a job read, we MAY even skip the read, postponing the job request if our
queues are full.

So:

Guest read: MUST read, MAY backup
Job read: MAY read and backup

So, reading through the filter offers the possibility of a common code path
plus native prioritization of copy operations. This will of course need more
refactoring of backup, and may be done as a separate step, but I definitely
have to at least try to create the job above the filter.

> 
>>>>                                        so let's calculate job_id before. Do
>>>> +     * it in the same way like block_job_create
>>>> +     */
>>>> +    if (job_id == NULL && !(creation_flags & JOB_INTERNAL)) {
>>>> +        job_id = bdrv_get_device_name(bs);
>>>> +    }
>>>> +
>>>> +    backup_top = bdrv_backup_top_append(bs, target, copy_bitmap, errp);
>>>> +    if (!backup_top) {
>>>>            goto error;
>>>>        }
>>>>    
>>>> -    /* The target must match the source in size, so no resize here either 
>>>> */
>>>> -    job->target = blk_new(BLK_PERM_WRITE,
>>>> -                          BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>>>> -                          BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD);
>>>> -    ret = blk_insert_bs(job->target, target, errp);
>>>> -    if (ret < 0) {
>>>> +    /* job->len is fixed, so we can't allow resize */
>>>> +    job = block_job_create(job_id, &backup_job_driver, txn, bs, 0,
>>>
>>> Is there a reason you dropped PERM_CONSISTENT_READ here?
>>
>> Because we don't use this blk for reads now; we read through the
>> backup_top child.
> 
> Makes sense.
> 
>>>> +                           all_except_resize, speed, creation_flags,
>>>> +                           cb, opaque, errp);
>>>> +    if (!job) {
>>>>            goto error;
>>>>        }
>>>>    
>>>> +    job->source = backup_top->backing;
>>>> +    job->target = ((BDRVBackupTopState *)backup_top->opaque)->target;
>>>
>>> This looks really ugly.  I think as long as the block job performs
>>> writes itself, it should use its own BlockBackend.
>>
>> They are not BlockBackends, they are BdrvChildren.
> 
> Exactly, which is what I don't like.  They are children of a node that
> is implemented in a different file, it looks weird to use them here.
> 
>> It was Kevin's idea to reuse the filter's children in the backup job:
>>
>> https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg01017.html
> 
> It's still ugly if backup_top is in a different file.  Well, maybe just
> to me.
> 
>> Hmm, and this is also why I need PERM_WRITE in backup_top: to write to
>> the target.
>>
>>>
>>> Alternatively, I think it would make sense for the backup-top filter to
>>> offer functions that this job can use to issue writes to the target.
>>> Then the backup job would no longer need direct access to the target as
>>> a BdrvChild.
> 
> So what would be the problem with this?
> 

I have no specific arguments against; I'll try. Kevin, do you have comments
on this?

-- 
Best regards,
Vladimir
