qemu-devel

Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback


From: Jason Dillaman
Subject: Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback
Date: Tue, 9 Jul 2019 21:42:14 -0400

On Tue, Jul 9, 2019 at 11:32 AM Max Reitz <address@hidden> wrote:
>
> On 09.07.19 15:09, Stefano Garzarella wrote:
> > On Tue, Jul 09, 2019 at 08:55:19AM -0400, Jason Dillaman wrote:
> >> On Tue, Jul 9, 2019 at 5:45 AM Max Reitz <address@hidden> wrote:
> >>>
> >>> On 09.07.19 10:55, Max Reitz wrote:
> >>>> On 09.07.19 05:08, Jason Dillaman wrote:
> >>>>> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <address@hidden> 
> >>>>> wrote:
> >>>>>>
> >>>>>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
> >>>>>>> On 05.07.19 11:32, Stefano Garzarella wrote:
> >>>>>>>> This patch allows 'qemu-img info' to show the 'disk size' for
> >>>>>>>> the RBD images that have the fast-diff feature enabled.
> >>>>>>>>
> >>>>>>>> If this feature is enabled, we use the rbd_diff_iterate2() API
> >>>>>>>> to calculate the allocated size for the image.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Stefano Garzarella <address@hidden>
> >>>>>>>> ---
> >>>>>>>> v3:
> >>>>>>>>   - return -ENOTSUP instead of -1 when fast-diff is not available
> >>>>>>>>     [John, Jason]
> >>>>>>>> v2:
> >>>>>>>>   - calculate the actual usage only if the fast-diff feature is
> >>>>>>>>     enabled [Jason]
> >>>>>>>> ---
> >>>>>>>>  block/rbd.c | 54 
> >>>>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>>  1 file changed, 54 insertions(+)
> >>>>>>>
> >>>>>>> Well, the librbd documentation is non-existent as always, but while
> >>>>>>> googling, I at least found that libvirt has exactly the same code.
> >>>>>>> So I suppose it must be quite correct, then.
> >>>>>>>
> >>>>>>
> >>>>>> While I wrote this code I took a look at the libvirt implementation
> >>>>>> and also at the "rbd" tool in the ceph repository:
> >>>>>> compute_image_disk_usage() in src/tools/rbd/action/DiskUsage.cc
> >>>>>>
> >>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
> >>>>>>>> index 59757b3120..b6bed683e5 100644
> >>>>>>>> --- a/block/rbd.c
> >>>>>>>> +++ b/block/rbd.c
> >>>>>>>> @@ -1084,6 +1084,59 @@ static int64_t 
> >>>>>>>> qemu_rbd_getlength(BlockDriverState *bs)
> >>>>>>>>      return info.size;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int 
> >>>>>>>> exists,
> >>>>>>>> +                                 void *arg)
> >>>>>>>> +{
> >>>>>>>> +    int64_t *alloc_size = (int64_t *) arg;
> >>>>>>>> +
> >>>>>>>> +    if (exists) {
> >>>>>>>> +        (*alloc_size) += len;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState 
> >>>>>>>> *bs)
> >>>>>>>> +{
> >>>>>>>> +    BDRVRBDState *s = bs->opaque;
> >>>>>>>> +    uint64_t flags, features;
> >>>>>>>> +    int64_t alloc_size = 0;
> >>>>>>>> +    int r;
> >>>>>>>> +
> >>>>>>>> +    r = rbd_get_flags(s->image, &flags);
> >>>>>>>> +    if (r < 0) {
> >>>>>>>> +        return r;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    r = rbd_get_features(s->image, &features);
> >>>>>>>> +    if (r < 0) {
> >>>>>>>> +        return r;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    /*
> >>>>>>>> +     * We use rbd_diff_iterate2() only if the RBD image has the
> >>>>>>>> +     * fast-diff feature enabled. If it is disabled,
> >>>>>>>> +     * rbd_diff_iterate2() could be very slow on a big image.
> >>>>>>>> +     */
> >>>>>>>> +    if (!(features & RBD_FEATURE_FAST_DIFF) ||
> >>>>>>>> +        (flags & RBD_FLAG_FAST_DIFF_INVALID)) {
> >>>>>>>> +        return -ENOTSUP;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    /*
> >>>>>>>> +     * If the source snapshot name is NULL, rbd_diff_iterate2()
> >>>>>>>> +     * invokes the callback on all allocated regions of the image.
> >>>>>>>> +     */
> >>>>>>>> +    r = rbd_diff_iterate2(s->image, NULL, 0,
> >>>>>>>> +                          bs->total_sectors * BDRV_SECTOR_SIZE, 0, 
> >>>>>>>> 1,
> >>>>>>>> +                          &rbd_allocated_size_cb, &alloc_size);
> >>>>>>>
> >>>>>>> But I have a question.  This is basically block_status, right?  So it
> >>>>>>> gives us information on which areas are allocated and which are not.
> >>>>>>> The result thus gives us a lower bound on the allocation size, but is 
> >>>>>>> it
> >>>>>>> really exactly the allocation size?
> >>>>>>>
> >>>>>>> There are two things I’m concerned about:
> >>>>>>>
> >>>>>>> 1. What about metadata?
> >>>>>>
> >>>>>> Good question, I don't think it includes the size used by metadata and 
> >>>>>> I
> >>>>>> don't know if there is a way to know it. I'll check better.
> >>>>>
> >>>>> It does not include the size of metadata, the "rbd_diff_iterate2"
> >>>>> function is literally just looking for touched data blocks within the
> >>>>> RBD image.
> >>>>>
> >>>>>>>
> >>>>>>> 2. If you have multiple snapshots, this will only report the overall
> >>>>>>> allocation information, right?  So say there is something like this:
> >>>>>>>
> >>>>>>> (“A” means an allocated MB, “-” is an unallocated MB)
> >>>>>>>
> >>>>>>> Snapshot 1: AAAA---
> >>>>>>> Snapshot 2: --AAAAA
> >>>>>>> Snapshot 3: -AAAA--
> >>>>>>>
> >>>>>>> I think the allocated data size is the number of As in total (13 MB).
> >>>>>>> But I suppose this API will just return 7 MB, because it looks at
> >>>>>>> everything and sees the whole image range (7 MB) as allocated.  It
> >>>>>>> doesn’t report in how many snapshots some region is allocated.
> >>>>>
> >>>>> It should return 13 dirty data blocks (multiplied by the size of the
> >>>>> data block) since when you don't provide a "from snapshot" name, it
> >>>>> will iterate from the first snapshot to the HEAD revision.
> >>>>
> >>>> Have you tested that?
> >>>>
> >>>> I’m so skeptical because the callback function interface has no way of
> >>>> distinguishing between different layers of snapshots.
> >>>>
> >>>> And also because we have the bdrv_block_status_above() function which
> >>>> just looks strikingly similar (with the difference that it does not
> >>>> invoke a callback but just returns the next allocated range).  If you
> >>>> pass base=NULL to it, it will also “interpret that as the beginning of
> >>>> time and return all allocated regions of the image” (or rather image
> >>>> chain, in our case).  But it would just return 7 MB as allocated.  (Even
> >>>> though it does in fact return layer information, i.e. where a given
> >>>> continuous chunk of data is allocated.)
> >>>>
> >>>> Sure, there is no good reason for why our interface should by chance be
> >>>> the same as librbd’s interface.  But without having tested it, the fact
> >>>> that the callback cannot detect which layer a chunk is allocated on just
> >>>> makes me wary.
> >>>
> >>> And the librbd code doesn’t alleviate my concerns.
> >>>
> >>> From what I can see, it first creates a bitmap (two bits per entry) that
> >>> covers the whole image and says which objects are allocated and which
> >>> aren’t.  Through the whole chain, that is.  So in the above example, the
> >>> bitmap would report everything as allocated.  (Or rather “updated” in
> >>> librbd’s terms.)
> >>>
> >>> Then it has this piece:
> >>>
> >>>>   uint64_t off = m_offset;
> >>>>   uint64_t left = m_length;
> >>>>
> >>>>   while (left > 0) {
> >>>>     uint64_t period_off = off - (off % period);
> >>>>     uint64_t read_len = min(period_off + period - off, left);
> >>>>
> >>>>     // map to extents
> >>>>     map<object_t,vector<ObjectExtent> > object_extents;
> >>>>     Striper::file_to_extents(cct, m_image_ctx.format_string,
> >>>>                              &m_image_ctx.layout, off, read_len, 0,
> >>>>                              object_extents, 0);
> >>>>
> >>>>     // get snap info for each object
> >>>>     for (map<object_t,vector<ObjectExtent> >::iterator p =
> >>>>            object_extents.begin();
> >>>>          p != object_extents.end(); ++p)
> >>> [...]
> >>>>           for (std::vector<ObjectExtent>::iterator q = p->second.begin();
> >>>>                q != p->second.end(); ++q) {
> >>>>             r = m_callback(off + q->offset, q->length, updated, 
> >>>> m_callback_arg);
> >>> [...]
> >>>>           }
> >>> [...]
> >>>>     }
> >>>>     left -= read_len;
> >>>>     off += read_len;
> >>>>   }
> >>>
> >>> So that iterates over the whole image (in our case, because m_offset is
> >>> 0 and m_length is the image length), then picks out a chunk of read_len
> >>> length, converts that to objects, and then iterates over all of those
> >>> objects and all of their extents.
> >>>
> >>> file_to_extents looks like it’s just an arithmetic operation.  So it
> >>> doesn’t look like that function returns extents in multiple snapshots.
> >>> It just converts a range into subranges called “objects” and “extents”
> >>> (at least that’s the way it looks to me).
> >>>
> >>> So overall, this looks awfully like it simply iterates over the whole
> >>> image and then returns allocation information gathered as a top-down
> >>> view through all of the snapshots, but not for each snapshot individually.
> >>
> >> Sorry, you're correct. The API function was originally designed to
> >> support delta extents for supporting diff exports, so while it does
> >> open each snapshot's object-map, it only returns a unioned set of
> >> delta extents. The rbd CLI's "disk-usage" action behaves how I
> >> described by counting the actual dirty block usage between snapshots.
> >
> > Max, Jason, thanks for the details!
> >
> > If you agree, I'll try to implement something similar, iterating on all
> > snapshots.
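[Editor's note: a hedged sketch of the per-snapshot iteration Stefano describes, using the public librbd C API and mirroring what the rbd CLI's "disk-usage" action does. This is untested and needs a live Ceph cluster; error handling is abbreviated, the fixed-size snapshot array is a simplification, and rbd_snap_list() ordering is assumed to be creation order.]

```c
/* Sketch only: sums dirty extents between consecutive snapshots and the
 * HEAD revision. Errors from the diff/set calls are ignored for brevity. */
#include <rbd/librbd.h>
#include <stdint.h>

static int add_len_cb(uint64_t offset, size_t len, int exists, void *arg)
{
    (void)offset;
    if (exists) {
        *(int64_t *)arg += len;
    }
    return 0;
}

static int64_t rbd_disk_usage(rbd_image_t image, uint64_t image_size)
{
    rbd_snap_info_t snaps[64];          /* simplification: bounded list */
    int max_snaps = 64;
    int64_t usage = 0;
    const char *from = NULL;
    int num = rbd_snap_list(image, snaps, &max_snaps);

    if (num < 0) {
        return num;
    }
    /* Diff each snapshot against the previous one... */
    for (int i = 0; i < num; i++) {
        rbd_snap_set(image, snaps[i].name);
        rbd_diff_iterate2(image, from, 0, snaps[i].size, 0, 1,
                          add_len_cb, &usage);
        from = snaps[i].name;
    }
    /* ...and finally the HEAD revision against the last snapshot. */
    rbd_snap_set(image, NULL);
    rbd_diff_iterate2(image, from, 0, image_size, 0, 1,
                      add_len_cb, &usage);
    rbd_snap_list_end(snaps);
    return usage;
}
```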
>
> Yes, that should work.
>
> > What about metadata?
> > I saw some APIs (e.g. rbd_metadata_list()) that allow us to get the
> > metadata.
> > An idea could be to iterate over them and sum the sizes of the keys and
> > values returned, but I'm not sure they are returned in the same format
> > as they are stored.
>
> I wouldn’t mind ignoring metadata too much.  If you can get something to
> work, even better, but I don’t consider that too important.

I don't think it's a good idea to attempt to estimate it. If there is
enough desire, we can always add a supported "disk-usage" API method
that takes into account things like the image metadata, object-map,
journaling, etc. overhead. However, I wouldn't expect it to be anywhere
near the actual block usage (unless something has gone terribly wrong
w/ journaling and it fails to trim your log).

> Max
>


-- 
Jason


