Re: [RFC PATCH 00/23] Add subcluster allocation to qcow2
From: Vladimir Sementsov-Ogievskiy
Subject: Re: [RFC PATCH 00/23] Add subcluster allocation to qcow2
Date: Wed, 23 Oct 2019 10:39:14 +0000
Hi!
This is very interesting! Could you please push a branch for us to look at?
The patches don't apply to current master :(
15.10.2019 18:23, Alberto Garcia wrote:
> Hi,
>
> this series adds a new feature to the qcow2 on-disk format called
> "Extended L2 Entries", which allows us to do subcluster allocation.
>
> This cover letter explains the reasons behind this proposal, the
> changes to the on-disk format, test results and pending work. If you
> are curious you can also have a look at previous discussions about
> this feature:
>
> https://lists.gnu.org/archive/html/qemu-block/2017-04/msg00178.html
> https://lists.gnu.org/archive/html/qemu-block/2019-06/msg01155.html
>
> This is the first proper version of the patches, and I believe that
> the implementation is complete. However, since I'm proposing a change
> to the on-disk format I'm labeling this as RFC because I'm expecting
> some debate. I'll remove the RFC tag and add more tests in future
> revisions.
>
> === Problem ===
>
> A qcow2 image is divided into units of constant size called clusters,
> and among other things it contains metadata that maps guest addresses
> to host addresses (the so-called L1 and L2 tables).
>
> There are two basic problems that result from this:
>
> 1) Reading from or writing to a qcow2 image involves reading the
>    corresponding entry on the L2 table that maps the guest address to
>    the host address. This is very slow because it involves two I/O
>    operations: one on the L2 table and the other one on the actual
>    data cluster.
>
> 2) A cluster is the smallest unit of allocation. Therefore writing a
>    mere 512 bytes to an empty disk requires allocating a complete
>    cluster and filling it with zeroes (or with data from the backing
>    image if there is one). This wastes more disk space and also has a
>    negative impact on I/O.
>
> Problem (1) can be solved by caching the L2 tables in memory. The
> maximum amount of disk space used by L2 tables depends on the virtual
> disk size and the cluster size:
>
> max_l2_size = virtual_disk_size * 8 / cluster_size
>
> Because of this, the only way to reduce the size of the L2 tables is
> by increasing the cluster size (which can be any power of two between
> 512 bytes and 2 MB). But then we hit problem (2): I/O is slower and
> more disk space is wasted.
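A quick sketch of the formula above, to get a feeling for the numbers. The function name and the 1 TiB disk size are only illustrative assumptions; the factor 8 is the size in bytes of one standard L2 entry.

```python
def max_l2_size(virtual_disk_size: int, cluster_size: int) -> int:
    """Upper bound (in bytes) on the space used by L2 tables.

    One 8-byte L2 entry is needed per data cluster.
    """
    # Cluster sizes are powers of two between 512 bytes and 2 MB.
    is_power_of_two = cluster_size > 0 and cluster_size & (cluster_size - 1) == 0
    if not is_power_of_two or not 512 <= cluster_size <= 2 * 1024 * 1024:
        raise ValueError("invalid cluster size: %d" % cluster_size)
    return virtual_disk_size * 8 // cluster_size


TIB = 1 << 40  # illustrative virtual disk size

for cluster_size in (64 * 1024, 2 * 1024 * 1024):
    l2_mib = max_l2_size(TIB, cluster_size) // (1024 * 1024)
    print("cluster %7d bytes -> up to %d MiB of L2 tables" % (cluster_size, l2_mib))
```

For a 1 TiB disk this gives 128 MiB of L2 tables with 64 KB clusters and 4 MiB with 2 MB clusters, which is the trade-off described in the paragraph above.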
>
> === The proposal ===
>
> The proposal is to extend the qcow2 format by allowing subcluster
> allocation. The on-disk format remains essentially the same, except
> that each data cluster is internally divided into 32 subclusters of
> equal size.
>
> The way it works in practice is with a new optional feature called
> "Extended L2 Entries", that needs to be enabled when an image is
> created. With this, each entry on an L2 table is accompanied by a
> bitmap indicating the allocation state of each one of the subclusters
> for that cluster. The size of an L2 entry doubles from 64 to 128 bits.
>
> Other than L2 entries, all other data structures remain unchanged, but
> for data clusters the smallest unit of allocation is now the
> subcluster. Reference counting is still at the cluster level, because
> there is no way to reference individual subclusters. Copy-on-write on
> internal snapshots needs to copy complete clusters, so that scenario
> would not benefit from this change.
>
> I see two main use cases for this feature:
>
> a) The qcow2 image is not too large / the L2 cache is not a problem,
>    but you want to increase the allocation performance. In this case
>    you can have a 128KB cluster with 4KB subclusters (with 4KB being a
>    common block size in ext4 and other filesystems)
>
> b) The qcow2 image is very large and you want to save metadata space
>    in order to have a smaller L2 cache. In this case you can go for
>    the maximum cluster size (2MB) but you want to have smaller
>    subclusters to increase the allocation performance and optimize the
>    disk usage.
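The two use cases can be put into numbers with a small sketch. It assumes what the cover letter states (32 subclusters per cluster, L2 entries growing from 8 to 16 bytes); the helper names and the 1 TiB disk size are hypothetical, and the doubled L2 size is my inference from the max_l2_size formula, not something measured.

```python
SUBCLUSTERS_PER_CLUSTER = 32  # fixed in this proposal
L2_ENTRY_BYTES = 8            # standard L2 entry (64 bits)
EXT_L2_ENTRY_BYTES = 16       # extended L2 entry (64-bit entry + 64-bit bitmap)


def subcluster_size(cluster_size: int) -> int:
    """Size of one subcluster for a given cluster size."""
    return cluster_size // SUBCLUSTERS_PER_CLUSTER


def max_l2_size(virtual_disk_size: int, cluster_size: int,
                extended_l2: bool = False) -> int:
    """Upper bound on L2 table bytes, with or without extended entries."""
    entry_bytes = EXT_L2_ENTRY_BYTES if extended_l2 else L2_ENTRY_BYTES
    return virtual_disk_size * entry_bytes // cluster_size


KIB, MIB, TIB = 1 << 10, 1 << 20, 1 << 40

# Use case (a): 128 KB clusters give 4 KB subclusters.
print(subcluster_size(128 * KIB) // KIB, "KB subclusters")

# Use case (b): 2 MB clusters give 64 KB subclusters, with far less
# L2 metadata than plain 64 KB clusters on the same (illustrative) disk.
print(subcluster_size(2 * MIB) // KIB, "KB subclusters")
print(max_l2_size(TIB, 2 * MIB, extended_l2=True) // MIB, "MiB of L2 (2 MB, extended)")
print(max_l2_size(TIB, 64 * KIB) // MIB, "MiB of L2 (64 KB, standard)")
```

So in case (b) the allocation granularity is the same as with plain 64 KB clusters, while the L2 metadata for a 1 TiB image would be 8 MiB instead of 128 MiB.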
>
> === Changes to the on-disk format ===
>
> An L2 entry is 64 bits wide, with this format (for uncompressed
> clusters):
>
> 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> **<----> <--------------------------------------------------><------->*
>   Rsrved             host cluster offset of data             Reserved
>  (6 bits)                     (47 bits)                      (8 bits)
>
> bit 63: refcount == 1 (QCOW_OFLAG_COPIED)
> bit 62: compressed = 1 (QCOW_OFLAG_COMPRESSED)
> bit 0: all zeros (QCOW_OFLAG_ZERO)
>
> If Extended L2 Entries are enabled, bit 0 becomes reserved and must be
> unset, and this 64-bit bitmap follows the entry:
>
> 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <---------------------------------> <--------------------------------->
>      subcluster reads as zeros            subcluster is allocated
>              (32 bits)                           (32 bits)
>
> All this applies to uncompressed clusters. Compressed clusters are not
> divided into subclusters, the cluster descriptor remains exactly the
> same, and the 64-bit bitmap is not used (i.e. all bits are always 0).
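To check my reading of the two layouts above, here is a minimal decoding sketch. The flag positions and the 47-bit offset field (bits 9 to 55) are taken from the diagrams. The mapping of bit i to subcluster i in each half of the bitmap is an assumption on my side, since the cover letter does not spell out the bit order, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

QCOW_OFLAG_COPIED = 1 << 63      # bit 63: refcount == 1
QCOW_OFLAG_COMPRESSED = 1 << 62  # bit 62: compressed cluster
QCOW_OFLAG_ZERO = 1 << 0         # bit 0: all zeros (reserved with extended L2)
HOST_OFFSET_MASK = ((1 << 47) - 1) << 9  # bits 9..55: host cluster offset


def decode_l2_entry(entry: int) -> Dict[str, object]:
    """Split a 64-bit uncompressed-cluster L2 entry into its fields."""
    return {
        "copied": bool(entry & QCOW_OFLAG_COPIED),
        "compressed": bool(entry & QCOW_OFLAG_COMPRESSED),
        "zero": bool(entry & QCOW_OFLAG_ZERO),
        "host_offset": entry & HOST_OFFSET_MASK,
    }


def decode_l2_bitmap(bitmap: int) -> Tuple[List[bool], List[bool]]:
    """Return (allocated, reads_as_zeros), one flag per subcluster.

    Assumption: bit i of the lower half describes subcluster i, and
    bit 32 + i is the "reads as zeros" flag of the same subcluster.
    """
    allocated = [bool((bitmap >> i) & 1) for i in range(32)]
    reads_as_zeros = [bool((bitmap >> (32 + i)) & 1) for i in range(32)]
    return allocated, reads_as_zeros


entry = QCOW_OFLAG_COPIED | 0x50000
fields = decode_l2_entry(entry)
print(fields["copied"], hex(fields["host_offset"]))

allocated, zeros = decode_l2_bitmap(0b101 | (1 << 33))
print([i for i, a in enumerate(allocated) if a], [i for i, z in enumerate(zeros) if z])
```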
>
> === Test results ===
>
> I made all tests on an SSD drive, writing to an empty qcow2 image with
> a fully populated 40GB backing image, performing random writes using
> fio with a block size of 4KB. I ran the tests with all available
> cluster sizes starting from 4KB.
>
> It's important to point out that once a cluster has been completely
> allocated then having subclusters offers no performance benefit. For
> this reason the size of the image for these tests (40GB) was chosen to
> be large enough to guarantee that there are always new clusters being
> allocated. This is therefore a worst-case scenario (or best-case for
> this feature, if you want).
>
> Subcluster sizes are in brackets:
>
> |-----------------+----------------+-----------------|
> | Cluster size    | subclusters=on | subclusters=off |
> |-----------------+----------------+-----------------|
> |    4 KB ( N/A ) |            N/A |         95 IOPS |
> |    8 KB ( N/A ) |            N/A |        599 IOPS |
> |   16 KB (512 B) |      4129 IOPS |       3597 IOPS |
> |   32 KB (1 KB)  |     11255 IOPS |       2642 IOPS |
> |   64 KB (2 KB)  |     13341 IOPS |       1671 IOPS |
> |  128 KB (4 KB)  |     12391 IOPS |        870 IOPS |
> |  256 KB (8 KB)  |      9645 IOPS |        566 IOPS |
> |  512 KB (16 KB) |      4960 IOPS |        359 IOPS |
> | 1024 KB (32 KB) |      2732 IOPS |        215 IOPS |
> | 2048 KB (64 KB) |      1630 IOPS |        214 IOPS |
> |-----------------+----------------+-----------------|
>
> Here are the same tests, but without any backing image:
>
> |-----------------+----------------+-----------------|
> | Cluster size    | subclusters=on | subclusters=off |
> |-----------------+----------------+-----------------|
> |    4 KB ( N/A ) |            N/A |         93 IOPS |
> |    8 KB ( N/A ) |            N/A |        539 IOPS |
> |   16 KB (512 B) |      4174 IOPS |       7598 IOPS |
> |   32 KB (1 KB)  |     11326 IOPS |      11957 IOPS |
> |   64 KB (2 KB)  |     13516 IOPS |      13375 IOPS |
> |  128 KB (4 KB)  |     12435 IOPS |      13274 IOPS |
> |  256 KB (8 KB)  |     12071 IOPS |      14174 IOPS |
> |  512 KB (16 KB) |     12169 IOPS |      14343 IOPS |
> | 1024 KB (32 KB) |     12307 IOPS |      14622 IOPS |
> | 2048 KB (64 KB) |     12784 IOPS |      14574 IOPS |
> |-----------------+----------------+-----------------|
>
> Some comments about the results:
>
> - The smallest allowed cluster size for an image with subclusters is
>   16 KB (in this case the subcluster size is 512 bytes), hence the
>   missing values in the 4 KB and 8 KB rows.
>
> - In images with a backing file: allocation is much faster when
>   subclusters are enabled. As expected, images with a cluster size of
>   64KB perform similarly to images with a subcluster size of 64KB. When
>   there is no copy-on-write involved (subcluster size <= 4KB) then the
>   maximum performance is achieved.
>
> - In images without a backing file: Since commit c8bb23cbdb when empty
>   clusters are allocated for the first time they are filled with
>   zeroes using an efficient method (typically fallocate() with
>   FALLOC_FL_ZERO_RANGE). This is so fast that having subclusters here
>   is actually a bit slower in most cases (although it still saves disk
>   space).
>
> - The 16 KB cluster / 512 byte subcluster case is quite slow.
>   I haven't debugged this but I suspect that this is because new
>   clusters need to be allocated all the time, and also L2 and refcount
>   tables are very small and need to grow all the time. The same pattern
>   can be seen in images without subclusters.
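A rough model of the copy-on-write remark above, to see why 4 KB is the threshold for the 4 KB random writes used in the tests. It assumes the write lands on a previously unallocated region and that everything in the touched allocation units outside the written range has to be copied from the backing image. cow_bytes() is a hypothetical helper, not a QEMU function, and real behavior may differ in details.

```python
def cow_bytes(offset: int, length: int, alloc_unit: int) -> int:
    """Bytes to copy from the backing image for one write.

    alloc_unit is the smallest unit of allocation: the cluster size
    without subclusters, or the subcluster size with them.
    """
    # Round the written range outwards to allocation unit boundaries.
    start = (offset // alloc_unit) * alloc_unit
    end = -(-(offset + length) // alloc_unit) * alloc_unit
    return (end - start) - length


KIB = 1024

# 4 KB write, 64 KB allocation unit: either a 64 KB cluster without
# subclusters or a 2048 KB cluster with 64 KB subclusters.
print(cow_bytes(0, 4 * KIB, 64 * KIB) // KIB, "KB copied")

# 4 KB aligned write, 4 KB subclusters (128 KB cluster): nothing to copy.
print(cow_bytes(8 * KIB, 4 * KIB, 4 * KIB) // KIB, "KB copied")
```

In this model the 64 KB cluster case and the 2048 KB / 64 KB subcluster case copy the same amount of data, which would be consistent with the similar numbers in the first table (1671 vs 1630 IOPS).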
>
> === To do ===
>
> A couple of things are missing from this series:
>
> - The ability to efficiently zero individual subclusters using
>   qcow2_co_pwrite_zeroes(). At the moment only full clusters can be
>   zeroed with this method.
>
> - Alternatively we could get rid of the individual "all zeroes" bits
>   altogether and have 64 subclusters per cluster. We would still have
>   the QCOW_OFLAG_ZERO bit in the standard cluster descriptor.
>
> - The number of subclusters per cluster is always 32. It would be
>   trivial to allow configuring this, but I don't see any use case.
>
> - Tests: I have a few written that I'll add in future revisions of
>   this series.
>
> - handle_alloc_space() works at the subcluster level. That is, if you
>   have an unallocated 2MB cluster with 64KB subclusters, no backing
>   image and you write 4KB of data, QEMU won't write zeroes to the
>   affected subcluster(s) and will use handle_alloc_space() instead.
>   The other subclusters won't be touched and will remain unallocated.
>   This behavior is consistent with how subclusters work and saves disk
>   space, but offers slightly lower performance (see test results
>   above). Theoretically we could offer a setting to configure this,
>   but I'm not convinced that this is very useful.
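To make the "affected subcluster(s)" in the last item concrete, a small sketch of which subclusters a write touches. The helper is hypothetical (the series has its own offset_to_sc_index(), whose implementation I have not seen), and it assumes 32 equally sized subclusters and a write that stays within one cluster.

```python
from typing import List


def touched_subclusters(offset: int, length: int, cluster_size: int,
                        subclusters_per_cluster: int = 32) -> List[int]:
    """Indexes (within the cluster) of the subclusters a write touches."""
    sc_size = cluster_size // subclusters_per_cluster
    in_cluster = offset % cluster_size  # offset relative to the cluster start
    if length <= 0 or in_cluster + length > cluster_size:
        raise ValueError("write must be non-empty and fit in one cluster")
    first = in_cluster // sc_size
    last = (in_cluster + length - 1) // sc_size
    return list(range(first, last + 1))


KIB, MIB = 1 << 10, 1 << 20

# 2 MB cluster, 64 KB subclusters, 4 KB write at offset 100 KB:
print(touched_subclusters(100 * KIB, 4 * KIB, 2 * MIB))

# The same write at offset 62 KB crosses a subcluster boundary:
print(touched_subclusters(62 * KIB, 4 * KIB, 2 * MIB))
```

The first write touches only subcluster 1 and the second touches subclusters 0 and 1; as described above, the remaining subclusters of the cluster would stay unallocated.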
>
> ===========================
>
> As usual, feedback is welcome,
>
> Berto
>
> Alberto Garcia (23):
> qcow2: Add calculate_l2_meta()
> qcow2: Split cluster_needs_cow() out of count_cow_clusters()
> qcow2: Process QCOW2_CLUSTER_ZERO_ALLOC clusters in handle_copied()
> qcow2: Add get_l2_entry() and set_l2_entry()
> qcow2: Document the Extended L2 Entries feature
> qcow2: Add dummy has_subclusters() function
> qcow2: Add subcluster-related fields to BDRVQcow2State
> qcow2: Add offset_to_sc_index()
> qcow2: Add l2_entry_size()
> qcow2: Update get/set_l2_entry() and add get/set_l2_bitmap()
> qcow2: Add qcow2_get_subcluster_type()
> qcow2: Handle QCOW2_CLUSTER_UNALLOCATED_SUBCLUSTER
> qcow2: Add subcluster support to calculate_l2_meta()
> qcow2: Add subcluster support to qcow2_get_cluster_offset()
> qcow2: Add subcluster support to zero_in_l2_slice()
> qcow2: Add subcluster support to discard_in_l2_slice()
> qcow2: Add subcluster support to check_refcounts_l2()
> qcow2: Add subcluster support to expand_zero_clusters_in_l1()
> qcow2: Fix offset calculation in handle_dependencies()
> qcow2: Update L2 bitmap in qcow2_alloc_cluster_link_l2()
> qcow2: Add subcluster support to handle_alloc_space()
> qcow2: Restrict qcow2_co_pwrite_zeroes() to full clusters only
> qcow2: Add the 'extended_l2' option and the QCOW2_INCOMPAT_EXTL2 bit
>
> block/qcow2-cluster.c | 547 ++++++++++++++++++++-----------
> block/qcow2-refcount.c | 38 ++-
> block/qcow2.c | 83 ++++-
> block/qcow2.h | 121 ++++++-
> docs/interop/qcow2.txt | 68 +++-
> docs/qcow2-cache.txt | 19 +-
> include/block/block_int.h | 1 +
> qapi/block-core.json | 2 +
> tests/qemu-iotests/031.out | 8 +-
> tests/qemu-iotests/036.out | 4 +-
> tests/qemu-iotests/049.out | 102 +++---
> tests/qemu-iotests/060.out | 1 +
> tests/qemu-iotests/061.out | 20 +-
> tests/qemu-iotests/065 | 18 +-
> tests/qemu-iotests/082.out | 48 ++-
> tests/qemu-iotests/085.out | 38 +--
> tests/qemu-iotests/144.out | 4 +-
> tests/qemu-iotests/182.out | 2 +-
> tests/qemu-iotests/185.out | 8 +-
> tests/qemu-iotests/198.out | 2 +
> tests/qemu-iotests/206.out | 4 +
> tests/qemu-iotests/242.out | 5 +
> tests/qemu-iotests/255.out | 8 +-
> tests/qemu-iotests/common.filter | 1 +
> 24 files changed, 817 insertions(+), 335 deletions(-)
>
--
Best regards,
Vladimir