
Re: [Qemu-devel] [PATCH v2 3/3] vmdk: Add read-only support for seSparse


From: Max Reitz
Subject: Re: [Qemu-devel] [PATCH v2 3/3] vmdk: Add read-only support for seSparse snapshots
Date: Wed, 19 Jun 2019 19:12:33 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.7.0

On 05.06.19 14:17, Sam Eiderman wrote:
> Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in
> QEMU).
> 
> This format was lacking in the following:
> 
>     * Grain directory (L1) and grain table (L2) entries were 32-bit,
>       allowing access to only 2TB (slightly less) of data.
>     * The grain size (default) was 512 bytes - leading to data
>       fragmentation and many grain tables.
>     * For space reclamation purposes, it was necessary to find all the
>       grains which are not pointed to by any grain table - so a reverse
>       mapping of "offset of grain in vmdk" to "grain table" must be
>       constructed - which takes large amounts of CPU/RAM.
> 
> The format specification can be found in VMware's documentation:
> https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf
> 
> In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
> introduced: SESparse (Space Efficient).
> 
> This format fixes the above issues:
> 
>     * All entries are now 64-bit.
>     * The grain size (default) is 4KB.
>     * Grain directory and grain tables are now located at the beginning
>       of the file.
>       + seSparse format reserves space for all grain tables.
>       + Grain tables can be addressed using an index.
>       + Grains are located at the end of the file and can also be
>         addressed with an index.
>       - seSparse vmdks of large disks (64TB) have huge preallocated
>         headers - mainly due to L2 tables, even for empty snapshots.
>     * The header contains a reverse mapping ("backmap") of "offset of
>       grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
>       specifies for each grain - whether it is allocated or not.
>       Using these data structures we can implement space reclamation
>       efficiently.
>     * Due to the fact that the header now maintains two mappings:
>         * The regular one (grain directory & grain tables)
>         * A reverse one (backmap and free bitmap)
>       These data structures can lose consistency upon crash and result
>       in a corrupted VMDK.
>       Therefore, a journal is also added to the VMDK and is replayed
>       when VMware reopens the file after a crash.
> 
> Since ESXi 6.7 - SESparse is the only snapshot format available.
> 
> Unfortunately, VMware does not provide documentation regarding the new
> seSparse format.
> 
> This commit is based on black-box research of the seSparse format.
> Various in-guest block operations and their effect on the snapshot file
> were tested.
> 
> The only VMware provided source of information (regarding the underlying
> implementation) was a log file on the ESXi:
> 
>     /var/log/hostd.log
> 
> Whenever an seSparse snapshot is created, the log is populated
> with seSparse records.
> 
> Relevant log records are of the form:
> 
> [...] Const Header:
> [...]  constMagic     = 0xcafebabe
> [...]  version        = 2.1
> [...]  capacity       = 204800
> [...]  grainSize      = 8
> [...]  grainTableSize = 64
> [...]  flags          = 0
> [...] Extents:
> [...]  Header         : <1 : 1>
> [...]  JournalHdr     : <2 : 2>
> [...]  Journal        : <2048 : 2048>
> [...]  GrainDirectory : <4096 : 2048>
> [...]  GrainTables    : <6144 : 2048>
> [...]  FreeBitmap     : <8192 : 2048>
> [...]  BackMap        : <10240 : 2048>
> [...]  Grain          : <12288 : 204800>
> [...] Volatile Header:
> [...] volatileMagic     = 0xcafecafe
> [...] FreeGTNumber      = 0
> [...] nextTxnSeqNumber  = 0
> [...] replayJournal     = 0
> 
> The sizes that are seen in the log file are in sectors.
> Extents are of the following format: <offset : size>
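To make the units concrete, here is a minimal sketch that converts the sector-based values from the log into bytes and checks that two extents are adjacent. It assumes 512-byte sectors (consistent with "8 sectors (4KB)" in the commit message); `SectorExtent`, `sectors_to_bytes()` and `extents_contiguous()` are hypothetical names used only for illustration, not identifiers from the patch.

```c
#include <stdint.h>

/* Assumption: 512-byte sectors, so that 8 sectors == 4KB as stated above. */
#define SECTOR_SIZE 512u

/* Hypothetical helper type: one <offset : size> pair from the log, in sectors. */
typedef struct {
    uint64_t offset_sectors;
    uint64_t size_sectors;
} SectorExtent;

/* Convert a sector count (or sector offset) to bytes. */
static uint64_t sectors_to_bytes(uint64_t sectors)
{
    return sectors * SECTOR_SIZE;
}

/* Return 1 if 'next' starts exactly where 'prev' ends, 0 otherwise. */
static int extents_contiguous(SectorExtent prev, SectorExtent next)
{
    return prev.offset_sectors + prev.size_sectors == next.offset_sectors;
}
```

With the sample log above, the grain size of 8 sectors is 4096 bytes, and the GrainDirectory extent <4096 : 2048> ends exactly where GrainTables <6144 : 2048> begins.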
> 
> This commit is a strict implementation which enforces:
>     * magics
>     * version number 2.1
>     * grain size of 8 sectors  (4KB)
>     * grain table size of 64 sectors
>     * zero flags
>     * extent locations
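The strict checks listed above could look roughly like the following sketch. It is not the code from the patch: `SeSparseHeaderFields` is a hypothetical, already-parsed view of the header (not the on-disk layout), the split of the version into major/minor fields is an assumption, and the extent location checks are left out.

```c
#include <stddef.h>
#include <stdint.h>

#define SE_SPARSE_CONST_MAGIC 0xcafebabeULL  /* constMagic value from the log */

/* Hypothetical parsed header fields; names and widths are assumptions. */
typedef struct {
    uint64_t magic;
    uint32_t version_major;
    uint32_t version_minor;
    uint64_t grain_size;        /* in sectors */
    uint64_t grain_table_size;  /* in sectors */
    uint64_t flags;
} SeSparseHeaderFields;

/* Return NULL if the header passes the strict checks, else a reason string. */
static const char *se_sparse_check_header(const SeSparseHeaderFields *h)
{
    if (h->magic != SE_SPARSE_CONST_MAGIC) {
        return "invalid const header magic";
    }
    if (h->version_major != 2 || h->version_minor != 1) {
        return "unsupported version (only 2.1 is accepted)";
    }
    if (h->grain_size != 8) {
        return "unsupported grain size (only 8 sectors is accepted)";
    }
    if (h->grain_table_size != 64) {
        return "unsupported grain table size (only 64 sectors is accepted)";
    }
    if (h->flags != 0) {
        return "unsupported flags (must be zero)";
    }
    return NULL;
}
```

Returning a reason string keeps the sketch free of any error-reporting API; a real driver would report the failure through its own error mechanism instead.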
> 
> Additionally, this commit provides only a subset of the functionality
> offered by seSparse's format:
>     * Read-only
>     * No journal replay
>     * No space reclamation
>     * No unmap support
> 
> Hence, journal header, journal, free bitmap and backmap extents are
> unused, only the "classic" (L1 -> L2 -> data) grain access is
> implemented.
> 
> However there are several differences in the grain access itself.
> Grain directory (L1):
>     * Grain directory entries are indexes (not offsets) to grain
>       tables.
>     * Valid grain directory entries have their highest nibble set to
>       0x1.
>     * Since grain tables are always located at the beginning of the
>       file - the index can fit into 32 bits - so we can use its low
>       part if it's valid.
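The grain directory rules above can be sketched as follows. This is an illustration rather than the patch's code: `se_sparse_l1_index()` is a hypothetical name, and the sketch only tests the highest nibble, since the description does not say how the remaining upper bits must be set; a stricter check could also require them to be zero.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decode a 64-bit seSparse grain directory (L1) entry.
 * Returns true and stores the grain table index in *gt_index if the
 * highest nibble is 0x1; returns false otherwise.
 */
static bool se_sparse_l1_index(uint64_t entry, uint32_t *gt_index)
{
    if ((entry >> 60) != 0x1) {
        return false;  /* not a valid grain directory entry */
    }
    /* The index fits into 32 bits, so only the low half is used. */
    *gt_index = (uint32_t)(entry & 0xffffffffULL);
    return true;
}
```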
> Grain table (L2):
>     * Grain table entries are indexes (not offsets) to grains.
>     * If the highest nibble of the entry is:
>         0x0:
>             The grain is not allocated.
>             The rest of the bytes are 0.
>         0x1:
>             The grain is unmapped - guest sees a zero grain.
>             The rest of the bits point to the previously mapped grain,
>             see 0x3 case.
>         0x2:
>             The grain is zero.
>         0x3:
>             The grain is allocated - to get the index calculate:
>             ((entry & 0x0fff000000000000) >> 48) |
>             ((entry & 0x0000ffffffffffff) << 12)
>     * The difference between 0x1 and 0x2 is that 0x1 is an unallocated
>       grain which results from the guest using sg_unmap to unmap the
>       grain - but the grain itself still exists in the grain extent - a
>       space reclamation procedure should delete it.
>       Unmapping a zero grain has no effect (0x2 will not change to 0x1)
>       but unmapping an unallocated grain will (0x0 to 0x1) - naturally.
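As a worked illustration of the grain table rules above, here is a minimal sketch that classifies an L2 entry by its highest nibble and applies the quoted index formula for the allocated case. The enum and function names are hypothetical and not taken from the patch.

```c
#include <stdint.h>

/* Hypothetical classification of a 64-bit seSparse grain table (L2) entry. */
typedef enum {
    SE_GRAIN_UNALLOCATED = 0x0, /* highest nibble 0x0: not allocated */
    SE_GRAIN_UNMAPPED    = 0x1, /* highest nibble 0x1: unmapped, reads as zero */
    SE_GRAIN_ZERO        = 0x2, /* highest nibble 0x2: zero grain */
    SE_GRAIN_ALLOCATED   = 0x3, /* highest nibble 0x3: data present */
    SE_GRAIN_INVALID     = 0xf  /* any other nibble value */
} SeGrainType;

/* Classify an entry by its highest nibble. */
static SeGrainType se_sparse_l2_type(uint64_t entry)
{
    uint64_t nibble = entry >> 60;
    return nibble <= 0x3 ? (SeGrainType)nibble : SE_GRAIN_INVALID;
}

/*
 * Grain index of an allocated (0x3) entry: the 12 bits below the type
 * nibble become the low bits of the index, and the low 48 bits of the
 * entry are shifted up by 12.
 */
static uint64_t se_sparse_l2_grain_index(uint64_t entry)
{
    return ((entry & 0x0fff000000000000ULL) >> 48) |
           ((entry & 0x0000ffffffffffffULL) << 12);
}
```

For example, the entry 0x3abc000000000002 is allocated and decodes to grain index 0x2abc. A read-only driver only needs the index for the 0x3 case; per the description, 0x1 and 0x2 entries both read as zeroes.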
> 
> In order to implement seSparse some fields had to be changed to support
> both 32-bit and 64-bit entry sizes.
> 
> Reviewed-by: Karl Heubaum <address@hidden>
> Reviewed-by: Eyal Moscovici <address@hidden>
> Reviewed-by: Arbel Moshe <address@hidden>
> Signed-off-by: Sam Eiderman <address@hidden>
> ---
>  block/vmdk.c | 357 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 341 insertions(+), 16 deletions(-)
> 
> diff --git a/block/vmdk.c b/block/vmdk.c
> index 931eb2759c..4377779635 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c

[...]

> +static int vmdk_open_se_sparse(BlockDriverState *bs,
> +                               BdrvChild *file,
> +                               int flags, Error **errp)
> +{
> +    int ret;
> +    VMDKSESparseConstHeader const_header;
> +    VMDKSESparseVolatileHeader volatile_header;
> +    VmdkExtent *extent;
> +
> +    if (flags & BDRV_O_RDWR) {
> +        error_setg(errp, "No write support for seSparse images available");
> +        return -ENOTSUP;
> +    }
Kind of works for me, but why not bdrv_apply_auto_read_only() like I had
proposed?  The advantage is that this would make the node read-only if
the user has specified auto-read-only=on instead of failing.

Max
