qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-block] [PATCH] block: posix: Always allocate the first block


From: Nir Soffer
Subject: Re: [Qemu-block] [PATCH] block: posix: Always allocate the first block
Date: Thu, 22 Aug 2019 19:39:47 +0300

On Thu, Aug 22, 2019 at 5:28 PM Max Reitz <address@hidden> wrote:
On 16.08.19 23:21, Nir Soffer wrote:
> When creating an image with preallocation "off" or "falloc", the first
> block of the image is typically not allocated. When using Gluster
> storage backed by XFS filesystem, reading this block using direct I/O
> succeeds regardless of request length, fooling alignment detection.
>
> In this case we fallback to a safe value (4096) instead of the optimal
> value (512), which may lead to unneeded data copying when aligning
> requests.  Allocating the first block avoids the fallback.
>
> When using preallocation=off, we always allocate at least one filesystem
> block:
>
>     $ ./qemu-img create -f raw test.raw 1g
>     Formatting 'test.raw', fmt=raw size=1073741824
>
>     $ ls -lhs test.raw
>     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
>
> I did quick performance tests for these flows:
> - Provisioning a VM with a new raw image.
> - Copying disks with qemu-img convert to new raw target image
>
> I installed Fedora 29 server on raw sparse image, measuring the time
> from clicking "Begin installation" until the "Reboot" button appears:
>
> Before(s)  After(s)     Diff(%)
> -------------------------------
>      356        389        +8.4
>
> I ran this only once, so we cannot tell much from these results.

So you’d expect it to be fast but it was slower?  Well, you only ran it
once and it isn’t really a precise benchmark...

> The second test was cloning the installation image with qemu-img
> convert, doing 10 runs:
>
>     for i in $(seq 10); do
>         rm -f dst.raw
>         sleep 10
>         time ./qemu-img convert -f raw -O raw -t none -T none src.raw dst.raw
>     done
>
> Here is a table comparing the total time spent:
>
> Type    Before(s)   After(s)    Diff(%)
> ---------------------------------------
> real      530.028    469.123      -11.4
> user       17.204     10.768      -37.4
> sys        17.881      7.011      -60.7
>
> Here we see very clear improvement in CPU usage.
>
> Signed-off-by: Nir Soffer <address@hidden>
> ---
>  block/file-posix.c         | 25 +++++++++++++++++++++++++
>  tests/qemu-iotests/150.out |  1 +
>  tests/qemu-iotests/160     |  4 ++++
>  tests/qemu-iotests/175     | 19 +++++++++++++------
>  tests/qemu-iotests/175.out |  8 ++++----
>  tests/qemu-iotests/221.out | 12 ++++++++----
>  tests/qemu-iotests/253.out | 12 ++++++++----
>  7 files changed, 63 insertions(+), 18 deletions(-)
>
> diff --git a/block/file-posix.c b/block/file-posix.c
> index b9c33c8f6c..3964dd2021 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void *opaque)
>      return ret;
>  }

> +/*
> + * Help alignment detection by allocating the first block.
> + *
> + * When reading with direct I/O from unallocated area on Gluster backed by XFS,
> + * reading succeeds regardless of request length. In this case we fallback to
> + * safe aligment which is not optimal. Allocating the first block avoids this
> + * fallback.
> + *
> + * Returns: 0 on success, -errno on failure.
> + */
> +static int allocate_first_block(int fd)
> +{
> +    ssize_t n;
> +
> +    do {
> +        n = pwrite(fd, "\0", 1, 0);

This breaks when fd has been opened with O_DIRECT.

It seems that we always open images without O_DIRECT when creating an image
in qemu-img create, or when creating a target image in qemu-img convert.

Here is a trace of qemu-img create:

$ strace -f -tt -o /tmp/create.trace ./qemu-img create -f raw -o preallocation=falloc /tmp/gv1/src.raw 1g
Formatting '/tmp/gv1/src.raw', fmt=raw size=1073741824 preallocation=falloc

1. open #1

17686 18:58:23.681921 openat(AT_FDCWD, "/tmp/gv1/src.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 9
17686 18:58:23.683753 fstat(9, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0
17686 18:58:23.683829 close(9)          = 0

2. open #2

17686 18:58:23.684146 openat(AT_FDCWD, "/tmp/gv1/src.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 9
17686 18:58:23.684227 fstat(9, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0
17686 18:58:23.684256 close(9)          = 0

3. open #3

17686 18:58:23.684339 openat(AT_FDCWD, "/tmp/gv1/src.raw", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 9
...
17688 18:58:23.690178 fstat(9, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0
17688 18:58:23.690217 ftruncate(9, 0 <unfinished ...>
...
17688 18:58:23.700266 <... ftruncate resumed>) = 0
...
17688 18:58:23.700595 fstat(9,  <unfinished ...>
...
17688 18:58:23.700619 <... fstat resumed>{st_mode=S_IFREG|0600, st_size=0, ...}) = 0
...
17688 18:58:23.700651 fallocate(9, 0, 0, 1073741824 <unfinished ...>
...
17688 18:58:23.728141 <... fallocate resumed>) = 0
...
17688 18:58:23.728196 pwrite64(9, "\0", 1, 0) = 1
...
17686 18:58:23.738391 close(9)          = 0

Here is convert trace:

$ strace -f -tt -o /tmp/convert.trace ./qemu-img convert -f raw -O raw -t none -T none /tmp/gv1/src.raw /tmp/gv1/dst.raw

1. open #1

18175 19:07:23.364417 openat(AT_FDCWD, "/tmp/gv1/dst.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 10
18175 19:07:23.365282 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0
18175 19:07:23.365323 close(10)         = 0

2. open #2

18175 19:07:23.365660 openat(AT_FDCWD, "/tmp/gv1/dst.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 10
18175 19:07:23.365717 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0
18175 19:07:23.365742 close(10)         = 0

3. open #3

18175 19:07:23.365839 openat(AT_FDCWD, "/tmp/gv1/dst.raw", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 10
...
18177 19:07:23.372112 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0
18177 19:07:23.372138 ftruncate(10, 0)  = 0
...
18177 19:07:23.375760 fstat(10, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
18177 19:07:23.375788 ftruncate(10, 1073741824) = 0
18177 19:07:23.376383 pwrite64(10, "\0", 1, 0) = 1
...
18175 19:07:23.390989 close(10)         = 0

4. open #4

18175 19:07:23.391429 openat(AT_FDCWD, "/tmp/gv1/dst.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 10
18175 19:07:23.392433 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0
18175 19:07:23.392483 close(10)         = 0

5. open #5

18175 19:07:23.392743 openat(AT_FDCWD, "/tmp/gv1/dst.raw", O_RDWR|O_DIRECT|O_CLOEXEC) = 10
...
18175 19:07:23.393731 ioctl(10, BLKSSZGET, 0x7ffe75ead334) = -1 ENOSYS (Function not implemented)
18175 19:07:23.393784 pread64(10, 0x558796451000, 1, 0) = -1 EINVAL (Invalid argument)
18175 19:07:23.395362 pread64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512
18175 19:07:23.395905 pread64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096
...

(Which happens when you open some file with cache.direct=on, and then
use e.g. QMP’s block_resize.)

What would be a command triggering this? I can add a test.

It isn’t that bad because eventually you simply ignore the error.  But
it still makes me wonder whether we shouldn’t write like the biggest
power of two that does not exceed the new file length or MAX_BLOCKSIZE.

It makes sense if there is a way to cause qemu-img to use O_DIRECT when creating an image.

> +    } while (n == -1 && errno == EINTR);
> +
> +    return (n == -1) ? -errno : 0;
> +}
> +
>  static int handle_aiocb_truncate(void *opaque)
>  {
>      RawPosixAIOData *aiocb = opaque;
> @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void *opaque)
>                  /* posix_fallocate() doesn't set errno. */
>                  error_setg_errno(errp, -result,
>                                   "Could not preallocate new data");
> +            } else if (current_length == 0) {
> +                allocate_first_block(fd);

Should posix_fallocate() not take care of precisely this?

Only if the filesystem does not support fallocate() (e.g. NFS < 4.2).

In this case posix_fallocate() is doing:

  for (offset += (len - 1) % increment; len > 0; offset += increment)
    {
      len -= increment;
      if (offset < st.st_size)
        {
          unsigned char c;
          ssize_t rsize = __pread (fd, &c, 1, offset);
          if (rsize < 0)
            return errno;
          /* If there is a non-zero byte, the block must have been
             allocated already.  */
          else if (rsize == 1 && c != 0)
            continue;
        }
      if (__pwrite (fd, "", 1, offset) != 1)
        return errno;
    }


So opening a file with O_DIRECT will break preallocation=falloc on such filesystems,
and writing one byte in allocate_first_block() is safe.
>              }
>          } else {
>              result = 0;

[...]

> diff --git a/tests/qemu-iotests/160 b/tests/qemu-iotests/160
> index df89d3864b..ad2d054a47 100755
> --- a/tests/qemu-iotests/160
> +++ b/tests/qemu-iotests/160
> @@ -57,6 +57,10 @@ for skip in $TEST_SKIP_BLOCKS; do
>      $QEMU_IMG dd if="$TEST_IMG" of="$TEST_IMG.out" skip="$skip" -O "$IMGFMT" \
>          2> /dev/null
>      TEST_IMG="$TEST_IMG.out" _check_test_img
> +
> +    # We always write the first byte of an image.
> +    printf "\0" > "$TEST_IMG.out.dd"
> +
>      dd if="$TEST_IMG" of="$TEST_IMG.out.dd" skip="$skip" status=none

Won’t this dd completely overwrite $TEST_IMG.out.dd (especially given
the lack of conv=notrunc)?

There is an issue only if dd open the file with O_TRUNC. I will test this again.


>      echo
> diff --git a/tests/qemu-iotests/175 b/tests/qemu-iotests/175
> index 51e62c8276..c6a3a7bb1e 100755
> --- a/tests/qemu-iotests/175
> +++ b/tests/qemu-iotests/175
> @@ -37,14 +37,16 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
>  # the file size.  This function hides the resulting difference in the
>  # stat -c '%b' output.
>  # Parameter 1: Number of blocks an empty file occupies
> -# Parameter 2: Image size in bytes
> +# Parameter 2: Minimal number of blocks in an image
> +# Parameter 3: Image size in bytes
>  _filter_blocks()
>  {
>      extra_blocks=$1
> -    img_size=$2
> +    min_blocks=$2
> +    img_size=$3

> -    sed -e "s/blocks=$extra_blocks\\(\$\\|[^0-9]\\)/nothing allocated/" \
> -        -e "s/blocks=$((extra_blocks + img_size / 512))\\(\$\\|[^0-9]\\)/everything allocated/"
> +    sed -e "s/blocks=$((extra_blocks + min_blocks))\\(\$\\|[^0-9]\\)/min allocation/" \

I don’t think adding extra_blocks to min_blocks makes sense.  Just
min_blocks alone should be what we want here.

We had failing tests (in vdsm) showing that filesystem may return more blocs than
expected even for non-empty files, so this may be needed. I did not test
it yet with a filesystem that show this issue, but I found your instructions
how to create it.

Thanks for reviewing.

Nir

reply via email to

[Prev in Thread] Current Thread [Next in Thread]