qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH] block: posix: Always allocate the first block


From: Nir Soffer
Subject: Re: [Qemu-devel] [PATCH] block: posix: Always allocate the first block
Date: Thu, 22 Aug 2019 22:01:09 +0300

On Thu, Aug 22, 2019 at 9:11 PM Max Reitz <address@hidden> wrote:

> On 22.08.19 18:39, Nir Soffer wrote:
> > On Thu, Aug 22, 2019 at 5:28 PM Max Reitz <address@hidden
> > <mailto:address@hidden>> wrote:
> >
> >     On 16.08.19 23:21, Nir Soffer wrote:
> >     > When creating an image with preallocation "off" or "falloc", the
> first
> >     > block of the image is typically not allocated. When using Gluster
> >     > storage backed by XFS filesystem, reading this block using direct
> I/O
> >     > succeeds regardless of request length, fooling alignment detection.
> >     >
> >     > In this case we fallback to a safe value (4096) instead of the
> optimal
> >     > value (512), which may lead to unneeded data copying when aligning
> >     > requests.  Allocating the first block avoids the fallback.
> >     >
> >     > When using preallocation=off, we always allocate at least one
> >     filesystem
> >     > block:
> >     >
> >     >     $ ./qemu-img create -f raw test.raw 1g
> >     >     Formatting 'test.raw', fmt=raw size=1073741824
> >     >
> >     >     $ ls -lhs test.raw
> >     >     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> >     >
> >     > I did quick performance tests for these flows:
> >     > - Provisioning a VM with a new raw image.
> >     > - Copying disks with qemu-img convert to new raw target image
> >     >
> >     > I installed Fedora 29 server on raw sparse image, measuring the
> time
> >     > from clicking "Begin installation" until the "Reboot" button
> appears:
> >     >
> >     > Before(s)  After(s)     Diff(%)
> >     > -------------------------------
> >     >      356        389        +8.4
> >     >
> >     > I ran this only once, so we cannot tell much from these results.
> >
> >     So you’d expect it to be fast but it was slower?  Well, you only ran
> it
> >     once and it isn’t really a precise benchmark...
> >
> >     > The second test was cloning the installation image with qemu-img
> >     > convert, doing 10 runs:
> >     >
> >     >     for i in $(seq 10); do
> >     >         rm -f dst.raw
> >     >         sleep 10
> >     >         time ./qemu-img convert -f raw -O raw -t none -T none
> >     src.raw dst.raw
> >     >     done
> >     >
> >     > Here is a table comparing the total time spent:
> >     >
> >     > Type    Before(s)   After(s)    Diff(%)
> >     > ---------------------------------------
> >     > real      530.028    469.123      -11.4
> >     > user       17.204     10.768      -37.4
> >     > sys        17.881      7.011      -60.7
> >     >
> >     > Here we see very clear improvement in CPU usage.
> >     >
> >     > Signed-off-by: Nir Soffer <address@hidden
> >     <mailto:address@hidden>>
> >     > ---
> >     >  block/file-posix.c         | 25 +++++++++++++++++++++++++
> >     >  tests/qemu-iotests/150.out |  1 +
> >     >  tests/qemu-iotests/160     |  4 ++++
> >     >  tests/qemu-iotests/175     | 19 +++++++++++++------
> >     >  tests/qemu-iotests/175.out |  8 ++++----
> >     >  tests/qemu-iotests/221.out | 12 ++++++++----
> >     >  tests/qemu-iotests/253.out | 12 ++++++++----
> >     >  7 files changed, 63 insertions(+), 18 deletions(-)
> >     >
> >     > diff --git a/block/file-posix.c b/block/file-posix.c
> >     > index b9c33c8f6c..3964dd2021 100644
> >     > --- a/block/file-posix.c
> >     > +++ b/block/file-posix.c
> >     > @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void
> *opaque)
> >     >      return ret;
> >     >  }
> >     >
> >     > +/*
> >     > + * Help alignment detection by allocating the first block.
> >     > + *
> >     > + * When reading with direct I/O from unallocated area on Gluster
> >     backed by XFS,
> >     > + * reading succeeds regardless of request length. In this case we
> >     fallback to
> >     > + * safe aligment which is not optimal. Allocating the first block
> >     avoids this
> >     > + * fallback.
> >     > + *
> >     > + * Returns: 0 on success, -errno on failure.
> >     > + */
> >     > +static int allocate_first_block(int fd)
> >     > +{
> >     > +    ssize_t n;
> >     > +
> >     > +    do {
> >     > +        n = pwrite(fd, "\0", 1, 0);
> >
> >     This breaks when fd has been opened with O_DIRECT.
> >
> >
> > It seems that we always open images without O_DIRECT when creating an
> image
> > in qemu-img create, or when creating a target image in qemu-img convert.
>
> Yes.  But you don’t call this function directly from image creation code
> but instead from the truncation function.  (The former also calls the
> latter, but truncating is also an operation on its own.)
>
> [...]
>
> >     (Which happens when you open some file with cache.direct=on, and then
> >     use e.g. QMP’s block_resize.)
> >
> >
> > What would be a command triggering this? I can add a test.
>
> block_resize, as I’ve said:
>
> $ ./qemu-img create -f raw empty.img 0
>

This is extreme edge case - why would someone create such image?


> $ x86_64-softmmu/qemu-system-x86_64 \
>     -qmp stdio \
>     -blockdev file,node-name=file,filename=empty.img,cache.direct=on \
>      <<EOF
> {'execute':'qmp_capabilities'}
>

This is probably too late for the allocation, since we already probed
the alignment before executing block_resize, and used a safe fallback
(4096).
It can help if the image is reopened, since we probe alignment again.

> {'execute':'block_resize',
>  'arguments':{'node-name':'file',
>               'size':1048576}}

EOF
> $ ./qemu-img map empty.img
> Offset          Length          Mapped to       File
>
> (You’d expect a data chunk here.)

I suppose you can get the same effect with blockdev-create and some
> format that explicitly resizes the file to some target length (LUKS does
> this, I think), but this is the most direct route.
>

I will try to handle -blockdev in the next version.

>     It isn’t that bad because eventually you simply ignore the error.  But
> >     it still makes me wonder whether we shouldn’t write like the biggest
> >     power of two that does not exceed the new file length or
> MAX_BLOCKSIZE.
> >
> >
> > It makes sense if there is a way to cause qemu-img to use O_DIRECT when
> > creating an image.
> >
> >     > +    } while (n == -1 && errno == EINTR);
> >     > +
> >     > +    return (n == -1) ? -errno : 0;
> >     > +}
> >     > +
> >     >  static int handle_aiocb_truncate(void *opaque)
> >     >  {
> >     >      RawPosixAIOData *aiocb = opaque;
> >     > @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void
> *opaque)
> >     >                  /* posix_fallocate() doesn't set errno. */
> >     >                  error_setg_errno(errp, -result,
> >     >                                   "Could not preallocate new
> data");
> >     > +            } else if (current_length == 0) {
> >     > +                allocate_first_block(fd);
> >
> >     Should posix_fallocate() not take care of precisely this?
> >
> >
> > Only if the filesystem does not support fallocate() (e.g. NFS < 4.2).
> >
> > In this case posix_fallocate() is doing:
> >
> >   for (offset += (len - 1) % increment; len > 0; offset += increment)
> >     {
> >       len -= increment;
> >       if (offset < st.st_size)
> >         {
> >           unsigned char c;
> >           ssize_t rsize = __pread (fd, &c, 1, offset);
> >           if (rsize < 0)
> >             return errno;
> >           /* If there is a non-zero byte, the block must have been
> >              allocated already.  */
> >           else if (rsize == 1 && c != 0)
> >             continue;
> >         }
> >       if (__pwrite (fd, "", 1, offset) != 1)
> >         return errno;
> >     }
> >
> >
> https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96
> >
> > So opening a file with O_DIRECT will break preallocation=falloc on such
> > filesystems,
>
> But won’t the function above just fail with EINVAL?
> allocate_first_block() is executed only in case of success.
>

Sure, but if posix_fallocate() fails, we fail qemu-img create/convert.

> and writing one byte in allocate_first_block() is safe.
> >
> >     >              }
> >     >          } else {
> >     >              result = 0;
> >
> >     [...]
> >
> >     > diff --git a/tests/qemu-iotests/160 b/tests/qemu-iotests/160
> >     > index df89d3864b..ad2d054a47 100755
> >     > --- a/tests/qemu-iotests/160
> >     > +++ b/tests/qemu-iotests/160
> >     > @@ -57,6 +57,10 @@ for skip in $TEST_SKIP_BLOCKS; do
> >     >      $QEMU_IMG dd if="$TEST_IMG" of="$TEST_IMG.out" skip="$skip"
> >     -O "$IMGFMT" \
> >     >          2> /dev/null
> >     >      TEST_IMG="$TEST_IMG.out" _check_test_img
> >     > +
> >     > +    # We always write the first byte of an image.
> >     > +    printf "\0" > "$TEST_IMG.out.dd"
> >     > +
> >     >      dd if="$TEST_IMG" of="$TEST_IMG.out.dd" skip="$skip"
> status=none
> >
> >     Won’t this dd completely overwrite $TEST_IMG.out.dd (especially given
> >     the lack of conv=notrunc)?
> >
> >
> > There is an issue only if dd open the file with O_TRUNC.
>
> It isn’t an issue, I just don’t understand why the printf would be
> necessary at all.
>
> dd should always truncate the output image unless conv=notrunc is
> specified.  But even if it didn’t do that, in all of these test cases it
> should copy some data from $TEST_IMG to the output, and thus should
> always overwrite the first byte anyway.
>

Right, this change is not needed.

> I will test
> > this again.
> >
> >     >
> >     >      echo
> >     > diff --git a/tests/qemu-iotests/175 b/tests/qemu-iotests/175
> >     > index 51e62c8276..c6a3a7bb1e 100755
> >     > --- a/tests/qemu-iotests/175
> >     > +++ b/tests/qemu-iotests/175
> >     > @@ -37,14 +37,16 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
> >     >  # the file size.  This function hides the resulting difference in
> the
> >     >  # stat -c '%b' output.
> >     >  # Parameter 1: Number of blocks an empty file occupies
> >     > -# Parameter 2: Image size in bytes
> >     > +# Parameter 2: Minimal number of blocks in an image
> >     > +# Parameter 3: Image size in bytes
> >     >  _filter_blocks()
> >     >  {
> >     >      extra_blocks=$1
> >     > -    img_size=$2
> >     > +    min_blocks=$2
> >     > +    img_size=$3
> >     >
> >     > -    sed -e "s/blocks=$extra_blocks\\(\$\\|[^0-9]\\)/nothing
> >     allocated/" \
> >     > -        -e "s/blocks=$((extra_blocks + img_size /
> >     512))\\(\$\\|[^0-9]\\)/everything allocated/"
> >     > +    sed -e "s/blocks=$((extra_blocks +
> >     min_blocks))\\(\$\\|[^0-9]\\)/min allocation/" \
> >
> >     I don’t think adding extra_blocks to min_blocks makes sense.  Just
> >     min_blocks alone should be what we want here.
> >
> >
> > We had failing tests (in vdsm) showing that filesystem may return more
> > blocs than
> > expected even for non-empty files, so this may be needed.
>
> But min_blocks is exactly the number of blocks of a file that has one
> allocated block.  I don’t see how adding the number of blocks an empty
> file occupies makes sense.
>

You are right. The test fails on filesystem that allocates extra blocks.

Nir


reply via email to

[Prev in Thread] Current Thread [Next in Thread]