[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Dropping posix_fallocate for -o preallocation falloc
From: |
Nir Soffer |
Subject: |
Re: Dropping posix_fallocate for -o preallocation falloc |
Date: |
Sat, 29 Aug 2020 02:49:49 +0300 |
On Tue, Aug 25, 2020 at 2:27 PM Kevin Wolf <kwolf@redhat.com> wrote:
>
> Am 23.08.2020 um 16:46 hat Nir Soffer geschrieben:
> > Using -o preallocation falloc works great on NFS 4.2 and local file system,
> > when fallocate() is supported, but when it is not, posix_fallocate falls
> > back
> > to very inefficient way:
> > https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96
> >
> > This will read the last byte for every 4k block, and if the byte is
> > null, write one null byte.
> >
> > This minimizes the amount of data sent over the wire, but is very slow.
> >
> > In file-posix we optimize this flow by not truncating the file to the
> > final size, so this will only write one null byte for every 4k block,
> > but this is still very slow.
> >
> > Except the poor performance, we have a bug showing that for some reason,
> > this does not work well with OFD locking:
> > https://bugzilla.redhat.com/1851097
> >
> > In oVirt 4.4.2 we avoid the issue by not using -o preallocation
> > falloc. Instead we use our own fallocate helper:
> > https://github.com/oVirt/vdsm/blob/master/helpers/fallocate
> >
> > (We got feedback that the name of this helper is confusing since it does
> > destructive operation when fallocate() is not supported. We will
> > change the name)
> >
> > This helper is similar to posix_fallocate, but instead of falling back
> > to writing one byte per 4k block, it falls back to writing zeros in
> > large blocks.
> >
> > Testing shows that this improves fallocation time by 385% for one disk, and
> > 468% for 10 concurrent disk preallocation:
> > https://bugzilla.redhat.com/1850267#c25
> >
> > I think the next step is to move this change into qemu, so all users can
> > benefit from this change.
> >
> > I think the way to do this is to replace posix_fallocate() with fallocate(),
> > and fallback to "full" preallocation if fallocate is not supported.
>
> Yes, this makes sense to me.
>
> However, what full preallocation uses, is not exactly a large buffer
> either (64k). Especially with your desire to use this with O_DIRECT,
> we'd better use something like a 2 MB buffer.
>
> Actually, looking at the code, it's completely broken for O_DIRECT
> because it g_malloc()s the (then misaligned) buffer. This needs to
> become qemu_try_blockalign0(). It's currently failing:
>
> $ ./qemu-img create -f raw ~/tmp/test.raw 64M
> Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=67108864
> $ ./qemu-img resize --image-opts --preallocation=full
> driver=file,filename=/home/kwolf/tmp/test.raw,cache.direct=on 72M
> qemu-img: Could not write zeros for preallocation: Invalid argument
>
> > However with current code, in qemu-img create, we don't have a way to
> > force O_DIRECT for the preallocation, and in qemu-img convert the
> > preallocation step does not respect the -t none flag. Not using
> > O_DIRECT in oVirt is very bad, and likely to cause timeouts in sanlock
> > when the kernel flushes the page cache.
>
> If we're talking about regular files, the easy way without changing QEMU
> would be to use the same way as in my example above: Create a 0 byte
> image file and then resize it with full preallocation.
>
> But I can see that passing a cache mode to image creation is desirable.
> So are probably some other options for image creation processes where
> the image is actually opened with its block driver.
>
> Possible options I see in the order from an ad-hoc hack to a fully
> generic interface:
>
> 1. Just add a cache mode create option to file-posix. qemu-img create
> would mostly ignore this option for qcow2 (it would use O_DIRECT for
> creating the raw image file, but not for opening when writing the
> qcow2 data, including preallocation)
>
> QMP blockde-create would allow you to get the desired behaviour
> because you open the raw layer explicitly there. It just wouldn't be
> available in qemu-img create.
>
> 2. Add a cache mode/flags parameter to .bdrv_co_create_opts (the legacy
> callback used for qemu-img create, but not QMP blockdev-create).
> Format drivers would then use that for their bdrv_open() call on the
> protocol layer; file-posix would directly use it for opening the
> file.
>
> 3. Add a full BlockdevOptions parameter to .bdrv_co_create_opts() that
> is used for the bdrv_open() of the protocol layer. I guess this would
> bring in features that are currently not accessible from qemu-img
> create because they can't be encoded in a URL but require options
> (such as multiple servers for rbd or gluster).
>
> 4. Have some kind of full blockdev-create support in qemu-img create.
> This wouldn't require any changes to the block drivers, but finding a
> good syntax for qemu-img create might not be trivial.
>
> Given that 1-3 are all hacks to the legacy interface, maybe 4 is what we
> should be doing to expose the full set of QEMU features in qemu-img
> create, even though it poses interface questions that we wouldn't have
> to answer with the other options.
>
> > So needed changes are:
> >
> > 1. Add a way to control cache in qemu-img create (-t none? -o cache=none?)
> > 2. Respect -t none in qemu-img convert -o preallocation falloc
>
> -o cache=none would be what results from my option 1. above.
>
> -t none is what 2. might look like.
>
> For 3. we might get something like --protocol-options cache.direct=on
>
> And finally 4. is not so easy. Maybe something like this:
>
> --create file,filename=foo.qcow2,open.node-name=img0,open.cache.direct=on
> --create qcow2,file=img0,preallocation=full,size=1G
Thanks for the tips.
This (or using 'json:') is fine for management application, but for
users I think we need
something easier to use like:
qemu-img create -f raw -o preallocation=falloc,cache=none file 1g
And for qemu-img convert, respecting -t none would be best, so users
do not have to
specify cache=none twice and using two different ways.
I don't know how this will be wired internally, I'm thinking about the
user interface.
Can we use blockdev-create internally with such an interface?
> Actually, on second thoughts, 4. doesn't really work for raw images
> without further changes because the preallocation isn't done in a format
> driver, but directly during the creation of the protocol driver. :-/
So it should move to block/raw-format.c?
I hoped this would be an easy change :-)
> > 3. Replace posix_falloate to fallocate
> >
> > https://github.com/qemu/qemu/blob/152be6de9100e58b5d896272e951d4c910bd735a/block/file-posix.c#L1868
> > 4. Fall back to full zeroing if fallocate is not supported
> >
> > https://github.com/qemu/qemu/blob/152be6de9100e58b5d896272e951d4c910bd735a/block/file-posix.c#L1891
> > 5. Probably use larger zero buffer, 64k is not efficient
> >
> > https://github.com/qemu/qemu/blob/152be6de9100e58b5d896272e951d4c910bd735a/block/file-posix.c#L1907
>
> I agree with these. Maybe it's an easy first step that can be done while
> we're still figuring out how to get O_DIRECT used.
I'll start with this then.
Nir