qemu-devel

Re: [RFC 0/3] block/file-posix: Work around XFS bug


From: Max Reitz
Subject: Re: [RFC 0/3] block/file-posix: Work around XFS bug
Date: Tue, 29 Oct 2019 13:11:30 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.1.1

On 29.10.19 13:05, Vladimir Sementsov-Ogievskiy wrote:
> 29.10.2019 14:55, Max Reitz wrote:
>> On 29.10.19 12:48, Vladimir Sementsov-Ogievskiy wrote:
>>> 29.10.2019 11:50, Max Reitz wrote:
>>>> On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 28.10.2019 14:04, Kevin Wolf wrote:
>>>>>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
>>>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>>>
>>>> [...]
>>>>
>>>>>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>>>>>>        To my knowledge I’m the only one who has provided any benchmarks
>>>>>>>>        for this commit, and even then I was a bit skeptical because it
>>>>>>>>        performs well in some cases and bad in others.  I concluded that
>>>>>>>>        it’s probably worth it because the “some cases” are more likely
>>>>>>>>        to occur.
>>>>>>>>
>>>>>>>>        Now we have this problem of corruption here (granted due to a
>>>>>>>>        bug in the XFS driver), and another report of massively degraded
>>>>>>>>        performance on ppc64
>>>>>>>>        (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>>>>>>        private BZ; I hate that :-/  The report is about 40 % worse
>>>>>>>>        performance for an in-guest fio write benchmark.)
>>>>>>>>
>>>>>>>>        So I have to ask the question about what the justification for
>>>>>>>>        keeping c8bb23cbdbe32f is.  How much does performance increase
>>>>>>>>        with it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>>>>>>
>>>>>>>>        Advantages:
>>>>>>>>        + Trivial
>>>>>>>>        + No layering violations
>>>>>>>>        + We wouldn’t need to keep track of whether the kernel bug has
>>>>>>>>          been fixed or not
>>>>>>>>        + Fixes the ppc64+XFS performance problem
>>>>>>>>
>>>>>>>>        Disadvantages:
>>>>>>>>        - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>>>>>>          levels, whatever that means
>>>>>>>
>>>>>>> My favorite because it is clean and simple, but Vladimir has a valid
>>>>>>> use-case for requiring this performance optimization so reverting isn't
>>>>>>> an option.
>>>>>>
>>>>>> Vladimir also said that qcow2 subclusters would probably also solve his
>>>>>> problem, so maybe reverting and applying the subcluster patches instead
>>>>>> is a possible solution, too?
>>>>>
>>>>> I'm not sure about the SSD case; it may need the write-zeroes optimization anyway.
>>>>
>>>> What exactly do you need?  Do you actually need to write zeroes (e.g.
>>>> because you’re storing images on block devices) or would it be
>>>> sufficient to just drop the COW areas when bdrv_has_zero_init() and
>>>> bdrv_has_zero_init_truncate() are true?
>>>
>>> Hmm, what do you mean? We need to zero the COW areas. The original way is to
>>> write real zeroes; the optimized way is fallocate. What do you mean by drop?
>>> Mark subclusters as zeroes in the metadata?
>>
>> Why do you need to zero COW areas?  For normal files, any data will read
>> as zero if you didn’t write anything there.
> 
> Hmm, but when allocating a new cluster in qcow2, it's not guaranteed to be zero,
> as it may be a reused, previously allocated cluster.

Hm, right.  We could special-case something for beyond the EOF, but I
don’t know whether that really makes it better.

OTOH, maybe allocations at the EOF are the real bottleneck.  Reusing
existing clusters should be rare enough that maybe the existing code
which explicitly writes zeroes is sufficient.
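
Something like this is what I mean by special-casing the EOF; just an
illustrative sketch, not actual QEMU code (the function name and the
fstat()-based check are invented for the example):

/* If the cluster we have just allocated lies entirely at or beyond the
 * current EOF of a regular image file, it will read back as zeroes once
 * the file grows (which is what bdrv_has_zero_init() is about), so the
 * explicit COW zeroing could be skipped there.  A reused cluster below
 * EOF still contains stale data and has to be zeroed as before. */
#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>

static bool cluster_is_beyond_eof(int fd, uint64_t cluster_offset)
{
    struct stat st;

    if (fstat(fd, &st) < 0) {
        return false;   /* be conservative if we cannot tell */
    }
    return cluster_offset >= (uint64_t)st.st_size;
}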

>>
>>> But we'll still have COW areas within subclusters, and we'll need to zero
>>> them directly. And fallocate will most probably be faster in the SSD/ext4
>>> case.
>>>
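
Just to spell out what we are comparing at the file-posix level: the two
ways of zeroing a COW area boil down to roughly the following.  This is
only an illustrative sketch, not the actual QEMU code; the function names
and the 64k buffer size are made up, and FALLOC_FL_ZERO_RANGE may need
<linux/falloc.h> on older glibc.

#define _GNU_SOURCE
#include <fcntl.h>      /* fallocate(), FALLOC_FL_ZERO_RANGE */
#include <string.h>
#include <unistd.h>

/* Data path: “write real zeroes”, i.e. submit an explicit buffer of
 * zeroes through the normal write path. */
static int zero_by_writing(int fd, off_t off, size_t len)
{
    char buf[65536];

    memset(buf, 0, sizeof(buf));
    while (len > 0) {
        size_t chunk = len < sizeof(buf) ? len : sizeof(buf);
        ssize_t ret = pwrite(fd, buf, chunk, off);

        if (ret <= 0) {
            return -1;
        }
        off += ret;
        len -= ret;
    }
    return 0;
}

/* Metadata path: ask the filesystem to produce zeroes for the range;
 * this is the fallocate() call that reportedly has to drain in-flight
 * AIO with aio=native. */
static int zero_by_fallocate(int fd, off_t off, off_t len)
{
    return fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len);
}

The second variant is obviously much less work per cluster, which is
presumably why it wins on your SSD/ext4 setups; the question is what it
costs the rest of the AIO pipeline.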
>>>>
>>>> I’m asking because Dave Chinner said
>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) that
>>>> fallocate() is always slow at least with aio=native because it needs to
>>>> wait for all concurrent AIO writes to finish, and so it causes the AIO
>>>> pipeline to stall.
>>>>
>>>> (He suggested using XFS extent size hints to get the same effect as
>>>> write-zeroes for free, basically, but I don’t know whether that’s really
>>>> useful to us; as unallocated areas on XFS read back as zero anyway.)
>>>>
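
For reference, my understanding of Dave’s extent size hint suggestion at
the syscall level is roughly the following; untested, and if I read the
XFS documentation right the hint can only be changed while the file has
no extents allocated yet, so it would have to happen at image creation
time:

#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSSETXATTR, FS_XFLAG_EXTSIZE */
#include <sys/ioctl.h>

static int set_extent_size_hint(int fd, unsigned int hint_bytes)
{
    struct fsxattr fsx;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
        return -1;
    }
    fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
    fsx.fsx_extsize = hint_bytes;   /* e.g. the qcow2 cluster size */
    return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}

Whether that actually helps us is a different question, given that
unallocated areas read back as zero anyway.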
>>>>>> We already have some cases where the existing handle_alloc_space()
>>>>>> causes performance to actually become worse, and serialising requests as
>>>>>> a workaround isn't going to make performance any better. So even on
>>>>>> these grounds, keeping commit c8bb23cbdbe32f is questionable.
>>>>>>
>>>>>
>>>>> Can keeping handle_alloc_space under some config option be an option?
>>>>
>>>> Hm.  A config option is weird if you’re the only one who’s going to
>>>> enable it.  But other than that I don’t have anything against it.
>>>>
>>>
>>> It's just a bit easier for us to maintain a config option than an out-of-tree
>>> patch. On the other hand, it's not a real problem to maintain this one patch
>>> separately. It can return to the tree once the XFS bug is fixed.
>>
>> We’ll still have the problem that fallocate() must stall aio=native
>> requests.
>>
> 
> Does that mean fallocate is bad in general? Practice shows the opposite; at
> least I have my examples with qemu-img bench. Can that be shown with
> qemu-img bench or something?

I haven’t done benchmarks yet, but I don’t think Laurent will mind if I
paste his fio benchmark results from the unfortunately private BZ here:

         QCOW2 ON |          EXT4         |          XFS          |
                  |                       |                       |
(rw, bs, iodepth) |   2.12.0  |   4.1.0   |   2.12.0  |   4.1.0   |
------------------+-----------+-----------+-----------+-----------+
(write, 16k, 1)   | 1868KiB/s | 1865KiB/s | 18.6MiB/s | 13.7MiB/s |
------------------+-----------+-----------+-----------+-----------+
(write, 64k, 1)   | 7036KiB/s | 7413KiB/s | 27.0MiB/s | 7465KiB/s |
------------------+-----------+-----------+-----------+-----------+
(write,  4k, 8)   |  535KiB/s |  524KiB/s | 13.9MiB/s | 1662KiB/s |
------------------+-----------+-----------+-----------+-----------+

And that just looks like ext4 performs worse with aio=native in general.
But I’ll have to do my own benchmarks first.

Max
