
Re: [RFC 0/3] block/file-posix: Work around XFS bug


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [RFC 0/3] block/file-posix: Work around XFS bug
Date: Fri, 25 Oct 2019 13:56:17 +0000

25.10.2019 16:40, Vladimir Sementsov-Ogievskiy wrote:
> 25.10.2019 12:58, Max Reitz wrote:
>> Hi,
>>
>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
>> I’ve explained here:
>>
>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>>
>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
>> data corruption when using qcow2 images on XFS with aio=native.
>>
>> We can’t wait until the XFS kernel driver is fixed, we should work
>> around the problem ourselves.
>>
>> This is an RFC for three reasons:
>> (1) I don’t know whether this is the right way to address the issue,
>> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
>>      if so stop applying the workaround.
>>      I don’t know how we would go about this, so this series doesn’t do
>>      it.  (Hence it’s an RFC.)
>> (3) Perhaps it’s a bit of a layering violation to let the file-posix
>>      driver access and modify a BdrvTrackedRequest object.
>>
>> As for how we can address the issue, I see three ways:
>> (1) The one presented in this series: On XFS with aio=native, we extend
>>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>      operations) to reach until infinity (INT64_MAX in practice), mark
>>      them serializing and wait for other conflicting requests.
>>
>>      Advantages:
>>      + Limits the impact to very specific cases
>>        (And that means it wouldn’t hurt too much to keep this workaround
>>        even when the XFS driver has been fixed)
>>      + Works around the bug where it happens, namely in file-posix
>>
>>      Disadvantages:
>>      - A bit complex
>>      - A bit of a layering violation (should file-posix have access to
>>        tracked requests?)
>>
>> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
>>      becomes visible due to that function: I don’t think qcow2 writes
>>      zeroes in any other I/O path, and raw images are fixed in size so
>>      post-EOF writes won’t happen.
>>
>>      Advantages:
>>      + Maybe simpler, depending on how difficult it is to handle the
>>        layering violation
>>      + Also fixes the performance problem of handle_alloc_space() being
>>        slow on ppc64+XFS.
>>
>>      Disadvantages:
>>      - Huge layering violation because qcow2 would need to know whether
>>        the image is stored on XFS or not.
>>      - We’d definitely want to skip this workaround when the XFS driver
>>        has been fixed, so we need some method to find out whether it has
>>
>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>      To my knowledge I’m the only one who has provided any benchmarks for
>>      this commit, and even then I was a bit skeptical because it performs
>>      well in some cases and badly in others.  I concluded that it’s
>>      probably worth it because the “some cases” are more likely to occur.
>>
>>      Now we have this problem of corruption here (granted due to a bug in
>>      the XFS driver), and another report of massively degraded
>>      performance on ppc64
>>      (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>      private BZ; I hate that :-/  The report is about 40 % worse
>>      performance for an in-guest fio write benchmark.)
>>
>>      So I have to ask the question about what the justification for
>>      keeping c8bb23cbdbe32f is.  How much does performance increase with
>>      it actually?  (On non-(ppc64+XFS) machines, obviously)
>>
>>      Advantages:
>>      + Trivial
>>      + No layering violations
>>      + We wouldn’t need to keep track of whether the kernel bug has been
>>        fixed or not
>>      + Fixes the ppc64+XFS performance problem
>>
>>      Disadvantages:
>>      - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>        levels, whatever that means
>>
>> So this is the main reason this is an RFC: What should we do?  Is (1)
>> really the best choice?
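
To make approach (1) a bit more concrete, here is a minimal sketch of what the
file-posix write-zeroes path could do. This is not the actual series; all of
the helper names (raw_on_xfs(), raw_uses_linux_aio(),
tracked_request_set_serializing(), tracked_request_wait_conflicts(),
raw_do_write_zeroes()) are hypothetical placeholders, not real QEMU APIs:

/* Sketch only: every helper below is a hypothetical placeholder. */
static int coroutine_fn raw_write_zeroes_xfs_workaround(BlockDriverState *bs,
                                                        BdrvTrackedRequest *req,
                                                        int64_t offset,
                                                        int64_t bytes)
{
    if (raw_on_xfs(bs) && raw_uses_linux_aio(bs) &&
        offset >= raw_getlength(bs)) {
        /* Post-EOF write-zeroes (fallocate()) on XFS with aio=native:
         * widen the tracked request to "infinity" (INT64_MAX in practice)
         * so it overlaps any later request landing in the zeroed tail ... */
        req->bytes = INT64_MAX - req->offset;
        /* ... then make it serializing and wait for conflicting requests,
         * so no concurrent AIO write can race with the zeroing. */
        tracked_request_set_serializing(req);
        tracked_request_wait_conflicts(req);
    }
    /* Issue the actual fallocate()-based zeroing as before. */
    return raw_do_write_zeroes(bs, offset, bytes);
}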
>>
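
For approach (2), the filesystem detection itself is simple; the layering
question is only where that knowledge should live. Below is a self-contained,
Linux-specific sketch of checking whether an image file lives on XFS, using
fstatfs(2) and XFS_SUPER_MAGIC from <linux/magic.h> (not necessarily how QEMU
would or should do it):

/* Standalone sketch: report whether a file lives on XFS (Linux-only). */
#include <stdio.h>
#include <sys/vfs.h>
#include <linux/magic.h>   /* XFS_SUPER_MAGIC */

static int fd_is_on_xfs(int fd)
{
    struct statfs sfs;

    if (fstatfs(fd, &sfs) < 0) {
        return -1;  /* error */
    }
    return sfs.f_type == XFS_SUPER_MAGIC;
}

int main(int argc, char **argv)
{
    FILE *f = fopen(argc > 1 ? argv[1] : "test.img", "rb");

    if (!f) {
        perror("fopen");
        return 1;
    }
    printf("on XFS: %d\n", fd_is_on_xfs(fileno(f)));
    fclose(f);
    return 0;
}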
>>
>> In any case, I’ve run the test case I showed in
>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
>> more than ten times with this series applied and the installation
>> succeeded every time.  (Without this series, it fails like every other
>> time.)
>>
>>
> 
> Hi!
> 
> First, great thanks for your investigation!
> 
> We need commit c8bb23cbdbe3 because we use 1M clusters, and zeroing 1M takes
> significant time.
> 
> I've tested a bit:
> 
> test:
> for img in /ssd/test.img /test.img; do
>   for cl in 64K 1M; do
>     for step in 4K 64K 1M; do
>       ./qemu-img create -f qcow2 -o cluster_size=$cl $img 15G > /dev/null
>       printf '%-15s%-7s%-10s : ' $img cl=$cl step=$step
>       ./qemu-img bench -c $((15 * 1024)) -n -s 4K -S $step -t none -w $img | tail -1 | awk '{print $4}'
>     done
>   done
> done
> 
> on master:
> 
> /ssd/test.img  cl=64K step=4K    : 0.291
> /ssd/test.img  cl=64K step=64K   : 0.813
> /ssd/test.img  cl=64K step=1M    : 2.799
> /ssd/test.img  cl=1M  step=4K    : 0.217
> /ssd/test.img  cl=1M  step=64K   : 0.332
> /ssd/test.img  cl=1M  step=1M    : 0.685
> /test.img      cl=64K step=4K    : 1.751
> /test.img      cl=64K step=64K   : 14.811
> /test.img      cl=64K step=1M    : 18.321
> /test.img      cl=1M  step=4K    : 0.759
> /test.img      cl=1M  step=64K   : 13.574
> /test.img      cl=1M  step=1M    : 28.970
> 
> rerun on master:
> 
> /ssd/test.img  cl=64K step=4K    : 0.295
> /ssd/test.img  cl=64K step=64K   : 0.803
> /ssd/test.img  cl=64K step=1M    : 2.921
> /ssd/test.img  cl=1M  step=4K    : 0.233
> /ssd/test.img  cl=1M  step=64K   : 0.321
> /ssd/test.img  cl=1M  step=1M    : 0.762
> /test.img      cl=64K step=4K    : 1.873
> /test.img      cl=64K step=64K   : 15.621
> /test.img      cl=64K step=1M    : 18.428
> /test.img      cl=1M  step=4K    : 0.883
> /test.img      cl=1M  step=64K   : 13.484
> /test.img      cl=1M  step=1M    : 26.244
> 
> 
> on master + revert c8bb23cbdbe32f5c326
> 
> /ssd/test.img  cl=64K step=4K    : 0.395
> /ssd/test.img  cl=64K step=64K   : 4.231
> /ssd/test.img  cl=64K step=1M    : 5.598
> /ssd/test.img  cl=1M  step=4K    : 0.352
> /ssd/test.img  cl=1M  step=64K   : 2.519
> /ssd/test.img  cl=1M  step=1M    : 38.919
> /test.img      cl=64K step=4K    : 1.758
> /test.img      cl=64K step=64K   : 9.838
> /test.img      cl=64K step=1M    : 13.384
> /test.img      cl=1M  step=4K    : 1.849
> /test.img      cl=1M  step=64K   : 19.405
> /test.img      cl=1M  step=1M    : 157.090
> 
> rerun:
> 
> /ssd/test.img  cl=64K step=4K    : 0.407
> /ssd/test.img  cl=64K step=64K   : 3.325
> /ssd/test.img  cl=64K step=1M    : 5.641
> /ssd/test.img  cl=1M  step=4K    : 0.346
> /ssd/test.img  cl=1M  step=64K   : 2.583
> /ssd/test.img  cl=1M  step=1M    : 39.692
> /test.img      cl=64K step=4K    : 1.727
> /test.img      cl=64K step=64K   : 10.058
> /test.img      cl=64K step=1M    : 13.441
> /test.img      cl=1M  step=4K    : 1.926
> /test.img      cl=1M  step=64K   : 19.738
> /test.img      cl=1M  step=1M    : 158.268
> 
> 
> So it's obvious that c8bb23cbdbe32f5c326 is significant for 1M cluster size,
> even on a rotational disk, which means the previous assumption that
> handle_alloc_space() only helps on SSDs is wrong; we need smarter heuristics.
> 
> So, I'd prefer (1) or (2).
> 

About the degradation in some cases: I think the problem is that one (slightly
larger) write may be faster than fast-write-zeroes plus a small write, as the
latter means an additional write to metadata. That is expected for small
clusters on a rotational disk, but the actual threshold depends on the
specific disk. So a possible solution is to sometimes issue the request with
handle_alloc_space() and sometimes without, remember the time and length of
each request, and derive a dynamic limit from that...
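
Roughly what I mean, as an illustrative standalone sketch (not QEMU code; the
names and the probing policy are invented for the example): keep a running
per-byte cost estimate for each of the two strategies, normally pick the
cheaper one, and keep probing the other occasionally so the estimate adapts to
the disk.

/* Illustrative sketch: dynamically choose between "zero the whole cluster
 * plus a small data write" (handle_alloc_space-style) and "one plain larger
 * write", based on measured cost per byte. */
#include <stdint.h>
#include <stdlib.h>

enum { STRAT_ZERO_CLUSTER = 0, STRAT_PLAIN_WRITE = 1 };

typedef struct AllocWriteStats {
    double ns_per_byte[2];
    uint64_t samples[2];
} AllocWriteStats;

/* Pick a strategy for the next allocating write. */
static int choose_strategy(AllocWriteStats *st)
{
    /* Probe both strategies a few times first, then keep probing
     * roughly 1 time in 32 so the estimates can adapt over time. */
    if (st->samples[0] < 8 || st->samples[1] < 8 || (rand() & 31) == 0) {
        return rand() & 1;
    }
    return st->ns_per_byte[STRAT_ZERO_CLUSTER] <= st->ns_per_byte[STRAT_PLAIN_WRITE]
           ? STRAT_ZERO_CLUSTER : STRAT_PLAIN_WRITE;
}

/* Feed back the measured duration and length of a completed request. */
static void record_result(AllocWriteStats *st, int strategy,
                          uint64_t bytes, uint64_t duration_ns)
{
    double cost = (double)duration_ns / (double)(bytes ? bytes : 1);

    if (st->samples[strategy] == 0) {
        st->ns_per_byte[strategy] = cost;
    } else {
        /* Exponential moving average with weight 1/8 for new samples. */
        st->ns_per_byte[strategy] += (cost - st->ns_per_byte[strategy]) / 8.0;
    }
    st->samples[strategy]++;
}
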
-- 
Best regards,
Vladimir
