>On 22.03.21 10:25, ChangLimin wrote:
>> For Linux 5.10/5.11, qemu write zeros to a multipath device using
>> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY
>> permanently.
>
>So as far as I can track back the discussion, Kevin asked on v1 why we’d
>set has_write_zeroes to false, i.e. whether the EBUSY might not go away
>at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then.
>You haven’t explicitly replied to that question (as far as I can see),
>so it kind of still stands.
>
>Implicitly, there are two conflicting answers in this patch: On one
>hand, the commit message says “permanently”, and this is what you told
>Nir as a realistic case where this can occur.
For Linux 5.10/5.11, the EBUSY is permanently, the reproduce step is below.
For other Linux version, the EBUSY may be temporary.
Because Linux 5.10/5.11 is not used widely, so do not set has_write_zeroes to false.
>I'm afraid ChangLimin did not answer my question. I'm looking for real
>world used case when qemu cannot write zeros to multipath device, when
>nobody else is using the device.
>
>I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert,
>once with a multipath device, and once with logical volume on a vg created
>on the multipath device, and I could not reproduce this issue.
The following is steps to reproduct the issue on Fedora 34.
# uname -a
Linux fedora-34 5.11.3-300.fc34.x86_64 #1 SMP Thu Mar 4 19:03:18 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Is this the most recent kernel? I have 5.11.7 in fedora 32.
# qemu-img -V
qemu-img version 5.2.0 (qemu-5.2.0-5.fc34.1)
1. Login in an ISCSI LUN created using targetcli on ubuntu 20.04
# iscsiadm -m discovery -t st -p 192.169.1.109
# iscsiadm -m node -l -T iqn.2003-01.org.linux-iscsi:lio-lv100
2. start multipathd service
# mpathconf --enable
# systemctl start multipathd
3. add multipath path
# multipath -a `/lib/udev/scsi_id -g /dev/sdb` # sdb means the ISCSI LUN
wwid '36001405b76856e4816b48b99c6a77de3' added
# multipathd add path /dev/sdb
ok
# multipath -ll # /dev/dm-1 is the multipath device based on /dev/sdb
mpatha (36001405bebfc3a0522541cda30220db9) dm-1 LIO-ORG,lv102
size=1.0G features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
`- 5:0:0:0 sdd 8:48 active ready running
You are using user_friendly_names which is (sadly) the default.
But I don't think it should matter.
4. qemu-img return EBUSY both to dm-1 and sdb
# qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/dm-1
qemu-img: error while writing at byte 0: Device or resource busy
# qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/sdb
qemu-img: error while writing at byte 0: Device or resource busy
5. blkdiscard also return EBUSY both to dm-1 and sdb
# blkdiscard -o 0 -l 4096 /dev/dm-1
blkdiscard: cannot open /dev/dm-1: Device or resource busy
# blkdiscard -o 0 -l 4096 /dev/sdb
blkdiscard: cannot open /dev/sdb: No such file or directory
6. dd write zero is good, because it does not use blkdiscard
# dd if=/dev/zero of=/dev/dm-1 bs=1M count=100 oflag=direct
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 2.33623 s, 44.9 MB/s
7. The LUN should support blkdiscard feature, otherwise it will not write zero
with ioctl(fd, BLKZEROOUT, range)
Thanks!
I could not reproduce this with kernel 5.10, but now I'm no 5.11:
# uname -r
5.11.7-100.fc32.x86_64
# qemu-img --version
qemu-img version 5.2.0 (qemu-5.2.0-6.fc32.1)
Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers
# cat /etc/multipath.conf
defaults {
user_friendly_names no
find_multipaths no
}
blacklist_exceptions {
property "(SCSI_IDENT_|ID_WWN)"
}
blacklist {
}
# multipath -ll 36001405e884ab8ff4b44fdba6901099c
36001405e884ab8ff4b44fdba6901099c dm-8 LIO-ORG,3-09
size=6.0G features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
`- 1:0:0:9 sdk 8:160 active ready running
$ lsblk /dev/sdk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdk 8:160 0 6G 0 disk
└─36001405e884ab8ff4b44fdba6901099c 253:13 0 6G 0 mpath
$ virt-builder fedora-32 -o disk.img
[ 2.9] Downloading:
http://builder.libguestfs.org/fedora-32.xz[ 3.8] Planning how to build this image
[ 3.8] Uncompressing
[ 11.1] Opening the new disk
[ 16.1] Setting a random seed
[ 16.1] Setting passwords
virt-builder: Setting random password of root to TcikQqRxAaIqS1kF
[ 17.0] Finishing off
Output file: disk.img
Output size: 6.0G
Output format: raw
Total usable space: 5.4G
Free space: 4.0G (74%)
$ qemu-img info disk.img
image: disk.img
file format: raw
virtual size: 6 GiB (6442450944 bytes)
disk size: 1.29 GiB
# qemu-img convert -p -f raw -O raw -t none -T none disk.img /dev/mapper/36001405e884ab8ff4b44fdba6901099c
(100.00/100%)
Works.
# lsblk /dev/sdk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdk 8:160 0 6G 0 disk
└─36001405e884ab8ff4b44fdba6901099c 253:13 0 6G 0 mpath
├─36001405e884ab8ff4b44fdba6901099c1 253:14 0 1M 0 part
├─36001405e884ab8ff4b44fdba6901099c2 253:15 0 1G 0 part
├─36001405e884ab8ff4b44fdba6901099c3 253:16 0 615M 0 part
└─36001405e884ab8ff4b44fdba6901099c4 253:17 0 4.4G 0 part
# qemu-img convert -p -f raw -O raw -t none -T none disk.img /dev/mapper/36001405e884ab8ff4b44fdba6901099c
(100.00/100%)
Works, wiping the partitions.
# mount /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 /tmp/mnt
# mount | grep /dev/mapper/36001405e884ab8ff4b44fdba6901099c4
/dev/mapper/36001405e884ab8ff4b44fdba6901099c4 on /tmp/mnt type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
This is invalid use, converting to device with mounted file system, but let's try:
# qemu-img convert -p -f raw -O raw -t none -T none disk.img /dev/mapper/36001405e884ab8ff4b44fdba6901099c
(100.00/100%)
Still works?!
# wipefs -a /dev/mapper/36001405e884ab8ff4b44fdba6901099c
wipefs: error: /dev/mapper/36001405e884ab8ff4b44fdba6901099c: probing initialization failed: Device or resource busy
# blkdiscard /dev/mapper/36001405e884ab8ff4b44fdba6901099c
Works.
This the configuration of the LUN on the server side:
{
"aio": false,
"alua_tpgs": [
{
"alua_access_state": 0,
"alua_access_status": 0,
"alua_access_type": 3,
"alua_support_active_nonoptimized": 1,
"alua_support_active_optimized": 1,
"alua_support_offline": 1,
"alua_support_standby": 1,
"alua_support_transitioning": 1,
"alua_support_unavailable": 1,
"alua_write_metadata": 0,
"implicit_trans_secs": 0,
"name": "default_tg_pt_gp",
"nonop_delay_msecs": 100,
"preferred": 0,
"tg_pt_gp_id": 0,
"trans_delay_msecs": 0
}
],
"attributes": {
"block_size": 512,
"emulate_3pc": 1,
"emulate_caw": 1,
"emulate_dpo": 1,
"emulate_fua_read": 1,
"emulate_fua_write": 1,
"emulate_model_alias": 1,
"emulate_pr": 1,
"emulate_rest_reord": 0,
"emulate_tas": 1,
"emulate_tpu": 1,
"emulate_tpws": 1,
"emulate_ua_intlck_ctrl": 0,
"emulate_write_cache": 1,
"enforce_pr_isids": 1,
"force_pr_aptpl": 0,
"is_nonrot": 0,
"max_unmap_block_desc_count": 1,
"max_unmap_lba_count": 8192,
"max_write_same_len": 65335,
"optimal_sectors": 16384,
"pi_prot_format": 0,
"pi_prot_type": 0,
"pi_prot_verify": 0,
"queue_depth": 128,
"unmap_granularity": 1,
"unmap_granularity_alignment": 0,
"unmap_zeroes_data": 0
},
"dev": "/target/3/09",
"name": "3-09",
"plugin": "fileio",
"size": 6442450944,
"write_back": true,
"wwn": "e884ab8f-f4b4-4fdb-a690-1099c072c86d"
},
Maybe this upstream change is not in all downstream 5.11 kernels, or 5.11.7
already includes the fix?
Adding Ben, maybe he had more insight on the multipath side.
>If I understand the kernel change correctly, this can happen when there is
>a mounted file system on top of the multipath device. I don't think we have
>a use case when qemu accesses a multipath device when the device is used
>by a file system, but maybe I missed something.
>
>So that to me implies
>that we actually should not retry BLKZEROOUT, because the EBUSY will
>remain, and that condition won’t change while the block device is in use
>by qemu.
>
>On the other hand, in the code, you have decided not to reset
>has_write_zeroes to false, so the implementation will retry.
>
>EBUSY is usually a temporary error, so retrying makes sense. The question
>is if we really can write zeroes manually in this case?
>
>So I don’t quite understand. Should we keep trying BLKZEROOUT or is
>there no chance of it working after it has at one point failed with
>EBUSY? (Are there other cases besides what’s described in this commit
>message where EBUSY might be returned and it is only temporary?)
>
>> Fallback to pwritev instead of exit for -EBUSY error.
>>
>> The issue was introduced in Linux 5.10:
>>
>> Fixed in Linux 5.12:
>>
>> ---
>> block/file-posix.c | 8 ++++++--
>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/file-posix.c b/block/file-posix.c
>> index 20e14f8e96..d4054ac9cb 100644
>> --- a/block/file-posix.c
>> +++ b/block/file-posix.c
>> @@ -1624,8 +1624,12 @@ static ssize_t
>> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
>> } while (errno == EINTR);
>>
>> ret = translate_err(-errno);
>> - if (ret == -ENOTSUP) {
>> - s->has_write_zeroes = false;
>> + switch (ret) {
>> + case -ENOTSUP:
>> + s->has_write_zeroes = false; /* fall through */
>> + case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath
>> devices */
>> + return -ENOTSUP;
>> + break;
>
>(Not sure why this break is here.)
>
>Max
>
>> }
>> }
>> #endif
>> --
>> 2.27.0
>>