Re: Lost partition tables on ide-hd + ahci drive
From: Fiona Ebner
Subject: Re: Lost partition tables on ide-hd + ahci drive
Date: Thu, 15 Jun 2023 09:04:19 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.12.0

On 14.06.23 at 16:48, Simon J. Rowe wrote:
> On 02/02/2023 12:08, Fiona Ebner wrote:
>> Hi,
>> over the years we've got 1-2 dozen reports[0] about suddenly
>> missing/corrupted MBR/partition tables. The issue seems to be very rare
>> and there was no success in trying to reproduce it yet. I'm asking here
>> in the hope that somebody has seen something similar.
>>
>> The only commonality seems to be the use of an ide-hd drive with ahci
>> bus.
>>
>> It does seem to happen with both Linux and Windows guests (one of the
>> reports even mentions FreeBSD) and backing storages for the VMs include
>> ZFS, RBD, LVM-Thin as well as file-based storages.
>>
>> Relevant part of an example configuration:
>>
>>> -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
>>> -drive 'file=/dev/zvol/myzpool/vm-168-disk-0,if=none,id=drive-sata0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
>>> -device 'ide-hd,bus=ahci0.0,drive=drive-sata0,id=sata0' \
>> The first reports are from before io_uring was used and there are also
>> reports with writeback cache mode and discard=on,detect-zeroes=unmap.
>>
>> Some reports say that the issue occurred under high IO load.
>>
>> Many reports suspect backups causing the issue. Our backup mechanism
>> uses backup_job_create() for each drive and runs the jobs sequentially.
>> It uses a custom block driver as the backup target which just forwards
>> the writes to the actual target which can be a file or our backup server.
>> (If you really want to see the details, apply the patches in [1] and see
>> pve-backup.c and block/backup-dump.c).
>>
>> Of course, the backup job will read sector 0 of the source disk, but I
>> really can't see where a stray write would happen, why the issue would
>> trigger so rarely or why seemingly only ide-hd+ahci would be affected.
>>
>> So again, just asking if somebody has seen something similar or has a
>> hunch of what the cause might be.
>>
>> [0]: https://bugzilla.proxmox.com/show_bug.cgi?id=2874
>> [1]:
>> https://git.proxmox.com/?p=pve-qemu.git;a=tree;f=debian/patches;hb=HEAD
>>
>>
> We've also seen a handful of similar reports. Again, just the MBR sector
> overwritten by what looks to be guest data (e.g. log messages). The
> common thread with our incidents is again a SATA disk under the AHCI
> controller, we have a network backend (iSCSI) which has experienced a
> failure.
>
> I've tried to repro this with blkdebug and simulated write errors,
> without success.
>
Hi,
which version/build of QEMU are you using? Can you correlate the issue
with any block job or was the drive in use by the guest only?
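
For anyone who wants to check disks periodically and narrow down when the
overwrite happens, here is a minimal sketch of a sector 0 sanity check. It
is only an illustration, not part of QEMU or our tooling: the function names
are made up, and it assumes a classic MBR layout (four 16-byte entries at
offset 446, boot signature 0x55 0xAA at offset 510). A present signature
does not prove the table is intact, but a missing one matches the reported
symptom.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446      # start of the four 16-byte partition entries
SIGNATURE_OFFSET = 510  # two-byte boot signature


def looks_like_mbr(sector: bytes) -> bool:
    """Return True if the sector ends with the 0x55 0xAA boot signature."""
    if len(sector) < SECTOR_SIZE:
        return False
    return sector[SIGNATURE_OFFSET:SIGNATURE_OFFSET + 2] == b"\x55\xaa"


def partition_entries(sector: bytes):
    """Return (index, type, start_lba, num_sectors) for non-empty entries."""
    entries = []
    for i in range(4):
        raw = sector[TABLE_OFFSET + 16 * i:TABLE_OFFSET + 16 * (i + 1)]
        # Layout: status, CHS start (3 bytes, skipped), type,
        # CHS end (3 bytes, skipped), start LBA, sector count.
        _status, ptype, start_lba, count = struct.unpack("<B3xB3xII", raw)
        if ptype != 0:
            entries.append((i, ptype, start_lba, count))
    return entries


def read_sector0(path: str) -> bytes:
    """Read the first 512 bytes of a disk or image, opened read-only."""
    with open(path, "rb") as f:
        return f.read(SECTOR_SIZE)
```

Running looks_like_mbr(read_sector0(path)) against the backing device or
image (the path is whatever your setup uses) from the host at regular
intervals, for example before and after each backup, would at least give a
time window for when sector 0 changed.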
Best Regards,
Fiona