qemu-block

Re: iotest 030 SIGSEGV


From: Vladimir Sementsov-Ogievskiy
Subject: Re: iotest 030 SIGSEGV
Date: Fri, 15 Oct 2021 18:24:00 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.1.0

15.10.2021 12:38, Paolo Bonzini wrote:
On 14/10/21 18:14, Vladimir Sementsov-Ogievskiy wrote:

iotest 030 failing is a long story... As I understand it, the main source of all
these crashes is that we do different graph modifications simultaneously from
parallel block jobs.

In the past I sent an RFC series with a global mutex, to fix a subset of the problem: https://patchew.org/QEMU/20201120161622.1537-1-vsementsov@virtuozzo.com/ [just look at patch 5: https://patchew.org/QEMU/20201120161622.1537-1-vsementsov@virtuozzo.com/20201120161622.1537-6-vsementsov@virtuozzo.com/]

Can you explain the way they interleave, and where the job callbacks are 
yielding in the middle of graph manipulations?

Not exactly, and I don't think it's worth recovering this concrete old problem
about permissions: too much has changed since then, especially in the block
permission update system.

So, I can only refer to my old comments on it:

  OK, after some debugging and looking at block-graph dumps I tend to think that
  this is a race between the finish (.prepare) of mirror and block-stream. They
  do graph updates. Nothing prevents interleaving of graph-updating operations
  (note that bdrv_replace_child_noperm may do aio_poll). And nothing protects
  two processes of graph update from intersecting.

and

  aio_poll at the start of mirror_exit_common is my addition. But anyway the
  problem is here: we call mirror_prepare inside of stream_prepare!

(https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05181.html)

At this link there is a good core dump where we are in stream_prepare: we do 
bdrv_change_backing_file(), which finally calls bdrv_pwritev(), which leads to 
aio_poll, during which we switch to mirror_prepare().

and

  2. The core problem is that graph modification functions may trigger
  context-switch due to nested aio_polls.. which leads to (for example) nested
  permission updates..

(https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05212.html)


The problem with a CoMutex is that plenty of graph manipulations happen outside 
coroutines, and if coroutines such as stream_co_clean yield, the monitor can do 
graph manipulations of its own.

So if the solution could be "no yielding in the middle of graph manipulations", that 
would be much better.  In fact, maybe the coroutine API should have a way to assert "no-yield 
regions" (similar to how Linux croaks if you call a sleeping function while preemption is 
disabled). More assertions = more bugs found early.

Not sure that it's possible to fix bdrv_change_backing_file() in this way... If 
some graph modifications are connected with updating metadata in images, we want 
to write data and therefore to yield. And the whole operation, both updating the 
metadata in the image and updating the graph, should be protected from 
interleaving with another graph-modifying operation.


Not sure it's worth trying to recover it. I didn't look closely at Emanuele's 
"block layer: split block APIs in global state and I/O". Wasn't there something 
on protecting graph operations?

In his series, graph operations are supposed to operate from the main thread 
(which they do) but he didn't cover the case of coroutines that yield.

Paolo


Maybe it's possible to develop a critical section, a kind of mutex, that can be 
used both inside and outside of coroutines? So that a coroutine would yield 
until it can acquire the mutex, while non-coroutine code would poll.

--
Best regards,
Vladimir


