qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [for-6.0] Non-deterministic qemu-iotests qsd-jobs failures


From: John Snow
Subject: Re: [for-6.0] Non-deterministic qemu-iotests qsd-jobs failures
Date: Tue, 30 Mar 2021 13:00:57 -0400
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1

On 3/30/21 10:38 AM, Stefan Hajnoczi wrote:
On Mon, Mar 29, 2021 at 05:43:28PM +0200, Max Reitz wrote:
On 29.03.21 17:05, Stefan Hajnoczi wrote:
Hi Kevin,
Peter hit an s390 qemu-iotests failure. I was able to reproduce it
easily. I haven't checked if it reproduce on other platforms too, but it
seems likely. Here is what I found.

qsd-jobs has race conditions:

    # Just make sure that this doesn't crash
    $QSD --chardev stdio,id=stdio --monitor chardev=stdio \
        --blockdev node-name=file0,driver=file,filename="$TEST_IMG" \
        --blockdev node-name=fmt0,driver=qcow2,file=file0 <<EOF | _filter_qmp
    {"execute":"qmp_capabilities"}
    {"execute": "block-commit", "arguments": {"device": "fmt0", "job-id": 
"job0"}}
    {"execute": "quit"}
    EOF

The intent is for "quit" to run while job0 still exists. This proves
that the program does not abort with an assertion failure when "quit" is
called while there are block jobs. bdrv_close_all() expects there to be
no jobs.

Here is the failure that reproduces 1 out of 3 times:

    # build/tests/qemu-iotests/check -qcow2 qsd-jobs
    QEMU          -- 
"/root/qemu/build/tests/qemu-iotests/../../qemu-system-x86_64" -nodefaults 
-display none -accel qtest
    QEMU_IMG      -- "/root/qemu/build/tests/qemu-iotests/../../qemu-img"
    QEMU_IO       -- "/root/qemu/build/tests/qemu-iotests/../../qemu-io" 
--cache writeback --aio threads -f qcow2
    QEMU_NBD      -- "/root/qemu/build/tests/qemu-iotests/../../qemu-nbd"
    IMGFMT        -- qcow2
    IMGPROTO      -- file
    PLATFORM      -- Linux/s390x
    TEST_DIR      -- /root/qemu/scratch
    SOCK_DIR      -- /tmp/tmpapj5dydm
    SOCKET_SCM_HELPER -- /root/qemu/build/tests/qemu-iotests/socket_scm_helper

    qsd-jobs   fail       [10:42:23] [10:42:23]   0.1s   (last: 0.1s)  output 
mismatch (see qsd-jobs.out.bad)
    --- /root/qemu/tests/qemu-iotests/tests/qsd-jobs.out
    +++ qsd-jobs.out.bad
    @@ -9,11 +9,11 @@
     {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": 
{"status": "created", "id": "job0"}}
     {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": 
{"status": "running", "id": "job0"}}
     {"return": {}}
    +{"return": {}}
    +{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": 
{"status": "paused", "id": "job0"}}
    +{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": 
{"status": "running", "id": "job0"}}
     {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": 
{"status": "ready", "id": "job0"}}
     {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_READY", "data": {"device": "job0", 
"len": 0, "offset": 0, "speed": 0, "type": "commit"}}
    -{"return": {}}
    -{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": 
{"status": "standby", "id": "job0"}}
    -{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": 
{"status": "ready", "id": "job0"}}
     {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": 
{"status": "waiting", "id": "job0"}}
     {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": 
{"status": "pending", "id": "job0"}}
     {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": 
"job0", "len": 0, "offset": 0, "speed": 0, "type": "commit"}}
    Failures: qsd-jobs
    Failed 1 of 1 iotests

The timing of the QMP events seems to be non-deterministic relative to
the QMP command completions.

It's more complicated than simply lines moving up/down in the expected
output: the job state transitions are slightly different (running ->
ready vs running -> paused).

Should the test case filter out all JOB_* and BLOCK_JOB_* events?

Is there a way to make this test deterministic?

Would writing data to the overlay and setting a very restrictive speed help,
so it doesn’t immediately go into the 'ready' state?  (See attachment.)

Otherwise, I think we could take my “Allow using the QSD” patch
(https://lists.nongnu.org/archive/html/qemu-block/2021-03/msg00654.html) to
rewrite the patch to use common.qemu to test QSD.  With that we could issue
the block-commit command, wait for BLOCK_JOB_READY, and only then issue
'quit'.

Ideally we would have the block job completely controlled by the test.
Either using blkdebug to suspend an I/O request or by having a test job
that is designed for entering specific states and waiting for the test.

I think both your ideas are an improvement on the current test and am
happy with either. Fully controlling job execution isn't simple so I
think it's okay to merge a smaller solution even if it relies on
internals or is still a race condition (but a really long one!).

Stefan


I have been following this only loosely, why exactly is the test non-deterministic? I suppose because we don't know what state the job will be in by the time we are able to submit the QUIT command, which makes the state transition events different.

There's a little hook we use for transactions to start jobs in the PAUSED state and never let them actually run at all. I think all jobs use it, but we unpause singly-issued jobs right away. Can we utilize this to add a debug mode that allows us to create jobs in the paused state from the get-go?

(I'd suggest adding pause as a transaction, but commit isn't supported for those either ...)

As long as the job takes the necessary locks/permissions/etc before it yields at the first pause point, it might be good enough to test what you want.

It might mean yet-another parameter to feed to all the job creation commands... I did at one point in time really want to make a generic "job_create type=commit ..." style command, but I recall finding it difficult to work out how the per-command arguments would be validated from the central level.

Ehm. The driving and testing of jobs is important, but I think everything I can think of is a bit more work than just filtering some of the responses for now.

--js




reply via email to

[Prev in Thread] Current Thread [Next in Thread]