[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Inscrutable CI jobs (avocado & Travis s390 check-tcg)
From: |
Alex Bennée |
Subject: |
Re: Inscrutable CI jobs (avocado & Travis s390 check-tcg) |
Date: |
Fri, 23 Sep 2022 11:08:00 +0100 |
User-agent: |
mu4e 1.9.0; emacs 28.2.50 |
Thomas Huth <thuth@redhat.com> writes:
> On 23/09/2022 09.28, Daniel P. Berrangé wrote:
>> On Thu, Sep 22, 2022 at 03:04:12PM -0400, Stefan Hajnoczi wrote:
>>> QEMU's avocado and Travis s390x check-tcg CI jobs fail often and I don't
>>> know why. I think it's due to timeouts but maybe there is something
>>> buried in the logs that I missed.
>>>
>>> I waste time skimming through logs when merging qemu.git pull requests
>>> and electricity is wasted on tests that don't produce useful pass/fail
>>> output.
>>>
>>> Here are two recent examples:
>>> https://gitlab.com/qemu-project/qemu/-/jobs/3070754718
>>> https://app.travis-ci.com/gitlab/qemu-project/qemu/jobs/583629583
>>>
>>> If there are real test failures then the test output needs to be
>>> improved so people can identify failures.
>>>
>>> If the tests are timing out then they need to be split up and/or reduced
>>> in duration. BTW, if it's a timeout, why are we using an internal
>>> timeout instead of letting CI mark the job as timed out?
>>>
>>> Any other ideas for improving these CI jobs?
>> The avocado job there does show the errors, but the summary at the
>> end leaves something to be desired. At first glance it looked like
>> everything passed because it says "ERROR 0" and that's what caught
>> my eye. Took a long time to notice the 'INTERRUPT 5' bit is actually
>> just an error state too. I don't understand why it has to have so
>> many different ways of saying the same thing:
>> RESULTS : PASS 14 | ERROR 0 | FAIL 0 | SKIP 37 | WARN 0 |
>> INTERRUPT 5 | CANCEL 136
>> "ERROR", "FAIL" and "INTERRUPT" are all just the same thing
>> "SKIP" and "CANCEL" are just the same thing
>> I'm sure there was some reason for these different terms, but IMHO
>> they
>> are actively unhelpful.
>> For example I see no justiable reason for the choice of SKIP vs
>> CANCEL
>> in these two messages:
>> (173/192)
>> tests/avocado/virtiofs_submounts.py:VirtiofsSubmountsTest.test_pre_launch_set_up:
>> SKIP: sudo -n required, but "sudo -n true" failed: [Errno 2] No such
>> file or directory: 'sudo'
>> (183/192)
>> tests/avocado/x86_cpu_model_versions.py:X86CPUModelAliases.test_4_1_alias:
>> CANCEL: No QEMU binary defined or found in the build tree (0.00 s)
>> It would be clearer to understand the summary as:
>> RESULTS: PASS 14 | ERROR 5 | SKIP 173 | WARN 0
>> I'd also like to see it repeat the error messages for the failed
>> tests at the end, so you don't have to search back up through the
>> huge log to find them.
>> On the TCG tests we see
>> imeout --foreground 90
>> /home/travis/build/qemu-project/qemu/build/qemu-s390x noexec >
>> noexec.out
>> make[1]: *** [../Makefile.target:158: run-noexec] Error 1
>> make[1]: Leaving directory
>> '/home/travis/build/qemu-project/qemu/build/tests/tcg/s390x-linux-user'
>> make: ***
>> [/home/travis/build/qemu-project/qemu/tests/Makefile.include:60:
>> run-tcg-tests-s390x-linux-user] Error 2
>> I presume that indicates the 'noexec' test failed, but we have zero
>> info.
>
> I think this is the bug that will be fixed by Ilya's patch here:
>
> https://lists.gnu.org/archive/html/qemu-devel/2022-09/msg02756.html
>
> But I agree, it is unfortunate that the output is not available.
> Looking at this on my s390x box:
>
> $ cat tests/tcg/s390x-linux-user/noexec.out
> [ RUN ] fallthrough
> [ OK ]
> [ RUN ] jump
> [ FAILED ] unexpected SEGV
>
> so there is an indication of what's going wrong in there indeed.
>
> Alex, would it be possible to change the tcg test harness to dump the
> .out file of failing tests?
Yes I think so, either by tweaking the run-% rules in
tests/tcg/Makefile.target to handle a failed call or possibly expanding
the run-test macro itself.
>
> Thomas
--
Alex Bennée