Re: [PATCH] gitlab: remove unreliable avocado CI jobs
From: Stefan Hajnoczi
Subject: Re: [PATCH] gitlab: remove unreliable avocado CI jobs
Date: Tue, 12 Sep 2023 14:52:40 -0400

On Tue, 12 Sept 2023 at 14:36, Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Stefan Hajnoczi <stefanha@gmail.com> writes:
>
> > On Tue, Sep 12, 2023, 12:14 Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Tue, Sep 12, 2023 at 05:01:26PM +0100, Alex Bennée wrote:
> > >
> > > Daniel P. Berrangé <berrange@redhat.com> writes:
> > >
> > > > On Tue, Sep 12, 2023 at 11:06:11AM -0400, Stefan Hajnoczi wrote:
> > > >> The avocado-system-alpine, avocado-system-fedora, and
> > > >> avocado-system-ubuntu jobs are unreliable. I identified them while
> > > >> looking over CI failures from the past week:
> > > >> https://gitlab.com/qemu-project/qemu/-/jobs/5058610614
> > > >> https://gitlab.com/qemu-project/qemu/-/jobs/5058610654
> > > >> https://gitlab.com/qemu-project/qemu/-/jobs/5030428571
> > > >>
> > > > >> Thomas Huth suggested on IRC today that there may be a legitimate failure
> > > > >> in there:
> > > >>
> > > > >> th_huth: f4bug, yes, seems like it does not start at all correctly on
> > > > >> alpine anymore ... and it's broken since ~ 2 weeks already, so if nobody
> > > > >> noticed this by now, this is worrying
> > > >>
> > > >> It crept in because the jobs were already unreliable.
> > > >>
> > > > >> I don't know how to interpret the job output, so all I can do is to
> > > > >> propose removing these jobs. A useful CI job has two outcomes: pass or
> > > > >> fail. Timeouts and other in-between states are not useful because they
> > > > >> require constant triaging by someone who understands the details of the
> > > > >> tests and they can occur when run against pull requests that have
> > > > >> nothing to do with the area covered by the test.
> > > >>
> > > > >> Hopefully test owners will be able to identify the root causes and solve
> > > > >> them so that these jobs can stay. In their current state the jobs are
> > > > >> not useful since I cannot tell whether job failures are real or
> > > > >> just intermittent when merging qemu.git pull requests.
> > > >>
> > > >> If you are a test owner, please take a look.
> > > >>
> > > >> It is likely that other avocado-system-* CI jobs have similar failures
> > > >> from time to time, but I'll leave them as long as they are passing.
> > > >>
> > > >> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/1884
> > > >> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > >> ---
> > > >> .gitlab-ci.d/buildtest.yml | 27 ---------------------------
> > > >> 1 file changed, 27 deletions(-)
> > > >>
> > > >> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
> > > >> index aee9101507..83ce448c4d 100644
> > > >> --- a/.gitlab-ci.d/buildtest.yml
> > > >> +++ b/.gitlab-ci.d/buildtest.yml
> > > >> @@ -22,15 +22,6 @@ check-system-alpine:
> > > > >>      IMAGE: alpine
> > > > >>      MAKE_CHECK_ARGS: check-unit check-qtest
> > > > >>
> > > > >> -avocado-system-alpine:
> > > > >> -  extends: .avocado_test_job_template
> > > > >> -  needs:
> > > > >> -    - job: build-system-alpine
> > > > >> -      artifacts: true
> > > > >> -  variables:
> > > > >> -    IMAGE: alpine
> > > > >> -    MAKE_CHECK_ARGS: check-avocado
> > > >
> > > > Instead of entirely deleting, I'd suggest adding
> > > >
> > > > # Disabled due to frequent random failures
> > > > # https://gitlab.com/qemu-project/qemu/-/issues/1884
> > > > when: manual
> > > >
> > > > See example: https://docs.gitlab.com/ee/ci/yaml/#when
> > > >
> > > > This disables the job from running unless someone explicitly
> > > > tells it to run
> > >
> > > What I don't understand is why we didn't gate the release back when they
> > > first tripped. We should have noticed between:
> > >
> > > https://gitlab.com/qemu-project/qemu/-/pipelines/956543770
> > >
> > > and
> > >
> > > https://gitlab.com/qemu-project/qemu/-/pipelines/957154381
> > >
> > > > that the system tests were regressing. Yet we merged the changes
> > > anyway.
> >
> > I think that green series is misleading, based on Richard's
> > mail on list wrt the TCG pull series:
> >
> > https://lists.gnu.org/archive/html/qemu-devel/2023-08/msg04014.html
> >
> > "It's some sort of timing issue, which sometimes goes away
> > when re-run. I was re-running tests *a lot* in order to
> > get them to go green while running the 8.1 release. "
>
> But I think in that actual case a change exposed a race condition which
> has only recently been fixed - however we've had additional regressions
> since.
>
> Rather than kill the system tests we can disable the flaky individual
> tests in avocado.

That would be nice, please send an alternative patch.

I can't do that myself because there are a bunch of test cases with
suspicious output and I don't know which ones are legitimate failures,
intermittent problems, or expected failures.

Stefan

>
> >
> > Essentially I'd put this down to the tests being so non-deterministic
> > that we've given up trusting them.
> >
> > Yes.
> >
> > Stefan
> >
> > With regards,
> > Daniel
> > --
> > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> > |: https://libvirt.org -o- https://fstop138.berrange.com :|
> > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>
>
> --
> Alex Bennée
> Virtualisation Tech Lead @ Linaro
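
For reference, a minimal sketch of the "when: manual" alternative suggested
earlier in the thread, applied to the avocado-system-alpine job from the
patch. The job body is copied from the lines the patch removes; the
indentation and the placement of the "when" key are assumptions, and how the
key interacts with anything inherited from .avocado_test_job_template has not
been checked here.

```yaml
avocado-system-alpine:
  extends: .avocado_test_job_template
  needs:
    - job: build-system-alpine
      artifacts: true
  variables:
    IMAGE: alpine
    MAKE_CHECK_ARGS: check-avocado
  # Disabled due to frequent random failures
  # https://gitlab.com/qemu-project/qemu/-/issues/1884
  # Assumed placement: job-level key, so the job only runs when someone
  # explicitly starts it from the pipeline page.
  when: manual
```

Presumably the same change would apply to the avocado-system-fedora and
avocado-system-ubuntu jobs; it keeps the job definitions in the tree so they
can be re-enabled once the root causes are understood.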