Re: [RFC] QEMU Gating CI


From: Peter Maydell
Subject: Re: [RFC] QEMU Gating CI
Date: Tue, 3 Dec 2019 17:54:38 +0000

On Mon, 2 Dec 2019 at 14:06, Cleber Rosa <address@hidden> wrote:
>
> RFC: QEMU Gating CI
> ===================
>
> This RFC attempts to address most of the issues described in
> "Requirements/GatinCI"[1].  An also relevant write up is the "State of
> QEMU CI as we enter 4.0"[2].
>
> The general approach is to minimize the infrastructure maintenance
> and development burden, leveraging as much as possible "other people's"
> infrastructure and code.  GitLab's CI/CD platform is the most relevant
> component dealt with here.

Thanks for writing up this RFC.

My overall view is that there's some interesting stuff in
here and definitely some things we'll want to cover at some
point, but there's also a fair amount that is veering away
from solving the immediate problem we want to solve, and
which we should thus postpone for later (beyond making some
reasonable efforts not to design something which paints us
into a corner so it's annoyingly hard to improve later).

> To exemplify my point, if one specific test run as part of "check-tcg"
> is found to be faulty on a specific job (say on a specific OS), the
> entire "check-tcg" test set may be disabled as a CI-level maintenance
> action.  Of course a follow-up action to deal with the specific test
> is required, probably in the form of a Launchpad bug and patches
> dealing with the issue, but not necessarily with a CI-related angle to
> it.
>
> If/when the test result presentation and control mechanisms evolve, we
> may feel confident enough to move to a finer granularity.  For instance, a
> mechanism for disabling nothing but "tests/migration-test" on a given
> environment would be possible and desirable from a CI management
> perspective.

For instance, today we have no granularity at all in defining which
tests run where, or where they are disabled.
So we don't need it in order to move away from the scripting
approach I have at the moment. We can just say "the CI system
will run make and make check (and maybe in some hosts some
additional test-running commands) on these hosts" and hardcode
that into whatever yaml file the CI system's configured in.
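
For reference, a minimal sketch of what such a hardcoded job could look
like in the CI yaml file (job name, tags and configure flags here are
purely illustrative, not a proposal):

   build-and-check:
    tags:
    - qemu-ci-host
    script:
    - ./configure
    - make -j2
    - make check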

> Pre-merge
> ~~~~~~~~~
>
> The natural way to have pre-merge CI jobs in GitLab is to send "Merge
> Requests"[3] (abbreviated as "MR" from now on).  In most projects, a
> MR comes from individual contributors, usually the authors of the
> changes themselves.  It's my understanding that the current maintainer
> model employed in QEMU will *not* change at this time, meaning that
> code contributions and reviews will continue to happen on the mailing
> list.  A maintainer then, having collected a number of patches, would
> submit an MR either in addition to, or in substitution for, the Pull
> Requests sent to the mailing list.

Eventually it would be nice to allow any submaintainer
to send a merge request to the CI system (though you would
want it to have a "but don't apply until somebody else approves it"
gate as well as the automated testing part). But right now all
we need is for the one person managing merges and releases
to be able to say "here's the branch where I merged this
pull request, please test it". At any rate, supporting multiple
submaintainers all talking to the CI independently should be
out of scope for now.
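
In GitLab terms, "here's the branch, please test it" can be as simple as
the person doing the merge pushing the merged branch to the project that
carries the CI configuration; the push itself starts the pipeline for
that branch. A rough sketch (remote and branch names are placeholders):

   # merge the pull request locally as usual, then push the result to the
   # GitLab project carrying the CI configuration; this triggers the pipeline
   git push gitlab-ci-mirror staging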

> Multi-maintainer model
> ~~~~~~~~~~~~~~~~~~~~~~
>
> The previous section already introduced some of the proposed workflow
> that can enable such a multi-maintainer model.  With a Gating CI
> system, though, it will be natural to have a smaller "Mean time
> between (CI) failures", simply because of the expected increased
> number of systems and checks.  A lot of countermeasures have to be
> employed to keep that MTBF in check.
>
> For one, it's imperative that the maintainers for such systems and
> jobs are clearly defined and readily accessible.  Either the same
> MAINTAINERS file or a more suitable variation of such data should be
> defined before activating the *gating* rules.  This would allow
> requests to be routed to the attention of the responsible maintainer.
>
> In case of unresponsive maintainers, or any other condition that
> renders and keeps one or more CI jobs failing for a given previously
> established amount of time, the job can be demoted with an
> "allow_failure" configuration[7].  Once such a change is committed, the
> path to promotion would be just the same as in a newly added job
> definition.
>
> Note: In a future phase we can evaluate the creation of rules that
> map the changed paths in an MR (similar to "F:" entries in MAINTAINERS)
> to the execution of specific CI jobs, which would be the
> responsibility of a given maintainer[8].

All this stuff is not needed to start with. We cope at the
moment with "everything is gating, and if something doesn't
pass it needs to be fixed or manually removed from the setup".
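
For reference, the "allow_failure" demotion mentioned in the quoted text
above is a one-line change to the job definition; roughly (job name is
illustrative):

   some-flaky-job:
    allow_failure: true   # job still runs, but its failure no longer gates the pipeline
    script:
    - make check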

> GitLab Jobs and Pipelines
> -------------------------
>
> GitLab CI is built around two major concepts: jobs and pipelines.  The
> current GitLab CI configuration in QEMU uses jobs only (or putting it
> another way, all jobs in a single pipeline stage).  Consider the
> following job definition[9]:
>
>    build-tci:
>     script:
>     - TARGETS="aarch64 alpha arm hppa m68k microblaze moxie ppc64 s390x 
> x86_64"
>     - ./configure --enable-tcg-interpreter
>          --target-list="$(for tg in $TARGETS; do echo -n ${tg}'-softmmu '; 
> done)"
>     - make -j2
>     - make tests/boot-serial-test tests/cdrom-test tests/pxe-test
>     - for tg in $TARGETS ; do
>         export QTEST_QEMU_BINARY="${tg}-softmmu/qemu-system-${tg}" ;
>         ./tests/boot-serial-test || exit 1 ;
>         ./tests/cdrom-test || exit 1 ;
>       done
>     - QTEST_QEMU_BINARY="x86_64-softmmu/qemu-system-x86_64" ./tests/pxe-test
>     - QTEST_QEMU_BINARY="s390x-softmmu/qemu-system-s390x" ./tests/pxe-test -m 
> slow
>
> All the lines under "script" are executed sequentially.  It should be
> clear that there's the possibility of breaking this down into multiple
> stages, so that a build happens first, and then a common set of tests
> runs in parallel.

We could do this, but we don't do it today, so we don't need
to think about this at all to start with.
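
For later reference, a split into stages would look roughly like the
sketch below, with the build output handed to the test job as an
artifact (job names, target list and artifact paths are illustrative;
the artifact limits discussed next are exactly where this gets awkward):

   stages:
   - build
   - test

   build-x86_64:
    stage: build
    script:
    - ./configure --target-list=x86_64-softmmu
    - make -j2
    artifacts:
     paths:
     - x86_64-softmmu/    # in practice the tests need much more of the build tree than this

   check-x86_64:
    stage: test
    script:
    - make check           # would need the build tree restored, hence the artifact problem below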

> In theory, there's nothing that prevents an entire QEMU build
> directory from being treated as an artifact.  In practice, there are
> predefined limits on GitLab that prevent that from being possible,

...so we don't need to worry about somehow defining some
cut-down "build artefact" that we provide to the testing
phase. Just do a build and test run as a single thing.
We can always come back and improve later.


Have you been able to investigate and confirm that we can
get a gitlab-runner setup that works on non-x86? That seems
to me like an important thing we should be confident about
early before we sink too much effort into a gitlab-based
solution.
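
For what it's worth, gitlab-runner is written in Go, so building and
running it on non-x86 hosts is at least plausible; registering a runner
on, say, an aarch64 box would look roughly like this (URL, token,
executor and tags are placeholders, and this hasn't been verified on
such hosts):

   sudo gitlab-runner register \
     --non-interactive \
     --url https://gitlab.com/ \
     --registration-token <TOKEN> \
     --executor shell \
     --description "qemu-aarch64-runner" \
     --tag-list aarch64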

thanks
-- PMM


