Re: [RFC] QEMU Gating CI


From: Cleber Rosa
Subject: Re: [RFC] QEMU Gating CI
Date: Thu, 5 Dec 2019 00:05:37 -0500
User-agent: Mutt/1.12.1 (2019-06-15)

On Tue, Dec 03, 2019 at 05:54:38PM +0000, Peter Maydell wrote:
> On Mon, 2 Dec 2019 at 14:06, Cleber Rosa <address@hidden> wrote:
> >
> > RFC: QEMU Gating CI
> > ===================
> >
> > This RFC attempts to address most of the issues described in
> > "Requirements/GatinCI"[1].  An also relevant write up is the "State of
> > QEMU CI as we enter 4.0"[2].
> >
> > The general approach is to minimize the infrastructure maintenance
> > and development burden, leveraging as much as possible "other people's"
> > infrastructure and code.  GitLab's CI/CD platform is the most relevant
> > component dealt with here.
> 
> Thanks for writing up this RFC.
> 
> My overall view is that there's some interesting stuff in
> here and definitely some things we'll want to cover at some
> point, but there's also a fair amount that is veering away
> from solving the immediate problem we want to solve, and
> which we should thus postpone for later (beyond making some
> reasonable efforts not to design something which paints us
> into a corner so it's annoyingly hard to improve later).
>

Right.  I think this is a valid perspective to consider as we define
the order and scope of tasks.  I'll follow up with a more
straightforward suggestion containing the bare minimum actions for a
first round.

> > To exemplify my point, if one specific test run as part of "check-tcg"
> > is found to be faulty on a specific job (say on a specific OS), the
> > entire "check-tcg" test set may be disabled as a CI-level maintenance
> > action.  Of course a follow up action to deal with the specific test
> > is required, probably in the form of a Launchpad bug and patches
> > dealing with the issue, but without necessarily a CI related angle to
> > it.
> >
> > If/when test result presentation and control mechanisms evolve, we may
> > feel confident enough to go into finer granularity.  For instance, a
> > mechanism for disabling nothing but "tests/migration-test" on a given
> > environment would be possible and desirable from a CI management level.
> 
> For instance, we don't have anything today for granularity of
> definition of what tests we run where or where we disable them.
> So we don't need it in order to move away from the scripting
> approach I have at the moment. We can just say "the CI system
> will run make and make check (and maybe in some hosts some
> additional test-running commands) on these hosts" and hardcode
> that into whatever yaml file the CI system's configured in.
>

I absolutely agree.  That's why I even framed it as *if* this will be
done, and not only *when*.  Because I happen to be biased from working
on a test runner/framework, this is something I had to at least bring
up, so that it can be evaluated and maybe turned into a goal.
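
Just to make it concrete, the "hardcoded" starting point could be as
simple as something like this (a sketch only, with a made up job name
and runner tag):

   build-and-check:
    tags:
    - some-host-or-arch-tag
    script:
    - ./configure
    - make -j2
    - make check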

> > Pre-merge
> > ~~~~~~~~~
> >
> > The natural way to have pre-merge CI jobs in GitLab is to send "Merge
> > Requests"[3] (abbreviated as "MR" from now on).  In most projects, a
> > MR comes from individual contributors, usually the authors of the
> > changes themselves.  It's my understanding that the current maintainer
> > model employed in QEMU will *not* change at this time, meaning that
> > code contributions and reviews will continue to happen on the mailing
> > list.  A maintainer then, having collected a number of patches, would
> > submit a MR either in addition or in substitution to the Pull Requests
> > sent to the mailing list.
> 
> Eventually it would be nice to allow any submaintainer
> to send a merge request to the CI system (though you would
> want it to have a "but don't apply until somebody else approves it"
> gate as well as the automated testing part). But right now all
> we need is for the one person managing merges and releases
> to be able to say "here's the branch where I merged this
> pullrequest, please test it". At any rate, supporting multiple
> submaintainers all talking to the CI independently should be
> out of scope for now.
>

OK, noted.
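
For that single-person case, the interaction with GitLab could be as
simple as pushing the branch with the merged pull request to a GitLab
mirror of the repo, since the push by itself triggers the configured
pipeline.  Roughly (remote and branch names are just placeholders):

   # assuming a "gitlab" remote pointing to the project's GitLab mirror
   $ git push gitlab staging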

> > Multi-maintainer model
> > ~~~~~~~~~~~~~~~~~~~~~~
> >
> > The previous section already introduced some of the proposed workflow
> > that can enable such a multi-maintainer model.  With a Gating CI
> > system, though, it will be natural to have a smaller "Mean time
> > between (CI) failures", simply because of the expected increased
> > number of systems and checks.  A lot of countermeasures have to be
> > employed to keep that MTBF in check.
> >
> > For one, it's imperative that the maintainers for such systems and
> > jobs are clearly defined and readily accessible.  Either the same
> > MAINTAINERS file or a more suitable variation of such data should be
> > defined before activating the *gating* rules.  This would allow
> > requests to be routed to the attention of the responsible maintainer.
> >
> > In case of unresponsive maintainers, or any other condition that
> > renders and keeps one or more CI jobs failing for a given previously
> > established amount of time, the job can be demoted with an
> > "allow_failure" configuration[7].  Once such a change is commited, the
> > path to promotion would be just the same as in a newly added job
> > definition.
> >
> > Note: In a future phase we can evaluate the creation of rules that
> > look at changed paths in a MR (similar to "F:" entries on MAINTAINERS)
> > and the execution of specific CI jobs, which would be the
> > responsibility of a given maintainer[8].
> 
> All this stuff is not needed to start with. We cope at the
> moment with "everything is gating, and if something doesn't
> pass it needs to be fixed or manually removed from the setup".
>

OK, I get your point.  But I think it's fair to say that one big
motivation we also have for this work is to be able to bring new
machines and jobs into the Gating CI in the very near future.  And to
do that, we must set common rules, so that anyone else can do the same
and abide by the same terms.
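
And to be explicit about one of those rules, the "allow_failure"
demotion mentioned in the RFC is a one line change to the job
definition, something like (job name made up):

   build-and-check-s390x:
    allow_failure: true
    script:
    - ./configure
    - make -j2
    - make check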

> > GitLab Jobs and Pipelines
> > -------------------------
> >
> > GitLab CI is built around two major concepts: jobs and pipelines.  The
> > current GitLab CI configuration in QEMU uses jobs only (or putting it
> > another way, all jobs in a single pipeline stage).  Consider the
> > following job definition[9]:
> >
> >    build-tci:
> >     script:
> >     - TARGETS="aarch64 alpha arm hppa m68k microblaze moxie ppc64 s390x x86_64"
> >     - ./configure --enable-tcg-interpreter
> >          --target-list="$(for tg in $TARGETS; do echo -n ${tg}'-softmmu '; done)"
> >     - make -j2
> >     - make tests/boot-serial-test tests/cdrom-test tests/pxe-test
> >     - for tg in $TARGETS ; do
> >         export QTEST_QEMU_BINARY="${tg}-softmmu/qemu-system-${tg}" ;
> >         ./tests/boot-serial-test || exit 1 ;
> >         ./tests/cdrom-test || exit 1 ;
> >       done
> >     - QTEST_QEMU_BINARY="x86_64-softmmu/qemu-system-x86_64" ./tests/pxe-test
> >     - QTEST_QEMU_BINARY="s390x-softmmu/qemu-system-s390x" ./tests/pxe-test -m slow
> >
> > All the lines under "script" are performed sequentially.  It should be
> > clear that there's the possibility of breaking this down into multiple
> > stages, so that a build happens first, and then "common set of tests"
> > run in parallel.
> 
> We could do this, but we don't do it today, so we don't need
> to think about this at all to start with.
>

So, in your opinion, this is phase >= 1 material.  Noted.
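
For the record, a staged version of that same job could look roughly
like the following (just a sketch: the target list is trimmed down,
and the artifact paths are guesses, subject to the GitLab limits
mentioned right below):

   stages:
   - build
   - test

   build-x86_64:
    stage: build
    script:
    - ./configure --enable-tcg-interpreter --target-list=x86_64-softmmu
    - make -j2
    - make tests/boot-serial-test
    artifacts:
     paths:
     - x86_64-softmmu/qemu-system-x86_64
     - tests/boot-serial-test

   test-x86_64:
    stage: test
    script:
    - QTEST_QEMU_BINARY=x86_64-softmmu/qemu-system-x86_64 ./tests/boot-serial-test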

> > In theory, there's nothing that prevents an entire QEMU build
> > directory from being treated as an artifact.  In practice, there are
> > predefined limits on GitLab that prevent that from being possible,
> 
> ...so we don't need to worry about somehow defining some
> cut-down "build artefact" that we provide to the testing
> phase. Just do a build and test run as a single thing.
> We can always come back and improve later.
> 
> 
> Have you been able to investigate and confirm that we can
> get a gitlab-runner setup that works on non-x86 ? That seems
> to me like an important thing we should be confident about
> early before we sink too much effort into a gitlab-based
> solution.
>

I've successfully built gitlab-runner and run jobs on aarch64, ppc64le
and s390x.  The binaries are available here:

   https://cleber.fedorapeople.org/gitlab-runner/v12.4.1/

But, with the "shell" executor (given that Docker helper images are
not available for those architectures).  I don't think we'd have to
depend on GitLab providing those images though, it should be possible
to create them for different architectures and tweak the gitlab-runner
code to use different image references on those architectures.
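
For reference, registering such a runner with the shell executor is
roughly along these lines (URL, token, description and tags are
placeholders):

   $ sudo gitlab-runner register \
       --non-interactive \
       --url https://gitlab.com/ \
       --registration-token <PROJECT_REGISTRATION_TOKEN> \
       --executor shell \
       --description "aarch64 shell runner" \
       --tag-list aarch64,shell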

Does this answer this specific question?

Best,
- Cleber.

> thanks
> -- PMM
> 



