Re: Timeouts in CI jobs

qemu-devel

From:	Daniel P . Berrangé
Subject:	Re: Timeouts in CI jobs
Date:	Thu, 25 Apr 2024 14:27:17 +0100
User-agent:	Mutt/2.2.12 (2023-09-09)

On Wed, Apr 24, 2024 at 08:10:19PM +0200, Stefan Weil wrote:
> On 24.04.24 at 19:09, Daniel P. Berrangé wrote:
> 
> > On Wed, Apr 24, 2024 at 06:27:58PM +0200, Stefan Weil via wrote:
> > > I think the timeouts are caused by running too many parallel processes
> > > during testing.
> > > 
> > > The CI uses parallel builds:
> > > 
> > > make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS
> > Note that command is running both the compile and test phases of
> > the job. Overcommitting CPUs for the compile phase is a good
> > idea to keep CPUs busy while another process is waiting on
> > I/O, and is almost always safe to do.
> 
> 
> Thank you for your answer.
> 
> Overcommitting for the build is safe, but in my experience the positive
> effect is typically very small on modern hosts with fast disk I/O and large
> buffer caches.

That holds for typical developer machines, but the shared runners in
GitLab are fairly resource constrained by comparison, and their resources
are often under contention from other VMs in their infra.

> And there is also a negative impact because this requires scheduling with
> process switches.
> 
> Therefore I am not so sure that overcommitting is a good idea, especially
> not on cloud servers where the jobs are running in VMs.

As a point of reference, 'ninja' defaults to '$nproc + 2'.
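
To make the two overcommit rules concrete, here is a minimal sketch of
the arithmetic. The helper name job_count and the tool labels are
hypothetical; only the "+ 1" (CI make command) and "+ 2" (ninja) offsets
come from the discussion above.

```shell
#!/bin/sh
# Sketch: compute a parallel job count from a CPU count.
# "make_ci" mirrors the CI command quoted above (nproc + 1);
# "ninja" mirrors the default mentioned above (nproc + 2).
# job_count is a hypothetical helper, not part of QEMU's CI scripts.
job_count() {
    tool=$1
    cpus=$2
    case "$tool" in
        make_ci) echo $((cpus + 1)) ;;
        ninja)   echo $((cpus + 2)) ;;
        *)       echo "$cpus" ;;      # no overcommit for anything else
    esac
}

# Example: job counts for this machine (falls back to 1 CPU if nproc is missing).
cpus=$(nproc 2>/dev/null || echo 1)
echo "make -j$(job_count make_ci "$cpus"), ninja -j$(job_count ninja "$cpus")"
```

On a 4 CPU runner this gives 5 jobs for the make invocation and 6 for
ninja, so both overcommit only slightly.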

> > 
> > In the primary QEMU repo, we have a custom runner registered
> > that uses Azure based VMs. Not sure on the resources we have
> > configured for them offhand.
> 
> I was talking about the primary QEMU.
> 
> > The important thing there is that what you see for CI speed in
> > your fork repo is not necessarily a match for CI speed in QEMU
> > upstream repo.
> 
> I did not run tests in my GitLab fork because I still have to figure out how
> to do that.

It is quite simple:

  git remote add gitlab ssh://git@gitlab.com/<yourusername>/qemu
  git push gitlab -o ci.variable=QEMU_CI=2

This immediately runs all pipeline jobs. Use QEMU_CI=1 to not
start any jobs, and let you manually start the subset you are
interested in checking.
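
As a rough sketch of the two modes, the helper below only prints the
push command for a given QEMU_CI value rather than running it. It
assumes the remote is named "gitlab" as above and uses GitLab's
documented push-option form for CI variables (-o ci.variable=NAME=VALUE);
push_ci_cmd is a hypothetical name.

```shell
#!/bin/sh
# Sketch: build the push command for a given QEMU_CI level.
# Assumes a remote named "gitlab"; push_ci_cmd is a hypothetical helper
# that prints the command instead of executing it.
push_ci_cmd() {
    level=$1
    case "$level" in
        1|2) echo "git push gitlab -o ci.variable=QEMU_CI=$level" ;;
        *)   echo "unsupported QEMU_CI level: $level" >&2; return 1 ;;
    esac
}

push_ci_cmd 1   # pipeline is created, jobs are started manually
push_ci_cmd 2   # pipeline is created and all jobs start immediately
```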

> My test environment was an older (= slow) VM with 4 cores. I tested with
> different values for --num-processes. As expected higher values raised the
> number of timeouts. And the most interesting result was that
> `--num-processes 1` avoided timeouts, used less CPU time and did not
> increase the duration.
> 
> > > In my tests setting --num-processes to a lower value not only avoided
> > > timeouts but also reduced the processing overhead without increasing the
> > > runtime.
> > > 
> > > Could we run all tests with `--num-processes 1`?
> > The question is what impact that has on the overall job execution
> > time. A lot of our jobs are already quite long, which is bad for
> > the turnaround time of CI testing.  Reliable CI is arguably
> > the #1 priority though, otherwise developers cease trusting it.
> > We need to find the balance between avoiding timeouts and having
> > the shortest practical job time.  The TCI job you show above came
> > out at 22 minutes, which is not our worst job, so there is some
> > scope for allowing it to run longer with less parallelism.
> 
> The TCI job terminates after less than 7 minutes in my test runs with less
> parallelism.
> 
> Obviously there are tests which already do their own multithreading, and
> maybe other tests run single threaded. So maybe we need different values for
> `--num-processes` depending on the number of threads which the single tests
> use?

QEMU has different test suites too. The unit tests are likely safe
to run fully parallel, but the block I/O tests and qtests are likely
to benefit from serialization, since they all spawn many QEMU processes
as children that will consume multiple CPUs, so we probably don't need
to run the actual test suite in parallel to max out the CPUs. This still
needs testing under GitLab CI to prove the theory.
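
A minimal sketch of what a per-suite setting could look like, assuming
suites named "unit", "qtest" and "block" (the names and the
serial/parallel split are assumptions following the reasoning above,
not measured results). It only prints the meson invocations;
num_processes_for is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: choose a --num-processes value per test suite.
# Suite names and the serial/parallel split are assumptions, untested in CI.
num_processes_for() {
    suite=$1
    cpus=$2
    case "$suite" in
        unit)        echo "$cpus" ;;  # presumed safe to run fully parallel
        qtest|block) echo 1 ;;        # each test spawns QEMU child processes
        *)           echo "$cpus" ;;  # unknown suites keep full parallelism
    esac
}

# Print (rather than run) the meson invocations for a 4 CPU runner.
for suite in unit qtest block; do
    echo "meson test --suite $suite --num-processes $(num_processes_for "$suite" 4)"
done
```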

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
