Re: [PATCH 0/4] colo: Introduce resource agent and high-level test
From: Lukas Straub
Subject: Re: [PATCH 0/4] colo: Introduce resource agent and high-level test
Date: Wed, 18 Dec 2019 10:27:11 +0100
On Wed, 27 Nov 2019 22:11:34 +0100
Lukas Straub <address@hidden> wrote:
> On Fri, 22 Nov 2019 09:46:46 +0000
> "Dr. David Alan Gilbert" <address@hidden> wrote:
>
> > * Lukas Straub (address@hidden) wrote:
> > > Hello Everyone,
> > > These patches introduce a resource agent for use with the Pacemaker CRM
> > > and a high-level test utilizing it for testing qemu COLO.
> > >
> > > The resource agent manages qemu COLO including continuous replication.
> > >
> > > Currently the second test case (where the peer qemu is frozen) fails
> > > on primary failover, because qemu hangs while removing the
> > > replication-related block nodes.
> > > Note that this also happens in real-world tests when cutting power to
> > > the peer host, so this needs to be fixed.
> >
> > Do you understand why that happens? Is this it's trying to finish a
> > read/write to the dead partner?
> >
> > Dave
>
> I haven't looked into it too closely yet, but it's often hanging in
> bdrv_flush() while removing the replication blockdev, and that's
> probably because the NBD client waits for a reply. So I tried the
> workaround below, which actively kills the TCP connection; with it the
> test passes, though I haven't tested it in the real world yet.
>
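[Editor's note: the workaround patch itself is not preserved in this archive chunk. As an illustration of the general technique being described (not qemu's actual patch, which would be C inside the block layer), an abortive TCP close can be forced by setting SO_LINGER with a zero timeout, so the kernel sends an RST instead of a FIN and the peer's pending reads fail immediately rather than waiting for a reply:]

```python
import socket
import struct

def kill_tcp_connection(sock: socket.socket) -> None:
    """Abort a TCP connection immediately instead of closing it gracefully.

    With SO_LINGER set to l_onoff=1, l_linger=0, close() discards any
    unsent data and sends a TCP RST. The remote end's blocked recv()/send()
    calls then fail at once (typically ECONNRESET) instead of hanging,
    which is the effect needed when the peer host is frozen or powered off.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    sock.close()
```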
In the real cluster, qemu sometimes even hangs while connecting to QMP
(after remote poweroff). But I currently don't have the time to look
into it.
Still, a failing test is better than no test. Could we mark this test
as known-bad and fix this issue later? How should I mark it as
known-bad? By tag? Or with a warning in the log?
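[Editor's note: one generic way to express "known-bad" in a Python test suite is an expected-failure marker, sketched below with the standard-library unittest module; this is an illustration of the concept, not necessarily how qemu's own test harness tags tests, and `failover_with_frozen_peer` is a hypothetical stand-in for the real test action:]

```python
import unittest

def failover_with_frozen_peer() -> bool:
    """Hypothetical stand-in: trigger primary failover while the peer
    qemu is frozen. Returns True on success."""
    return False  # currently fails: qemu hangs in bdrv_flush()

class ColoKnownBad(unittest.TestCase):
    # Known-bad: remove the decorator once the bdrv_flush() hang is fixed.
    @unittest.expectedFailure
    def test_frozen_peer_failover(self):
        self.assertTrue(failover_with_frozen_peer())
```

With this marker the run stays green while the bug is open, and the case flips to an "unexpected success" (a failure of the run) once the underlying issue is fixed, prompting removal of the decorator.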
Regards,
Lukas Straub