Re: [PATCH 0/4] colo: Introduce resource agent and high-level test
From: Lukas Straub
Subject: Re: [PATCH 0/4] colo: Introduce resource agent and high-level test
Date: Wed, 27 Nov 2019 22:11:34 +0100
On Fri, 22 Nov 2019 09:46:46 +0000
"Dr. David Alan Gilbert" <address@hidden> wrote:
> * Lukas Straub (address@hidden) wrote:
> > Hello Everyone,
> > These patches introduce a resource agent for use with the Pacemaker CRM and a
> > high-level test utilizing it for testing qemu COLO.
> >
> > The resource agent manages qemu COLO including continuous replication.
> >
> > Currently the second test case (where the peer qemu is frozen) fails on primary
> > failover, because qemu hangs while removing the replication related block nodes.
> > Note that this also happens in real world test when cutting power to the peer
> > host, so this needs to be fixed.
>
> Do you understand why that happens? Is it trying to finish a
> read/write to the dead partner?
>
> Dave
I haven't looked into it too closely yet, but it often hangs in bdrv_flush()
while removing the replication blockdev, probably because the NBD client is
waiting for a reply from the dead peer. So I tried the workaround below, which
actively kills the TCP connection. With it the test passes, though I haven't
tested it in the real world yet.

A proper solution would probably be a "force" parameter for blockdev-del that
skips all flushing and aborts all in-flight I/O. Or we could add a timeout to
the NBD client.
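
To illustrate the timeout idea only: the sketch below is not QEMU code (the NBD client lives in QEMU's C block layer), and recv_with_timeout is a hypothetical helper name. It shows, on a plain local socket pair, how a bounded wait turns a reply that never arrives into an error the caller can handle instead of a hang.

```python
import socket


def recv_with_timeout(sock, nbytes, timeout_s):
    """Return received bytes, or None if nothing arrives within timeout_s.

    Hypothetical helper for illustration; not part of QEMU or the patch.
    """
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # The peer stayed silent: report it instead of blocking forever.
        return None


# Stand-in for the client/peer connection: a connected local socket pair.
client, peer = socket.socketpair()
try:
    # The peer sends nothing, as a frozen or powered-off host would.
    silent = recv_with_timeout(client, 16, 0.2)
    # The peer answers, so the same call returns the data.
    peer.sendall(b"reply")
    answered = recv_with_timeout(client, 16, 0.2)
finally:
    client.close()
    peer.close()
```

With a bound like this, the caller could fail the pending request and continue with the blockdev removal rather than waiting indefinitely.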
Regards,
Lukas Straub
diff --git a/scripts/colo-resource-agent/colo b/scripts/colo-resource-agent/colo
index 5fd9cfc0b5..62210af2a1 100755
--- a/scripts/colo-resource-agent/colo
+++ b/scripts/colo-resource-agent/colo
@@ -935,6 +935,7 @@ def qemu_colo_notify():
        and HOSTNAME == str.strip(OCF_RESKEY_CRM_meta_notify_master_uname):
         fd = qmp_open()
         peer = qmp_get_nbd_remote(fd)
+        os.system("sudo ss -K dst %s dport = %s" % (peer, NBD_PORT))
         if peer == str.strip(OCF_RESKEY_CRM_meta_notify_stop_uname):
             if qmp_check_resync(fd) != None:
                 qmp_cancel_resync(fd)
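
The added line above goes through os.system with a formatted string, so the peer name is interpreted by a shell. A possible variant, sketched here under the assumption that the same `ss -K` filter is wanted, passes an argument list to subprocess instead. The names build_ss_kill_cmd and kill_nbd_connection are hypothetical and not part of the patch.

```python
import subprocess


def build_ss_kill_cmd(peer, port):
    """Build the argv for the same `ss -K` filter used in the workaround.

    Hypothetical helper; an argument list means no shell parses `peer`.
    """
    return ["sudo", "ss", "-K", "dst", str(peer), "dport", "=", str(port)]


def kill_nbd_connection(peer, port, timeout_s=10):
    """Run ss to close TCP connections to peer:port; return its exit code.

    Hypothetical helper. The timeout keeps the resource agent from
    blocking if ss (or sudo) itself stalls.
    """
    result = subprocess.run(
        build_ss_kill_cmd(peer, port),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout_s,
        check=False,  # leave the decision on failure to the caller
    )
    return result.returncode
```

In the agent this would replace the os.system() call with kill_nbd_connection(peer, NBD_PORT).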