Re: Network connection with COLO VM

qemu-devel

From:	Zhang, Chen
Subject:	Re: Network connection with COLO VM
Date:	Wed, 4 Dec 2019 16:32:37 +0800
User-agent:	Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2


On 12/3/2019 9:25 PM, Dr. David Alan Gilbert wrote:

* Daniel Cho (address@hidden) wrote:

Hi Dave,

We could use the existing interface to add the netfilter and chardev; it
might not have the problem you described.

However, if the netfilter and chardev must be on the primary from the start,
does that mean we cannot enable the COLO feature on a VM dynamically?

I wasn't expecting that to be possible - I'd expect you to be able
to start in a state that looks the same as a primary+failed secondary;
but I'm not sure.


Current COLO (with Lukas's patch) can support setting up a COLO system
dynamically.

This state is the same as when the secondary has triggered failover: the
primary node needs to find a new secondary node to form a new COLO system.

We tried changing this chardev so that the primary VM does not get stuck
waiting for the secondary VM.

-chardev socket,id=compare1,host=127.0.0.1,port=9004,server,wait \

to

-chardev socket,id=compare1,host=127.0.0.1,port=9004,server,nowait \

But this makes the primary VM's network stop working (it can't get an IP)
until it starts connecting to the secondary VM.


I think you need to check whether port 9004 is already connected to the paired node.
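
One way to make that check without touching the socket itself is to read the kernel's connection table. The following is only a sketch, assuming a Linux host and an IPv4 listener as in the `-chardev` line above; the function name `established_peers` is made up for illustration. It deliberately does not connect to port 9004, since a test connection would itself occupy the chardev.

```python
ESTABLISHED = "01"  # TCP state code for ESTABLISHED in /proc/net/tcp


def established_peers(proc_net_tcp, local_port):
    """Return the remote 'HEXIP:HEXPORT' of each ESTABLISHED connection on local_port."""
    peers = []
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        # Addresses are printed as hex 'IP:PORT'; only the port is compared here.
        port = int(local.rsplit(":", 1)[1], 16)
        if port == local_port and state == ESTABLISHED:
            peers.append(remote)
    return peers
```

On the primary host this could be called as `established_peers(open("/proc/net/tcp").read(), 9004)`; an empty list would suggest that nothing is connected to the compare chardev yet.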

I'm not sure of the answer to this; I've not tried doing it - I'm not
sure anyone has!
But, the colo components do track the state of tcp connections, so I'm
expecting that they have to already exist to have the state of those
connections available for when you start the secondary.


Yes, you are right.

For this state, we don't need to sync the state of tcp connections, because
after failover (or with just a solo COLO primary node) we have emptied all the
tcp connection state in the COLO module.

In the process of building a new COLO pair, we will sync all the VM state to
the secondary node and rebuild the connection tracking in the COLO module.

Also, the primary VM takes very different times to boot with and without
netfilter / chardev:
Without netfilter / chardev: about 1 min
With netfilter / chardev: about 5 mins
Is this an issue?

that sounds like it needs investigating.

Dave

Yes. In previous COLO use cases, we needed to make the primary node and
secondary node boot at the same time.

I didn't expect such a big difference from netfilter/chardev.

I think you can try booting without netfilter/chardev and, after the VM has
booted, re-build the netfilter/chardev related setup with the chardev as
server,nowait.
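
As a rough illustration of adding the chardev after boot, the snippet below builds a QMP `chardev-add` command equivalent to the `server,nowait` socket discussed earlier in this thread. It is a sketch only: the helper name `chardev_add_nowait` is invented here, it assumes the VM was started with a QMP monitor, and it does not cover the netfilter and colo-compare objects, which would have to be added separately.

```python
import json


def chardev_add_nowait(chardev_id, host, port):
    """Build a QMP chardev-add command for a listening socket that does not wait."""
    return {
        "execute": "chardev-add",
        "arguments": {
            "id": chardev_id,
            "backend": {
                "type": "socket",
                "data": {
                    # QMP expects the inet port as a string.
                    "addr": {"type": "inet", "data": {"host": host, "port": str(port)}},
                    "server": True,
                    "wait": False,  # corresponds to "nowait" on the command line
                },
            },
        },
    }


# Values taken from the -chardev line quoted in this thread.
print(json.dumps(chardev_add_nowait("compare1", "127.0.0.1", 9004)))
```

The printed JSON would be sent over the QMP socket after the usual `qmp_capabilities` handshake.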



Thanks

Zhang Chen

Best regards,
Daniel Cho


On Mon, Dec 2, 2019 at 5:58 PM, Dr. David Alan Gilbert <address@hidden> wrote:

* Daniel Cho (address@hidden) wrote:

Hi Zhang,

We are using the qemu-4.1.0 release in this case.

I think we need to use block mirror to sync the disk to the secondary node
first, then stop the primary VM and build the COLO system.

At the moment of stopping, you need to add some netfilter and chardev socket
nodes for COLO; maybe you need to re-check this part.


Our test already followed those steps. Let me describe the test flow and the
issues in detail.


Step 1:

Create the primary VM without any netfilter or chardev for COLO, and ping the
primary VM continually from another host.


Step 2:

Create the secondary VM (with the same devices/drives as the primary VM), and
do the block mirror sync (ping to the primary VM works normally).


Step 3:

After the block mirror sync finishes, add the netfilter and chardev to the
primary VM and secondary VM for COLO (*can't* ping the primary VM, but those
packets are received later).


Step 4:

Start migrating the primary VM to the secondary VM; the primary VM and
secondary VM are both running (ping to the primary VM works, and the packets
from step 3 are received).




Between Step 3 and Step 4, it takes 10~20 seconds in our environment.

I can imagine this issue (delayed reply packets) is caused by the COLO proxy
being in a temporary state while it is set up, but we think 10~20 seconds
might be a little long. (If the primary VM is already doing some jobs, it
might lose data.)


Could we reduce this time? Or does the delay depend on the VM?
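
To compare the length of this pause across VMs or configurations, the arrival times of the ping replies can be reduced to a single number. The helper below is a minimal sketch with a made-up name (`longest_gap`); it assumes the reply timestamps, in seconds, have already been collected by whatever tool is doing the pinging.

```python
def longest_gap(reply_times):
    """Return (gap_seconds, gap_start) for the longest pause between replies.

    If several pauses are equally long, the latest one is returned.
    """
    times = sorted(reply_times)
    if len(times) < 2:
        return 0.0, None
    # Pair each reply with the next one and measure the time between them.
    gaps = [(later - earlier, earlier) for earlier, later in zip(times, times[1:])]
    return max(gaps)
```

For replies arriving at 0, 1, 2, 3, 18 and 19 seconds, this reports a 15 second pause starting at second 3, which would correspond to the window between Step 3 and Step 4.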

I think you need to set up the netfilter and chardev on the primary at
the start;  the filter contains the state of the TCP connections working
with the VM, so adding it later can't gain that state for existing
connections.

Dave

Best Regard,

Daniel Cho.



On Sat, Nov 30, 2019 at 2:04 AM, Zhang, Chen <address@hidden> wrote:




*From:* Daniel Cho <address@hidden>
*Sent:* Friday, November 29, 2019 10:43 AM
*To:* Zhang, Chen <address@hidden>
*Cc:* Dr. David Alan Gilbert <address@hidden>; address@hidden; address@hidden
*Subject:* Re: Network connection with COLO VM



Hi David,  Zhang,



Thanks for replying to my question.

We now know why this issue occurs. As you said, the COLO VM's network needs
colo-proxy to control packets, so the filter must be set on the guest's
interface to solve the problem.

But we found another problem: when we enable the fault-tolerance feature on
the guest (primary VM running, secondary VM paused), the guest's network does
not respond to any request for a while (about 20~30 secs in our environment)
after the secondary VM starts running.

Is this normal, or a known issue?

In our test, we let the primary VM run for a while, then create the secondary
VM to enable the COLO feature.



Hi Daniel,



Happy to hear you have solved the ssh disconnection issue.

Do you use Lukas’s patch in this case?

I think we need to use block mirror to sync the disk to the secondary node
first, then stop the primary VM and build the COLO system.

At the moment of stopping, you need to add some netfilter and chardev socket
nodes for COLO; maybe you need to re-check this part.



Best Regard,

Daniel Cho



On Thu, Nov 28, 2019 at 9:26 AM, Zhang, Chen <address@hidden> wrote:

-----Original Message-----
From: Dr. David Alan Gilbert <address@hidden>
Sent: Wednesday, November 27, 2019 6:51 PM
To: Daniel Cho <address@hidden>; Zhang, Chen
<address@hidden>; address@hidden
Cc: address@hidden
Subject: Re: Network connection with COLO VM

* Daniel Cho (address@hidden) wrote:

Hello everyone,

Can we ssh to a COLO VM (meaning the PVM & SVM are both running)?

Lets cc in Zhang Chen and Lukas Straub.

Thanks Dave.

SSH connects to the COLO VM for a while, but then it disconnects with the
error
*client_loop: send disconnect: Broken pipe*

It seems the COLO VM cannot keep the network session.

Is this a known issue?

That sounds like the COLO proxy is getting upset; it's supposed to compare
packets sent by the primary and secondary and only send one to the outside
- you shouldn't be talking directly to the guest, but always via the proxy.
See docs/colo-proxy.txt

Hi Daniel,

I have tried ssh to a COLO guest for 8 hours, and this issue did not occur.
Please check your network/qemu configuration.
But I found another problem that may be related to this issue: if there is no
network communication for a period of time (maybe 10 min), the first message
sent to the guest may be delayed (maybe 1-5 sec). I will try to fix it when I
have time.

Thanks
Zhang Chen

Dave

Best Regard,
Daniel Cho

--
Dr. David Alan Gilbert / address@hidden / Manchester, UK

