qemu-devel

Re: Thoughts on VM fence infrastructure


From: Felipe Franciosi
Subject: Re: Thoughts on VM fence infrastructure
Date: Tue, 1 Oct 2019 10:46:24 +0000

Hi Daniel!


> On Oct 1, 2019, at 11:31 AM, Daniel P. Berrangé <address@hidden> wrote:
> 
> On Tue, Oct 01, 2019 at 09:56:17AM +0000, Felipe Franciosi wrote:
>> 
>> 
>>> On Oct 1, 2019, at 9:23 AM, Dr. David Alan Gilbert <address@hidden> wrote:
>>> 
>>> * Felipe Franciosi (address@hidden) wrote:
>>>> 
>>>> 
>>>>> On Sep 30, 2019, at 6:59 PM, Dr. David Alan Gilbert <address@hidden> 
>>>>> wrote:
>>>>> 
>>>>> * Felipe Franciosi (address@hidden) wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Sep 30, 2019, at 6:11 PM, Dr. David Alan Gilbert <address@hidden> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>> * Felipe Franciosi (address@hidden) wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Sep 30, 2019, at 5:03 PM, Dr. David Alan Gilbert <address@hidden> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> * Felipe Franciosi (address@hidden) wrote:
>>>>>>>>>> Hi David,
>>>>>>>>>> 
>>>>>>>>>>> On Sep 30, 2019, at 3:29 PM, Dr. David Alan Gilbert 
>>>>>>>>>>> <address@hidden> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> * Felipe Franciosi (address@hidden) wrote:
>>>>>>>>>>>> Heyall,
>>>>>>>>>>>> 
>>>>>>>>>>>> We have a use case where a host should self-fence (and all VMs 
>>>>>>>>>>>> should
>>>>>>>>>>>> die) if it doesn't hear back from a heartbeat within a certain time
>>>>>>>>>>>> period. Lots of ideas were floated around where libvirt could take
>>>>>>>>>>>> care of killing VMs or a separate service could do it. The concern
>>>>>>>>>>>> with those is that various failures could lead to _those_ services
>>>>>>>>>>>> being unavailable and the fencing wouldn't be enforced as it 
>>>>>>>>>>>> should.
>>>>>>>>>>>> 
>>>>>>>>>>>> Ultimately, it feels like Qemu should be responsible for this
>>>>>>>>>>>> heartbeat and exit (or execute a custom callback) on timeout.
>>>>>>>>>>> 
>>>>>>>>>>> It doesn't feel doing it inside qemu would be any safer;  something
>>>>>>>>>>> outside QEMU can forcibly emit a kill -9 and qemu *will* stop.
>>>>>>>>>> 
>>>>>>>>>> The argument above is that we would have to rely on this external
>>>>>>>>>> service being functional. Consider the case where the host is
>>>>>>>>>> dysfunctional, with this service perhaps crashed and a corrupt
>>>>>>>>>> filesystem preventing it from restarting. The VMs would never die.
>>>>>>>>> 
>>>>>>>>> Yeh that could fail.
>>>>>>>>> 
>>>>>>>>>> It feels like a Qemu timer-driven heartbeat check and calls abort() /
>>>>>>>>>> exit() would be more reliable. Thoughts?
>>>>>>>>> 
>>>>>>>>> OK, yes; perhaps using a timer_create and telling it to send a fatal
>>>>>>>>> signal is pretty solid; it would take the kernel to do that once it's
>>>>>>>>> set.
>>>>>>>> 
>>>>>>>> I'm confused about why the kernel needs to be involved. If this is a
>>>>>>>> timer off the Qemu main loop, it can just check on the heartbeat
>>>>>>>> condition (which should be customisable) and call abort() if that's
>>>>>>>> not satisfied. If you agree on that I'd like to talk about how that
>>>>>>>> check could be made customisable.
>>>>>>> 
>>>>>>> There are times when the main loop can get blocked even though the CPU
>>>>>>> threads can be running and can in some configurations perform IO
>>>>>>> even without the main loop (I think!).
>>>>>> 
>>>>>> Ah, that's a very good point. Indeed, you can perform IO in those
>>>>>> cases, especially when using vhost devices.
>>>>>> 
>>>>>>> By setting a timer in the kernel that sends a signal to qemu, the kernel
>>>>>>> will send that signal however broken qemu is.
>>>>>> 
>>>>>> Got you now. That's probably better. Do you reckon a signal is
>>>>>> preferable over SIGEV_THREAD?
>>>>> 
>>>>> Not sure; probably the safest is getting the kernel to SIGKILL it - but
>>>>> that's a complete nightmare to debug - your process just goes *pop*
>>>>> with no apparent reason why.
>>>>> I've not used SIGEV_THREAD - it looks promising though.
>>>> 
>>>> I'm worried that SIGEV_THREAD could be a bit heavyweight (if it fires
>>>> up a new thread each time). On the other hand, as you said, SIGKILL
>>>> makes it harder to debug.
>>>> 
>>>> Also, asking the kernel to defer the SIGKILL (ie. updating the timer)
>>>> needs to come from Qemu itself (eg. a timer in the main loop,
>>>> something we already ruled unsuitable, or a qmp command which
>>>> constitutes an external dependency that we also ruled undesirable).
>>> 
>>> OK, there's two reasons I think this isn't that bad/is good:
>>>  a) It's an external dependency - but if it fails the result is the
>>>     system fails, rather than the system keeps on running; so I think
>>>     that's the balance you were after; it's the opposite from
>>>     the external watchdog.
>> 
>> Right. I like where you are coming from. And I think a mix of these
>> may be the best way forwards. I'll elaborate on it below.
>> 
>>> 
>>>  b) You need some external system anyway to tell QEMU when it's
>>>     OK - what's your definition of a failed system?
>> 
>> The feature is targeted at providing a self-fencing mechanism for
>> Qemu. If a host is unreachable for whatever reason (eg. sshd down, ovs
>> died, oomkiller took services out, physical network failure), it
>> should guarantee that VMs won't be running after a certain amount of
>> time. To your point, if this external software doesn't come in and
>> touch the file, that's because it can't reach the host or it wants the
>> host to self-fence. The qualifying Qemus should therefore be
>> considered dead after a "deadline" period (since the last time the
>> control file was touched).
> 
> This all sounds reasonable, but I don't see the value in doing this
> work in QEMU.

I'll elaborate below.

>  
> 
>> 
>>> 
>>>> What if, when self-fencing is enabled, Qemu kicks off a new thread
>>>> from the start which does nothing but periodically wake up, verify the
>>>> heartbeat condition and log()+abort() if required? (Then we wouldn't
>>>> need the kernel timer.)
>>> 
>>> I'd make that thread bump the kernel timer along.
>> 
>> I think combining the thread's logic with the kernel timer makes the
>> whole thing a lot more solid. See below.
>> 
>>> 
>>>>> 
>>>>>> I'm still wondering how to make this customisable so that different
>>>>>> types of heartbeat could be implemented (preferably without creating
>>>>>> external dependencies per discussion above). Thoughts welcome.
>>>>> 
>>>>> Yes, you need something to enable it, and some safe way to retrigger
>>>>> the timer.  A qmp command marked as 'oob' might be the right way -
>>>>> another QMP command can't block it.
>>>> 
>>>> This qmp approach is slightly different than the external dependency
>>>> that itself kills Qemu; if it doesn't run, then Qemu dies because the
>>>> kernel timer is not updated. But this is also a heavyweight approach.
>>>> We are talking about a service that needs to frequently connect to all
>>>> running VMs on a host to reset the timer.
>>>> 
>>>> But it does allow for the customisable heartbeat: the logic behind
>>>> what triggers the command is completely flexible.
>>>> 
>>>> Thinking about this idea of a separate Qemu thread, one thing that
>>>> came to mind is this:
>>>> 
>>>> qemu -fence heartbeat=/path/to/file,deadline=60[,recheck=5]
>>>> 
>>>> Qemu could fire up a thread that stat()s <file> (every <recheck>
>>>> seconds or on a default interval) and log()+abort() the whole process
>>>> if the last modification time of the file is older than <deadline>. If
>>>> <file> goes away (ie. stat() gives ENOENT), then it either fences
>>>> immediately or ignores it, not sure which is more sensible.
>>>> 
>>>> Thoughts?
>>> 
>>> As above; I'm OK with using a file with that; but I'd make that thread
>>> bump the kernel timer along; if that thread gets stuck somehow the
>>> kernel still nukes your process.
>> 
>> 
>> Awesome. So check this out:
>> 
>> qemu -fence heartbeat=/path/to/file,deadline=60[,recheck=5][,harddeadline=61]
>> 
>> We can default <harddeadline> to <deadline+1> and enforce that:
>> - <deadline> is a multiple of <recheck>.
>> - <harddeadline> is greater than <deadline>.
>> 
>> When <deadline> expires, we can log() + abort(), but if <harddeadline>
>> expires, we can rest assured the kernel will come around and SIGKILL
>> Qemu. If there's demand for it, this can later be enhanced by adding
>> more parameters which set the fence thread scheduling priority, &c.
>> 
>> If that sounds ok I'll send an RFC as soon as I get a chance and we
>> can take it from there.
> 
> I don't really see the point in doing any of this in QEMU, as opposed to
> using the general purpose self-fencing features of the host OS. As an
> example, hardware watchdogs are a built-in feature of systemd
> 
>   "To make use of the hardware watchdog it is sufficient to set the
>    RuntimeWatchdogSec= option in /etc/systemd/system.conf. It defaults
>    to 0 (i.e. no hardware watchdog use). Set it to a value like 20s 
>    and the watchdog is enabled. After 20s of no keep-alive pings the 
>    hardware will reset itself. Note that systemd will send a ping to
>    the hardware at half the specified interval, i.e. every 10s. And 
>    that's already all there is to it. By enabling this single, simple
>    option you have turned on supervision by the hardware of systemd 
>    and the kernel beneath it.[2]"
> 
>    http://0pointer.de/blog/projects/watchdog.html
>  
> 


There are several points which favour adding this to Qemu:
- Not all environments use systemd.
- HW watchdogs always reboot the host, which is too drastic.
- You may not want to protect all VMs in the same way.

> When a host becomes non-responsive, for example, due to a network error
> I would not have confidence in QEMU being reliable enough to trigger
> any self-fencing code. I've seen many bug reports where QEMU has entirely
> hung due to non-responsive network based storage. 

Completely agree with you. There are various failures where Qemu
itself wouldn't be able to self-fence, but there are many in which it
would. There's also the fact that you may not want to protect all VMs
equally. To that point, nothing stops "harder" deadlines from being used.
The idea being discussed already involves a two-level protection model
where Qemu tries to suicide, but if it fails the kernel will do it.
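
To make that concrete, here is a rough standalone sketch of how those
two levels could fit together. This is not QEMU code; the heartbeat
path and the deadline/recheck/harddeadline values simply mirror the
hypothetical "-fence" parameters proposed earlier in the thread:

  /*
   * Sketch only: two-level fence as discussed above.
   * Level 1: a thread stat()s the heartbeat file every <recheck> seconds
   *          and log()+abort()s if its mtime is older than <deadline>.
   * Level 2: a POSIX timer delivers SIGKILL after <harddeadline>; the
   *          thread re-arms it on every healthy pass, so the kernel still
   *          kills the process if the thread itself gets stuck.
   * Build: cc fence.c -o fence -lpthread -lrt
   */
  #include <pthread.h>
  #include <signal.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/stat.h>
  #include <time.h>
  #include <unistd.h>

  static const char *heartbeat = "/path/to/file";  /* hypothetical path */
  static const time_t deadline = 60;      /* soft deadline: abort()  */
  static const time_t recheck = 5;        /* check interval          */
  static const time_t harddeadline = 61;  /* hard deadline: SIGKILL  */

  static void rearm(timer_t t, time_t secs)
  {
      struct itimerspec its = { .it_value.tv_sec = secs };

      timer_settime(t, 0, &its, NULL);
  }

  static void *fence_thread(void *opaque)
  {
      timer_t killer = *(timer_t *)opaque;

      for (;;) {
          struct stat st;

          /* Level 1: is the heartbeat file stale?  (ENOENT is ignored
           * here; whether to fence on it is an open question above.) */
          if (stat(heartbeat, &st) == 0 &&
              time(NULL) - st.st_mtime > deadline) {
              fprintf(stderr, "fence: heartbeat expired, aborting\n");
              abort();
          }

          /* Level 2: push the kernel's SIGKILL timer out again. */
          rearm(killer, harddeadline);
          sleep(recheck);
      }
      return NULL;
  }

  int main(void)
  {
      struct sigevent sev;
      pthread_t tid;
      timer_t killer;

      /* Kernel-armed timer delivering SIGKILL: cannot be caught, so
       * the process dies however wedged userspace is. */
      memset(&sev, 0, sizeof(sev));
      sev.sigev_notify = SIGEV_SIGNAL;
      sev.sigev_signo = SIGKILL;
      timer_create(CLOCK_MONOTONIC, &sev, &killer);
      rearm(killer, harddeadline);

      pthread_create(&tid, NULL, fence_thread, &killer);
      pthread_join(tid, NULL);  /* in QEMU this would run alongside the VM */
      return 0;
  }

The external side then reduces to touching the heartbeat file more
often than <deadline> seconds; if that stops, the thread abort()s, and
if the thread itself gets wedged, the kernel timer still delivers
SIGKILL.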

With that in mind, the libvirt API could actually offer a third level
of protection which sets a HW watchdog (via systemd or otherwise). That
would be a host-wide setting rather than a per-VM one, but still part
of the same offering.
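
For that third level, the host-wide piece would boil down to the single
setting from the blog post Daniel quoted, along the lines of:

  # /etc/systemd/system.conf (third level: host-wide hardware watchdog)
  [Manager]
  RuntimeWatchdogSec=20s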


> IMHO doing this at the host OS level is going to be more reliable in
> terms of detecting the problem in the first place, as well as more
> reliable in taking the action - its very difficult for a hardware CPU
> reset to fail to work.

Absolutely, but it's a very drastic measure that:
- May be unnecessary.
- Will fence everything even when perhaps only some VMs need protection.

What are your thoughts on this 3-level approach?
1) Qemu tries to log() + abort() (deadline)
2) Kernel sends SIGKILL (harddeadline)
3) HW watchdog kicks in (harderdeadline)

(Better names welcome.)

F.

> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

