Re: [Qemu-devel] [PATCH COLO-Frame v15 00/38] COarse-grain LOck-stepping
From: Dr. David Alan Gilbert
Subject: Re: [Qemu-devel] [PATCH COLO-Frame v15 00/38] COarse-grain LOck-stepping (COLO) Virtual Machines for Non-stop Service (FT)
Date: Mon, 29 Feb 2016 09:47:15 +0000
User-agent: Mutt/1.5.24 (2015-08-30)
* Hailiang Zhang (address@hidden) wrote:
> On 2016/2/27 0:36, Dr. David Alan Gilbert wrote:
> >* Dr. David Alan Gilbert (address@hidden) wrote:
> >>* zhanghailiang (address@hidden) wrote:
> >>>From: root <address@hidden>
> >>>
> >>>This is the 15th version of COLO (still only supporting periodic checkpoint).
> >>>
> >>>Here is only the COLO frame part; you can get the whole code from github:
> >>>https://github.com/coloft/qemu/commits/colo-v2.6-periodic-mode
> >>>
> >>>There are few changes in this series except for the network-related part.
> >>
> >>I was looking at the time the guest is paused during COLO and
> >>was surprised to find one of the larger chunks was the time to reset
> >>the guest before loading each checkpoint; I've traced it part way, the
> >>biggest contributors for my test VM seem to be:
> >>
> >> 3.8ms pcibus_reset: VGA
> >> 1.8ms pcibus_reset: virtio-net-pci
> >> 1.5ms pcibus_reset: virtio-blk-pci
> >> 1.5ms qemu_devices_reset: piix4_reset
> >> 1.1ms pcibus_reset: piix3-ide
> >> 1.1ms pcibus_reset: virtio-rng-pci
> >>
> >>I've not looked deeper yet, but some of these are very silly;
> >>I'm running with -nographic so why it's taking 3.8ms to reset VGA is
> >>going to be interesting.
> >>Also, my only block device is the virtio-blk, so while I understand the
> >>standard PC machine has the IDE controller, it's not clear why it takes
> >>over a ms to reset an unused device.
> >
> >OK, so I've dug a bit deeper, and it appears that it's the changes in
> >PCI BARs that actually take the time; every time we do a reset we
> >reset all the BARs, which triggers a pci_update_mappings and ends up
> >doing a memory_region_del_subregion.
> >Then we load the config space of the PCI device during vmstate_load,
> >and this recreates all the mappings again.
> >
> >I'm not sure what the fix is, but that sounds like it would
> >speed up the checkpoints usefully if we can avoid the map/remap when
> >they're the same.
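> >
> >One thing that might be worth trying (an untested sketch, not a real
> >patch): batch the reset and the reload inside a single memory
> >transaction, so that the BAR teardown at reset and the identical re-map
> >during vmstate load collapse into one topology update.
> >memory_region_transaction_begin()/commit() are existing APIs; whether
> >the reset/load path tolerates being wrapped like this is the
> >speculative part:
> >
> >static int colo_reset_and_load(QEMUFile *f)
> >{
> >    int ret;
> >
> >    memory_region_transaction_begin();
> >    qemu_system_reset(VMRESET_SILENT);  /* BAR unmaps are deferred */
> >    ret = qemu_loadvm_state(f);         /* identical remaps deferred too */
> >    memory_region_transaction_commit(); /* one flatview rebuild for the
> >                                         * net (often no-op) change */
> >    return ret;
> >}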
> >
>
> Interesting, and thanks for your report.
>
> We already know qemu_system_reset() is a time-consuming function, and we
> shouldn't call it here; but if we don't, there is a bug, which we
> reported before in the previous COLO series. Below is a copy of the
> related patch comment:
>
> COLO VMstate: Load VM state into qsb before restore it
>
> We should not destroy the state of the secondary until we receive the
> whole state from the primary, in case the primary fails in the middle of
> sending the state; so here we cache the device state in the Secondary
> before restoring it.
>
> Besides, we should call qemu_system_reset() before loading the VM state,
> which can ensure the data is intact.
> Note: if we discard qemu_system_reset(), there will be some odd errors.
> For example, qemu on the slave side crashes and reports:
>
> KVM: entry failed, hardware error 0x7
> EAX=00000000 EBX=0000e000 ECX=00009578 EDX=0000434f
> ESI=0000fc10 EDI=0000434f EBP=00000000 ESP=00001fca
> EIP=00009594 EFL=00010246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0040 00000400 0000ffff 00009300
> CS =f000 000f0000 0000ffff 00009b00
> SS =434f 000434f0 0000ffff 00009300
> DS =434f 000434f0 0000ffff 00009300
> FS =0000 00000000 0000ffff 00009300
> GS =0000 00000000 0000ffff 00009300
> LDT=0000 00000000 0000ffff 00008200
> TR =0000 00000000 0000ffff 00008b00
> GDT= 0002dcc8 00000047
> IDT= 00000000 0000ffff
> CR0=00000010 CR2=ffffffff CR3=00000000 CR4=00000000
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> DR6=00000000ffff0ff0 DR7=0000000000000400
> EFER=0000000000000000
> Code=c0 74 0f 66 b9 78 95 00 00 66 31 d2 66 31 c0 e9 47 e0 fb 90 <f3> 90 fa fc 66 c3 66 53 66 89 c3
> ERROR: invalid runstate transition: 'internal-error' -> 'colo'
>
> The reason is that some parts of the device state are ignored when
> saving the device state to the slave, if the corresponding data still
> holds its initial value, such as 0. But the device state in the slave
> may no longer be at its initial value; after a loop of checkpoints the
> values of the device state become inconsistent.
> This will happen when the PVM reboots, or when the SVM runs ahead of
> the PVM in the startup process.
> Signed-off-by: zhanghailiang <address@hidden>
> Signed-off-by: Yang Hongyang <address@hidden>
> Signed-off-by: Gonglei <address@hidden>
> Reviewed-by: Dr. David Alan Gilbert <address@hidden>
>
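> In code, the resulting load path on the secondary looks roughly like
> this (a simplified sketch; the buffering uses QEMUSizedBuffer/qsb as in
> the patch title, error handling is omitted, and the names follow the
> QEMU tree of the time only approximately):
>
> static int colo_load_checkpoint(QEMUSizedBuffer *buffer)
> {
>     QEMUFile *f;
>
>     /*
>      * 1. The complete device state has already been received into
>      *    'buffer', so a primary failure in mid-transfer can no longer
>      *    leave the secondary with a half-loaded state.
>      */
>
>     /*
>      * 2. Reset first, so fields whose subsections were skipped as
>      *    "not needed" fall back to their initial values instead of
>      *    keeping stale ones from an earlier checkpoint.
>      */
>     qemu_system_reset(VMRESET_SILENT);
>
>     /* 3. Replay the buffered state into the devices. */
>     f = qemu_bufopen("r", buffer);
>     return qemu_loadvm_state(f);
> }
>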
> As described above, some values of the device state are zero and are
> ignored during migration. That is no problem for normal migration,
> because for the VM on the destination the initial values are zero too.
> But COLO performs more than one round of migration: the related values
> may change from non-zero back to zero, they are then ignored in the
> next checkpoint, and the VMState becomes inconsistent on the SVM.
Yes, this doesn't really surprise me; a lot of the migration code does
weird things like this, so reset is the safest way.
> The above error is caused directly by wrong value of 'async_pf_en_msr'.
>
> static const VMStateDescription vmstate_async_pf_msr = {
>     .name = "cpu/async_pf_msr",
>     .version_id = 1,
>     .minimum_version_id = 1,
>     .needed = async_pf_msr_needed,
>     .fields = (VMStateField[]) {
>         VMSTATE_UINT64(env.async_pf_en_msr, X86CPU),
>         VMSTATE_END_OF_LIST()
>     }
> };
>
> static bool async_pf_msr_needed(void *opaque)
> {
>     X86CPU *cpu = opaque;
>
>     return cpu->env.async_pf_en_msr != 0;
> }
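>
> To make the failure mode concrete, here is a hypothetical trace (the
> values are invented for illustration):
>
> /*
>  * checkpoint N:   PVM async_pf_en_msr = 0xfee00001 (non-zero)
>  *                 -> needed() true, subsection sent, SVM = 0xfee00001
>  * PVM reboots:    PVM async_pf_en_msr = 0
>  *                 -> needed() false, subsection skipped
>  * checkpoint N+1: SVM still holds 0xfee00001 while the PVM has 0;
>  *                 entering the guest with the stale MSR is what can
>  *                 produce the "KVM: entry failed" error above.
>  */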
>
> Some other VMState subsections for registers in CPUX86State have the
> same problem; we can't be sure they won't cause any trouble if their
> values are incorrect.
> So here we simply call qemu_system_reset() to avoid the inconsistency.
> Besides, compared with the most time-consuming operation (flushing RAM
> from the COLO cache to the SVM), the time consumed by
> qemu_system_reset() seems acceptable ;)
I've got a patch where I've tried to multithread the flush - it's made it
a little faster, but not as much as I hoped (~20ms down to ~16ms using 4
cores).
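
The sort of split I mean, as a minimal sketch (plain pthreads over an
already-collected array of dirty pages; the names and structure are
mine, not the actual patch):

#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define COLO_PAGE_SIZE 4096

typedef struct FlushJob {
    uint8_t **cache_pages;   /* source: COLO cache         */
    uint8_t **guest_pages;   /* destination: SVM RAM       */
    size_t begin, end;       /* half-open range of indices */
} FlushJob;

static void *flush_worker(void *opaque)
{
    FlushJob *job = opaque;
    size_t i;

    for (i = job->begin; i < job->end; i++) {
        memcpy(job->guest_pages[i], job->cache_pages[i], COLO_PAGE_SIZE);
    }
    return NULL;
}

/* Split the dirty-page list into roughly equal chunks, one per thread. */
static void flush_cache_parallel(uint8_t **cache, uint8_t **guest,
                                 size_t npages, int nthreads)
{
    pthread_t threads[nthreads];
    FlushJob jobs[nthreads];
    size_t chunk = (npages + nthreads - 1) / nthreads;
    int t;

    for (t = 0; t < nthreads; t++) {
        size_t b = (size_t)t * chunk;

        jobs[t].cache_pages = cache;
        jobs[t].guest_pages = guest;
        jobs[t].begin = b < npages ? b : npages;
        jobs[t].end = b + chunk < npages ? b + chunk : npages;
        pthread_create(&threads[t], NULL, flush_worker, &jobs[t]);
    }
    for (t = 0; t < nthreads; t++) {
        pthread_join(threads[t], NULL);
    }
}
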
> Another choice to fix the problem is to save the VMState ignoring the
> needed() return value, but this method is not so graceful:
> diff --git a/migration/vmstate.c b/migration/vmstate.c
> index e5388f0..7d15bba 100644
> --- a/migration/vmstate.c
> +++ b/migration/vmstate.c
> @@ -409,7 +409,7 @@ static void vmstate_subsection_save(QEMUFile *f, const VMStateDescription *vmsd,
>      bool subsection_found = false;
>
>      while (sub && sub->needed) {
> -        if (sub->needed(opaque)) {
> +        if (sub->needed(opaque) || migrate_in_colo_state()) {
>              const VMStateDescription *vmsd = sub->vmsd;
>              uint8_t len;
Maybe, but I suspect this will also find other strange cases in devices.
For example, without the reset I wouldn't really trust devices to load the
new state in; they wouldn't have been tested with all the states they
might have left themselves in previously.
Dave
> Thanks,
> Hailiang
>
> >Dave
> >
> >>
> >>I guess reset is normally off anyone's radar since it's outside
> >>the time anyone cares about, but perhaps the guys trying
> >>to make qemu start really quickly would be interested.
> >>
> >>Dave
> >>
> >>>
> >>>Patch status:
> >>>Unreviewed: patch 21,27,28,29,33,38
> >>>Updated: patch 31,34,35,37
> >>>
> >>>TODO:
> >>>1. Checkpoint based on proxy in qemu
> >>>2. The capability of continuous FT
> >>>3. Optimize the VM's downtime during checkpoint
> >>>
> >>>v15:
> >>> - Continue the shutdown process if an error is encountered while sending
> >>> the shutdown message to the SVM. (patch 24)
> >>> - Rename qemu_need_skip_netfilter to qemu_netfilter_can_skip and remove
> >>> some useless comments. (patch 31, Jason)
> >>> - Call object_new_with_props() directly to add filter in
> >>> colo_add_buffer_filter. (patch 34, Jason)
> >>> - Re-implement colo_set_filter_status() based on COLOBufferFilters
> >>> list. (patch 35)
> >>> - Re-implement colo_flush_filter_packets() based on COLOBufferFilters
> >>> list. (patch 37)
> >>>v14:
> >>> - Re-implement the network processing based on netfilter (Jason Wang)
> >>> - Rename 'COLOCommand' to 'COLOMessage'. (Markus's suggestion)
> >>> - Split two new patches (patch 27/28) from patch 29
> >>> - Fix some other comments from Dave and Markus.
> >>>
> >>>v13:
> >>> - Refactor colo_*_cmd helper functions to use 'Error **errp' parameter
> >>> instead of return value to indicate success or failure. (patch 10)
> >>> - Remove the optional error message for COLO_EXIT event. (patch 25)
> >>> - Use semaphore to notify colo/colo incoming loop that failover work is
> >>> finished. (patch 26)
> >>> - Move COLO shutdown related codes to colo.c file. (patch 28)
> >>> - Fix memory leak bug for colo incoming loop. (new patch 31)
> >>> - Re-use some existing helper functions to realize the process of
> >>> saving/loading ram and device state. (patch 32)
> >>> - Fix some other comments from Dave and Markus.
> >>>
> >>>zhanghailiang (38):
> >>> configure: Add parameter for configure to enable/disable COLO support
> >>> migration: Introduce capability 'x-colo' to migration
> >>> COLO: migrate colo related info to secondary node
> >>> migration: Integrate COLO checkpoint process into migration
> >>> migration: Integrate COLO checkpoint process into loadvm
> >>> COLO/migration: Create a new communication path from destination to
> >>> source
> >>> COLO: Implement colo checkpoint protocol
> >>> COLO: Add a new RunState RUN_STATE_COLO
> >>> QEMUSizedBuffer: Introduce two help functions for qsb
> >>> COLO: Save PVM state to secondary side when do checkpoint
> >>> COLO: Load PVM's dirty pages into SVM's RAM cache temporarily
> >>> ram/COLO: Record the dirty pages that SVM received
> >>> COLO: Load VMState into qsb before restore it
> >>> COLO: Flush PVM's cached RAM into SVM's memory
> >>> COLO: Add checkpoint-delay parameter for migrate-set-parameters
> >>> COLO: synchronize PVM's state to SVM periodically
> >>> COLO failover: Introduce a new command to trigger a failover
> >>> COLO failover: Introduce state to record failover process
> >>> COLO: Implement failover work for Primary VM
> >>> COLO: Implement failover work for Secondary VM
> >>> qmp event: Add COLO_EXIT event to notify users while exited from COLO
> >>> COLO failover: Shutdown related socket fd when do failover
> >>> COLO failover: Don't do failover during loading VM's state
> >>> COLO: Process shutdown command for VM in COLO state
> >>> COLO: Update the global runstate after going into colo state
> >>> savevm: Introduce two helper functions for save/find loadvm_handlers
> >>> entry
> >>> migration/savevm: Add new helpers to process the different stages of
> >>> loadvm
> >>> migration/savevm: Export two helper functions for savevm process
> >>> COLO: Separate the process of saving/loading ram and device state
> >>> COLO: Split qemu_savevm_state_begin out of checkpoint process
> >>> net/filter: Add a 'status' property for filter object
> >>> filter-buffer: Accept zero interval
> >>> net: Add notifier/callback for netdev init
> >>> COLO/filter: add each netdev a buffer filter
> >>> COLO: manage the status of buffer filters for PVM
> >>> filter-buffer: make filter_buffer_flush() public
> >>> COLO: flush buffered packets in checkpoint process or exit COLO
> >>> COLO: Add block replication into colo process
> >>>
> >>> configure | 11 +
> >>> docs/qmp-events.txt | 16 +
> >>> hmp-commands.hx | 15 +
> >>> hmp.c | 15 +
> >>> hmp.h | 1 +
> >>> include/exec/ram_addr.h | 1 +
> >>> include/migration/colo.h | 42 ++
> >>> include/migration/failover.h | 33 ++
> >>> include/migration/migration.h | 16 +
> >>> include/migration/qemu-file.h | 3 +-
> >>> include/net/filter.h | 5 +
> >>> include/net/net.h | 4 +
> >>> include/sysemu/sysemu.h | 9 +
> >>> migration/Makefile.objs | 2 +
> >>> migration/colo-comm.c | 76 ++++
> >>> migration/colo-failover.c | 83 ++++
> >>> migration/colo.c | 866 ++++++++++++++++++++++++++++++++++++++++++
> >>> migration/migration.c | 109 +++++-
> >>> migration/qemu-file-buf.c | 61 +++
> >>> migration/ram.c | 175 ++++++++-
> >>> migration/savevm.c | 114 ++++--
> >>> net/filter-buffer.c | 14 +-
> >>> net/filter.c | 40 ++
> >>> net/net.c | 33 ++
> >>> qapi-schema.json | 104 ++++-
> >>> qapi/event.json | 15 +
> >>> qemu-options.hx | 4 +-
> >>> qmp-commands.hx | 23 +-
> >>> stubs/Makefile.objs | 1 +
> >>> stubs/migration-colo.c | 54 +++
> >>> trace-events | 8 +
> >>> vl.c | 31 +-
> >>> 32 files changed, 1908 insertions(+), 76 deletions(-)
> >>> create mode 100644 include/migration/colo.h
> >>> create mode 100644 include/migration/failover.h
> >>> create mode 100644 migration/colo-comm.c
> >>> create mode 100644 migration/colo-failover.c
> >>> create mode 100644 migration/colo.c
> >>> create mode 100644 stubs/migration-colo.c
> >>>
> >>>--
> >>>1.8.3.1
> >>>
> >>>
> >>--
> >>Dr. David Alan Gilbert / address@hidden / Manchester, UK
> >--
> >Dr. David Alan Gilbert / address@hidden / Manchester, UK
> >
> >
>
>
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK