Re: [Qemu-devel] RFC Multi-threaded TCG design document
From: Alex Bennée
Subject: Re: [Qemu-devel] RFC Multi-threaded TCG design document
Date: Mon, 15 Jun 2015 13:36:07 +0100
Mark Burton <address@hidden> writes:
> I think we SHOULD use the wiki - and keep it current. A lot of what you have
> is in the wiki too, but I’d like to see the wiki updated.
> We will add our stuff there too…
I'll do a pass today and update it to point to lists, discussions and
WIP trees.
>
> Cheers
> Mark.
>
>
>
>> On 15 Jun 2015, at 12:06, Alex Bennée <address@hidden> wrote:
>>
>>
>> Frederic Konrad <address@hidden> writes:
>>
>>> On 12/06/2015 18:37, Alex Bennée wrote:
>>>> Hi,
>>>
>>> Hi Alex,
>>>
>>> I've completed some of the points below. We will also work on a design
>>> decisions
>>> document to add to this one.
>>>
>>> We probably want to merge that with what we did on the wiki?
>>> http://wiki.qemu.org/Features/tcg-multithread
>>
>> Well hopefully there is cross-over as I started with the wiki as a basic
>> ;-)
>>
>> Do we want to just keep the wiki as the live design document or put
>> pointers to the current drafts? I'm hoping eventually the page will just
>> point to the design in the doc directory at git.qemu.org.
>>
>>>> One thing that Peter has been asking for is a design document for the
>>>> way we are going to approach multi-threaded TCG emulation. I started
>>>> with the information that was captured on the wiki and tried to build on
>>>> that. It's almost certainly incomplete but I thought it would be worth
>>>> posting for wider discussion early rather than later.
>>>>
>>>> One obvious omission at the moment is the lack of discussion about other
>>>> non-TLB shared data structures in QEMU (I'm thinking of the various
>>>> dirty page tracking bits, I'm sure there is more).
>>>>
>>>> I've also deliberately tried to avoid documenting the design decisions
>>>> made in the current Greensoc's patch series. This is so we can
>>>> concentrate on the big picture before getting side-tracked into the
>>>> implementation details.
>>>>
>>>> I have now started digging into the Greensocs code in earnest and the
>>>> plan is eventually the design and the implementation will converge on a
>>>> final documented complete solution ;-)
>>>>
>>>> Anyway as ever I look forward to the comments and discussion:
>>>>
>>>> STATUS: DRAFTING
>>>>
>>>> Introduction
>>>> ============
>>>>
>>>> This document outlines the design for multi-threaded TCG emulation.
>>>> The original TCG implementation was single-threaded and dealt with
>>>> multiple CPUs using simple round-robin scheduling. This simplified a
>>>> lot of things but became increasingly limited as emulated systems
>>>> gained additional cores and per-core performance gains for host
>>>> systems started to level off.
>>>>
>>>> Memory Consistency
>>>> ==================
>>>>
>>>> Between emulated guests and host systems there are a range of memory
>>>> consistency models. While emulating weakly ordered systems on strongly
>>>> ordered hosts shouldn't cause any problems, the same is not true for
>>>> the reverse setup.
>>>>
>>>> The proposed design currently does not address the problem of
>>>> emulating strong ordering on a weakly ordered host although even on
>>>> strongly ordered systems software should be using synchronisation
>>>> primitives to ensure correct operation.
>>>>
>>>> Memory Barriers
>>>> ---------------
>>>>
>>>> Barriers (sometimes known as fences) provide a mechanism for software
>>>> to enforce a particular ordering of memory operations from the point
>>>> of view of external observers (e.g. another processor core). They can
>>>> apply to all memory operations or be restricted to just loads or stores.
>>>>
>>>> The Linux kernel has an excellent write-up on the various forms of
>>>> memory barrier and the guarantees they can provide [1].
>>>>
>>>> Barriers are often wrapped around synchronisation primitives to
>>>> provide explicit memory ordering semantics. However they can be used
>>>> by themselves to provide safe lockless access by ensuring for example
>>>> a signal flag will always be set after a payload.
>>>>
>>>> DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
>>>>
>>>> This would enforce a strong load/store ordering so all loads/stores
>>>> complete at the memory barrier. On single-core non-SMP strongly
>>>> ordered backends this could become a NOP.
>>>>
>>>> There may be a case for further refinement if this causes performance
>>>> bottlenecks.
>>>>
>>>> Memory Control and Maintenance
>>>> ------------------------------
>>>>
>>>> This includes a class of instructions for controlling system cache
>>>> behaviour. While QEMU doesn't model cache behaviour, these instructions
>>>> are often seen when code modification has taken place to ensure the
>>>> changes take effect.
>>>>
>>>> Synchronisation Primitives
>>>> --------------------------
>>>>
>>>> There are two broad types of synchronisation primitives found in
>>>> modern ISAs: atomic instructions and exclusive regions.
>>>>
>>>> The first type offers a single atomic instruction which guarantees
>>>> that some sort of test-and-conditional-store will be truly atomic
>>>> w.r.t. other cores sharing access to the memory. The classic example
>>>> is the x86 cmpxchg instruction.
>>>>
>>>> The second type offers a pair of load/store instructions which
>>>> guarantee that a region of memory has not been touched between the
>>>> load and store instructions. An example of this is ARM's ldrex/strex
>>>> pair, where the strex instruction will report a successful store only
>>>> if no other CPU has accessed the memory region since the ldrex.
>>>>
>>>> Traditionally TCG has generated a series of operations that work
>>>> because they are within the context of a single translation block so
>>>> will have completed before another CPU is scheduled. However with
>>>> the ability to have multiple threads running to emulate multiple CPUs
>>>> we will need to explicitly expose these semantics.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>> - atomics
>>>> - Introduce some atomic TCG ops for the common semantics
>>>> - The default fallback helper function will use qemu_atomics
>>>> - Each backend can then add a more efficient implementation
>>>> - load/store exclusive
>>>> [AJB:
>>>> There are currently a number of proposals of interest:
>>>> - Greensocs tweaks to ldst ex (using locks)
>>>> - Slow-path for atomic instruction translation [2]
>>>> - Helper-based Atomic Instruction Emulation (AIE) [3]
>>>> ]
>>>>
>>>>
>>>> Shared Data Structures
>>>> ======================
>>>>
>>>> Global TCG State
>>>> ----------------
>>>>
>>>> We need to protect the entire code generation cycle including any post
>>>> generation patching of the translated code. This also implies a shared
>>>> translation buffer which contains code running on all cores. Any
>>>> execution path that comes to the main run loop will need to hold a
>>>> mutex for code generation. This also includes times when we need to
>>>> flush code or jumps from the tb_cache.
>>>>
>>>> DESIGN REQUIREMENT: Add locking around all code generation, patching
>>>> and jump cache modification
>>> Actually from my point of view jump cache modification requires more
>>> than a lock, as other VCPU threads can be executing code during the
>>> modification.
>>>
>>> Fortunately this happens "only" with tlb_flush, tlb_page_flush,
>>> tb_flush and tb_invalidate, which all need the CPUs to be halted anyway.
>>
>> How about:
>>
>> DESIGN REQUIREMENT:
>> - Code generation and patching will be protected by a lock
>> - Jump cache modification will assert all CPUs are halted
>>
>>>>
>>>> Memory maps and TLBs
>>>> --------------------
>>>>
>>>> The memory handling code is fairly critical to the speed of memory
>>>> access in the emulated system.
>>>>
>>>> - Memory regions (dividing up access to PIO, MMIO and RAM)
>>>> - Dirty page tracking (for code gen, migration and display)
>>>> - Virtual TLB (for translating guest address->real address)
>>>>
>>>> There is both a fast path walked by the generated code and a slow
>>>> path taken when resolution is required. When the TLB tables are updated we
>>>> need to ensure they are done in a safe way by bringing all executing
>>>> threads to a halt before making the modifications.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>>
>>>> - TLB Flush All/Page
>>>> - can be across-CPUs
>>>> - will need all other CPUs brought to a halt
>>>> - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>>>> - This is a per-CPU table - by definition can't race
>>>> - updated by its own thread when the slow-path is forced
>>> Actually, as we have approximately the same behaviour for all of these
>>> memory handling operations, e.g. tb_flush, tb_*_invalidate and
>>> tlb_*_flush, which all play with the TranslationBlock and the jump
>>> cache across CPUs, I think we have to add a generic "exit and do
>>> something" mechanism for the CPU threads.
>>> So every VCPU thread has a list of things to do when it exits (such as
>>> clearing its own tb_jmp_cache during a tlb_flush, or waiting for other
>>> CPUs and flushing only one entry for tb_invalidate).
>>
>> Sounds like I should write an additional section to describe the process
>> of halting CPUs and carrying out deferred per-CPU actions as well as
>> ensuring we can tell when they are all halted.
>>
>>>> Emulated hardware state
>>>> -----------------------
>>>>
>>>> Currently the hardware emulation has no protection against
>>>> multiple accesses. However guest systems accessing emulated hardware
>>>> should be carrying out their own locking to prevent multiple CPUs
>>>> confusing the hardware. Of course there is no guarantee that there
>>>> couldn't be a broken guest that doesn't lock, so you could get racing
>>>> accesses to the hardware.
>>>>
>>>> There is a class of paravirtualized hardware (VIRTIO) that works in
>>>> a purely MMIO mode, often setting flags directly in guest memory as
>>>> the result of a guest-triggered transaction.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>>
>>>> - Access to IO Memory should be serialised by an IOMem mutex
>>>> - The mutex should be recursive (e.g. allowing the same thread to re-lock it)
>>> That might be done with the global mutex as it is today?
>>> We need changes here anyway to have VCPU threads running in parallel.
>>
>> I'm not sure re-using the global mutex is a good idea. I've had to hack
>> the global mutex to allow recursive locking to get around the virtio
>> hang I discovered last week. While it works I'm uneasy making such a
>> radical change upstream given how widely the global mutex is used hence
>> the suggestion to have an explicit IOMem mutex.
>>
>> Actually I'm surprised the iothread mutex just re-uses the global one.
>> I guess I need to talk to the IO guys as to why they took that
>> decision.
>>
>>>
>>> Thanks,
>>
>> Thanks for your quick review :-)
>>
>>> Fred
>>>
>>>> IO Subsystem
>>>> ------------
>>>>
>>>> The I/O subsystem is heavily used by KVM and has seen a lot of
>>>> improvements to offload I/O tasks to dedicated IOThreads. There should
>>>> be no additional locking required once we reach the Block Driver.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>>
>>>> - The dataplane should continue to be protected by the iothread locks
>>>>
>>>>
>>>> References
>>>> ==========
>>>>
>>>> [1]
>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
>>>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
>>>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
>>>>
>>>>
>>>>
>>
>> --
>> Alex Bennée
>
>
--
Alex Bennée