Re: [Qemu-devel] RFC Multi-threaded TCG design document
From: Mark Burton
Subject: Re: [Qemu-devel] RFC Multi-threaded TCG design document
Date: Mon, 15 Jun 2015 12:51:27 +0200
I think we should use the wiki - and keep it current. A lot of what you have is
in the wiki too, but I’d like to see the wiki updated.
We will add our stuff there too…
Cheers
Mark.
> On 15 Jun 2015, at 12:06, Alex Bennée <address@hidden> wrote:
>
>
> Frederic Konrad <address@hidden> writes:
>
>> On 12/06/2015 18:37, Alex Bennée wrote:
>>> Hi,
>>
>> Hi Alex,
>>
>> I've completed some of the points below. We will also work on a design
>> decisions
>> document to add to this one.
>>
>> We probably want to merge that with what we did on the wiki?
>> http://wiki.qemu.org/Features/tcg-multithread
>
> Well hopefully there is cross-over as I started with the wiki as a basis
> ;-)
>
> Do we want to just keep the wiki as the live design document or put
> pointers to the current drafts? I'm hoping eventually the page will just
> point to the design in the doc directory at git.qemu.org.
>
>>> One thing that Peter has been asking for is a design document for the
>>> way we are going to approach multi-threaded TCG emulation. I started
>>> with the information that was captured on the wiki and tried to build on
>>> that. It's almost certainly incomplete but I thought it would be worth
>>> posting for wider discussion early rather than later.
>>>
>>> One obvious omission at the moment is the lack of discussion about other
>>> non-TLB shared data structures in QEMU (I'm thinking of the various
>>> dirty page tracking bits, I'm sure there are more).
>>>
>>> I've also deliberately tried to avoid documenting the design decisions
>>> made in the current Greensoc's patch series. This is so we can
>>> concentrate on the big picture before getting side-tracked into the
>>> implementation details.
>>>
>>> I have now started digging into the Greensocs code in earnest and the
>>> plan is eventually the design and the implementation will converge on a
>>> final documented complete solution ;-)
>>>
>>> Anyway as ever I look forward to the comments and discussion:
>>>
>>> STATUS: DRAFTING
>>>
>>> Introduction
>>> ============
>>>
>>> This document outlines the design for multi-threaded TCG emulation.
>>> The original TCG implementation was single threaded and dealt with
>>> multiple CPUs using simple round-robin scheduling. This simplified a
>>> lot of things but became increasingly limited as systems being
>>> emulated gained additional cores and per-core performance gains for host
>>> systems started to level off.
>>>
>>> Memory Consistency
>>> ==================
>>>
>>> Between emulated guests and host systems there are a range of memory
>>> consistency models. While emulating weakly ordered systems on strongly
>>> ordered hosts shouldn't cause any problems the same is not true for
>>> the reverse setup.
>>>
>>> The proposed design currently does not address the problem of
>>> emulating strong ordering on a weakly ordered host, although even on
>>> strongly ordered systems software should be using synchronisation
>>> primitives to ensure correct operation.
>>>
>>> Memory Barriers
>>> ---------------
>>>
>>> Barriers (sometimes known as fences) provide a mechanism for software
>>> to enforce a particular ordering of memory operations from the point
>>> of view of external observers (e.g. another processor core). They can
>>> apply to all memory operations or to just loads or stores.
>>>
>>> The Linux kernel has an excellent write-up on the various forms of
>>> memory barrier and the guarantees they can provide [1].
>>>
>>> Barriers are often wrapped around synchronisation primitives to
>>> provide explicit memory ordering semantics. However they can be used
>>> by themselves to provide safe lockless access by ensuring for example
>>> a signal flag will always be set after a payload.
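The flag-after-payload pattern mentioned above can be sketched with C11 atomics rather than QEMU's own barrier macros (the function names here are illustrative, not from the QEMU tree):

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;
static atomic_bool ready;

/* Producer: the release store guarantees the payload write is visible
 * before the flag is observed as set. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: the acquire load pairs with the release store, so once we
 * see the flag we are guaranteed to also see the payload. */
bool try_consume(int *out)
{
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        *out = payload;
        return true;
    }
    return false;
}
```

The release/acquire pair is exactly the ordering guarantee a tcg_memory_barrier op would have to preserve when translating such guest code.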
>>>
>>> DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
>>>
>>> This would enforce a strong load/store ordering so all loads/stores
>>> complete at the memory barrier. On single-core non-SMP strongly
>>> ordered backends this could become a NOP.
>>>
>>> There may be a case for further refinement if this causes performance
>>> bottlenecks.
>>>
>>> Memory Control and Maintenance
>>> ------------------------------
>>>
>>> This includes a class of instructions for controlling system cache
>>> behaviour. While QEMU doesn't model cache behaviour, these instructions
>>> are often seen when code modification has taken place, to ensure the
>>> changes take effect.
>>>
>>> Synchronisation Primitives
>>> --------------------------
>>>
>>> There are two broad types of synchronisation primitives found in
>>> modern ISAs: atomic instructions and exclusive regions.
>>>
>>> The first type offers a single atomic instruction which guarantees
>>> that some sort of test-and-conditional-store will be truly atomic w.r.t.
>>> other cores sharing access to the memory. The classic example is the
>>> x86 cmpxchg instruction.
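The cmpxchg-style semantics can be expressed with C11 atomics; a fallback helper along these lines is roughly what the qemu_atomics route would provide (names here are illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Atomically: if *addr == expected, store desired and return true;
 * otherwise leave *addr unchanged and return false. This is the
 * compare-and-swap guarantee an x86 cmpxchg provides w.r.t. other
 * cores sharing the memory. */
bool cas_int(atomic_int *addr, int expected, int desired)
{
    return atomic_compare_exchange_strong(addr, &expected, desired);
}
```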
>>>
>>> The second type offers a pair of load/store instructions which provide
>>> a guarantee that a region of memory has not been touched between the
>>> load and store instructions. An example of this is ARM's ldrex/strex
>>> pair where the strex instruction will return a flag indicating a
>>> successful store only if no other CPU has accessed the memory region
>>> since the ldrex.
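As a rough illustration of the exclusive-pair semantics, here is a minimal lock-based ldrex/strex emulation sketch in plain C, in the spirit of the lock-based proposals listed below. All names are hypothetical, and a real implementation would also have to clear the exclusive monitor on ordinary stores from other CPUs:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

static pthread_mutex_t excl_lock = PTHREAD_MUTEX_INITIALIZER;
static uintptr_t excl_addr;   /* address marked exclusive, 0 if none */
static uint32_t excl_val;     /* value observed by the load-exclusive */

/* Load-exclusive: remember the address and the value we saw. */
uint32_t emu_ldrex(uint32_t *addr)
{
    pthread_mutex_lock(&excl_lock);
    excl_addr = (uintptr_t)addr;
    excl_val = *addr;
    pthread_mutex_unlock(&excl_lock);
    return excl_val;
}

/* Store-exclusive: succeed (return true) only if the location is still
 * marked exclusive and holds the value the load-exclusive observed. */
bool emu_strex(uint32_t *addr, uint32_t newval)
{
    bool ok = false;
    pthread_mutex_lock(&excl_lock);
    if (excl_addr == (uintptr_t)addr && *addr == excl_val) {
        *addr = newval;
        ok = true;
    }
    excl_addr = 0;  /* clear the monitor whether we succeeded or not */
    pthread_mutex_unlock(&excl_lock);
    return ok;
}
```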
>>>
>>> Traditionally TCG has generated a series of operations that work
>>> because they are within the context of a single translation block and
>>> so will have completed before another CPU is scheduled. However with
>>> the ability to have multiple threads running to emulate multiple CPUs
>>> we will need to explicitly expose these semantics.
>>>
>>> DESIGN REQUIREMENTS:
>>> - atomics
>>> - Introduce some atomic TCG ops for the common semantics
>>> - The default fallback helper function will use qemu_atomics
>>> - Each backend can then add a more efficient implementation
>>> - load/store exclusive
>>> [AJB:
>>> There are currently a number of proposals of interest:
>>> - Greensocs tweaks to ldst ex (using locks)
>>> - Slow-path for atomic instruction translation [2]
>>> - Helper-based Atomic Instruction Emulation (AIE) [3]
>>> ]
>>>
>>>
>>> Shared Data Structures
>>> ======================
>>>
>>> Global TCG State
>>> ----------------
>>>
>>> We need to protect the entire code generation cycle including any post
>>> generation patching of the translated code. This also implies a shared
>>> translation buffer which contains code running on all cores. Any
>>> execution path that comes to the main run loop will need to hold a
>>> mutex for code generation. This also includes times when we need to
>>> flush code or jumps from the tb_cache.
>>>
>>> DESIGN REQUIREMENT: Add locking around all code generation, patching
>>> and jump cache modification
>> Actually from my point of view jump cache modification requires more than
>> a lock, as other VCPU threads can be executing code during the modification.
>>
>> Fortunately this happens "only" with tlb_flush, tlb_page_flush, tb_flush and
>> tb_invalidate, which need all CPUs to be halted anyway.
>
> How about:
>
> DESIGN REQUIREMENT:
> - Code generation and patching will be protected by a lock
> - Jump cache modification will assert all CPUs are halted
>
>>>
>>> Memory maps and TLBs
>>> --------------------
>>>
>>> The memory handling code is fairly critical to the speed of memory
>>> access in the emulated system.
>>>
>>> - Memory regions (dividing up access to PIO, MMIO and RAM)
>>> - Dirty page tracking (for code gen, migration and display)
>>> - Virtual TLB (for translating guest address->real address)
>>>
>>> There is both a fast path walked by the generated code and a slow
>>> path when resolution is required. When the TLB tables are updated we
>>> need to ensure they are done in a safe way by bringing all executing
>>> threads to a halt before making the modifications.
>>>
>>> DESIGN REQUIREMENTS:
>>>
>>> - TLB Flush All/Page
>>> - can be across-CPUs
>>> - will need all other CPUs brought to a halt
>>> - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>>> - This is a per-CPU table - by definition can't race
>>> - updated by its own thread when the slow-path is forced
>> Actually, as we have approximately the same behaviour for all of these
>> memory handling operations, e.g. tb_flush, tb_*_invalidate and tlb_*_flush,
>> which all play with the TranslationBlock and the jump cache across CPUs,
>> I think we have to add a generic "exit and do something" mechanism for the
>> VCPU threads.
>> So every VCPU thread would have a list of things to do when it exits (such
>> as clearing its own tb_jmp_cache during a tlb_flush, or waiting for other
>> CPUs and flushing only one entry for tb_invalidate).
>
> Sounds like I should write an additional section to describe the process
> of halting CPUs and carrying out deferred per-CPU actions as well as
> ensuring we can tell when they are all halted.
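As a strawman for such a section, the "exit and do something" mechanism could look roughly like this sketch in plain pthreads (all names hypothetical; a real version would also kick the vCPU out of its current translation block when work is queued):

```c
#include <pthread.h>
#include <stdlib.h>

typedef void (*cpu_work_fn)(void *opaque);

struct cpu_work {
    cpu_work_fn fn;
    void *opaque;
    struct cpu_work *next;
};

struct vcpu {
    pthread_mutex_t work_lock;
    struct cpu_work *work;    /* pending actions, drained on exit */
};

/* Queue a deferred action for a vCPU from any thread. */
void vcpu_queue_work(struct vcpu *cpu, cpu_work_fn fn, void *opaque)
{
    struct cpu_work *w = malloc(sizeof(*w));
    w->fn = fn;
    w->opaque = opaque;
    pthread_mutex_lock(&cpu->work_lock);
    w->next = cpu->work;
    cpu->work = w;
    pthread_mutex_unlock(&cpu->work_lock);
}

/* Called by the vCPU thread itself when it leaves the run loop,
 * e.g. to clear its own tb_jmp_cache during a tlb_flush. */
void vcpu_drain_work(struct vcpu *cpu)
{
    pthread_mutex_lock(&cpu->work_lock);
    struct cpu_work *w = cpu->work;
    cpu->work = NULL;
    pthread_mutex_unlock(&cpu->work_lock);
    while (w) {
        struct cpu_work *next = w->next;
        w->fn(w->opaque);
        free(w);
        w = next;
    }
}

/* Example deferred action: just count invocations via opaque. */
static void count_action(void *opaque)
{
    (*(int *)opaque)++;
}
```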
>
>>> Emulated hardware state
>>> -----------------------
>>>
>>> Currently the hardware emulation has no protection against multiple
>>> accesses. However guest systems accessing emulated hardware should be
>>> carrying out their own locking to prevent multiple CPUs confusing the
>>> hardware. Of course there is no guarantee that there couldn't be a
>>> broken guest that doesn't lock, in which case you could get racing
>>> accesses to the hardware.
>>>
>>> There is the class of paravirtualized hardware (VIRTIO) that works in
>>> a purely mmio mode, often setting flags directly in guest memory as a
>>> result of a guest-triggered transaction.
>>>
>>> DESIGN REQUIREMENTS:
>>>
>>> - Access to IO Memory should be serialised by an IOMem mutex
>>> - The mutex should be recursive (e.g. allowing the same thread to relock it)
>> That might be done with the global mutex as it is today?
>> We need changes here anyway to have VCPU threads running in parallel.
>
> I'm not sure re-using the global mutex is a good idea. I've had to hack
> the global mutex to allow recursive locking to get around the virtio
> hang I discovered last week. While it works I'm uneasy making such a
> radical change upstream given how widely the global mutex is used hence
> the suggestion to have an explicit IOMem mutex.
>
> Actually I'm surprised the iothread mutex just re-uses the global one.
> I guess I need to talk to the IO guys as to why they took that
> decision.
>
>>
>> Thanks,
>
> Thanks for your quick review :-)
>
>> Fred
>>
>>> IO Subsystem
>>> ------------
>>>
>>> The I/O subsystem is heavily used by KVM and has seen a lot of
>>> improvements to offload I/O tasks to dedicated IOThreads. There should
>>> be no additional locking required once we reach the Block Driver.
>>>
>>> DESIGN REQUIREMENTS:
>>>
>>> - The dataplane should continue to be protected by the iothread locks
>>>
>>>
>>> References
>>> ==========
>>>
>>> [1]
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
>>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
>>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
>>>
>>>
>>>
>
> --
> Alex Bennée