From: Gopal V
Subject: Random Docs about Pnet's JIT (was: [DotGNU]WHere does CVM fit into the pnet picture?)
Date: Sun, 6 Jul 2003 23:36:23 +0530
User-agent: Mutt/1.2.5i

If memory serves me right, Gopal V wrote:
> *ahem* ... I think a dia or other picture of how pnet engine works
> is in order ?

Ok, since nobody else wanted to do it ... I did it :)

/me hints that minddog & mdupont should note this down ...

Main Engine
-----------
http://symonds.net/~gopalv82/code/pnet-engine-1.jpg

The main engine has been crafted to operate with minimum overhead,
with both the CVM coder and the native unroller operating in O(N)
(the holy grail?). The CVM conversion does require extra information
about the stack and the local variable states, but since each
instruction is converted to CVM inline with verification, we save
an extra pass for finding that information. We also save a significant
amount of processing by avoiding the creation and traversal of
tree-based structures. The system therefore performs faster in
comparison to a tree-based JIT design as the cyclomatic complexity
of the code increases (making SSA construction harder and harder).
This in turn saves memory and ensures that the first result is
obtained as quickly as possible.

In fact the unroller can be considered a level 1 JIT, in which
instructions that are hard to JIT run as jumps into the interpreter
core (i.e. COP_CALL makes a jump to the static code segment region
corresponding to "VMCASE(COP_CALL)"). So we don't interpret; we
jump from JIT code to interpreter code just as if we had JIT'd it
normally. This incremental JIT system can be retrofitted with a
level 2 JIT which compiles selectively based on profile data,
getting the best of both worlds :)
(an example of this can be seen in Sun's HotSpot implementation)

So the percentage of conversion/JIT time per single execution
for Portable.net is generally very low. For tight loops the issue
is slightly different. Fortunately, very few real-life apps have
tight loops, but very unfortunately all wheelchair benchmarks
consist exclusively of tight loops, making this approach look bad.
Optimising for such benchmarks only goes on to slow down real apps.

Execution
---------
http://symonds.net/~gopalv82/code/pnet-engine-2.jpg

The execution of direct threaded code (which is the fastest of
pnet's engine flavours) combines JIT'd code with interpreter code
in such a way that the thunk (or cross jump) between the two is
the only overhead we have in comparison to a full JIT scenario.
During this thunk the register operands are flushed to memory and
loaded back, and a jump instruction is executed. Other than this
minor change, the system looks identical.

The unroller also has some local optimisations possible which
allow for optimising the store/fetch and fetch/fetch cases. For
example, the "stloc.1 ldloc.1" sequence involves only a single
memory operation (i.e. it goes reg->mem, reg->reg instead of
reg->mem, mem->reg). Similarly for "ldloc.1 ldloc.1 ldloc.1",
etc. All these small and simple things add up to a significant
speed improvement with minimal development time.

Pnet's x86 unroller (or mixed JIT) was written in under 2 weeks
and the ARM one in about a week and a bit, all by one person (ok,
that was Rhys, but that's beside the point ;). This should
drop to even less when the "generic" unroller system gets working,
which has a basic interface of simple calls like

                /* add an immediate value to a register */
                md_add_reg_imm(inst,reg,imm)

and combined register allocation by default. A significant part of
it is already in CVS, but not enabled; look for the md_<cpu>* files.
When this is finished, it should be child's play (of course the
"child" has to know the CPU or have good docs ;) to plug in a new
CPU backend and fill in the register priority lists. Portable.net
is full of frontend-backend stuff, even 3-tier systems like IL->CVM->x86.
It's all modular, and the beauty of it is how it all fits together.

The code in the interpreter loop is often better optimised and
faster than JIT'd code, since that code is gcc-generated and
optimised. So, by jumping to this code using computed gotos, we
remove almost all the disadvantages of a portable interpreter with
respect to speed. CVM, being non-polymorphic, is more amenable to
interpretation than polymorphic IL. All in all, this trade-off of
speed and portability makes for a jolly good show :)

> PS: I wonder what ^Tum would have faced if he had to hack "monitorenter"
>     for x86, ARM and god knows what JIT separately :)

And that too :)

All this falls into place if you've already read Rhys's nice paper
on Portable.net engine design.

Gopal
-- 
The difference between insanity and genius is measured by success

