gcl-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Gcl-devel] FW: jhc vs ghc and the surprising result involving ghc gener


From: Mike Thomas
Subject: [Gcl-devel] FW: jhc vs ghc and the surprising result involving ghc generatedassembly.
Date: Thu, 27 Oct 2005 10:12:53 +1000

Camm,

FYI regarding code generation and helping gcc to do a good job on code 
generated by GCL. It remains to be seen what further debate on this issue turns 
up but the basics principle of emitting higher level C flow control constructs 
rather than gotos and labels is probably worth noting.

Cheers

Mike Thomas

-----Original Message-----
From: address@hidden [mailto:address@hidden On Behalf Of John Meacham
Sent: Wednesday, 26 October 2005 7:44 PM
To: address@hidden
Subject: jhc vs ghc and the surprising result involving ghc generatedassembly.


(I apologize in advance if this message seems self congradulatory, but after a 
long time of being disheartened by jhc only having marginal gains over ghc, I 
am finally seeing some substantial benefits, many of which are the result of 
optimizations that can actually be ported back to ghc)


So, what started out as a simple pat on my own back for getting strictness and 
CPR finally working turned into an adventure in assembly language.  An 
incidental comparason between jhc and ghc's output completly surprised me, for 
tight mathematical inner loops jhc is producing code that runs 3-7x faster than 
ghc.

This surprised me because I expected them to be identical, they determine the 
same strictness and produce the same worker/wrapper split, but due to a quirk 
of how ghc generates C code, it ends up producing an ultimatly slower result.
fortunatly, I think the problem is easy to mitigate. (and jhc loses its lead 
once again :) )

(all the following snippets are unedited/unreformatted output of the programs 
specified except for the removal of some profiling tags)


the motivating example is the plain old basic accumulating factorial function:

fac :: Int -> Int -> Int
fac 1 r = r
fac n r = fac (n - 1) (n*r)

the strictness info is correctly infered that fac is strict in all its 
arguments and returns a CAF so it is translated into a nice and speedy version 
that takes unboxed arguments and returns unboxed results.

here is the C code that jhc generates. (As an aside, I am very proud of how 
readable and how much structure the jhc generated C code preserves of the 
original haskell. it's a small thing, perhaps only implementors appreciate it, 
but I am glad I spent the time needed to do so.)

/* address@hidden */
static int
fWXAXDfMainXDfac(int v94, int v95)
{
        int v96;
        int v97;
        int v98;
        int v99;
        switch (v94) {
        case 1:
            return v95;
            break;
        default:
            v96 = v94;
            v97 = (int)(v94 - 1);
            v98 = (int)(v94 * v95);
            v99 = fWXAXDfMainXDfac(v97, v98);
            return v99;
        break;
        }
}

notice that besides being a bit verbose and using a tailcall, this is exactly 
what A C programmer would write. In fact, the generated assembly is quite nice 
and I think perhaps provably optimal even compared to hand-coded assembly :)

jhc assembly output:

fWXAXDfMainXDfac:
.L108:
        cmpl    $1, %edi
        je      .L110
        imull   %edi, %esi
        decl    %edi
        jmp     .L108
.L110:
        movl    %esi, %eax
        ret


notice, a tiny inner loop, 4 instructions and one conditional jump. no memory 
acceses whatsoever.


now, lets look at what ghc does with the same function:

the generated C code:

EI_(Main_zdwfac_info);
FN_(Main_zdwfac_entry) {
W_ _s27g;
W_ _s27i;
W_ _s27l;
FB_
_s27g = *Sp;
if (_s27g != 0x1UL) goto _c282;
R1.p = (P_)(Sp[1]);
Sp=Sp+2;
JMP_(*Sp);
_c282:
_s27l = _s27g * (Sp[1]);
_s27i = _s27g - 0x1UL;
Sp[1] = _s27l;
*Sp = _s27i;
JMP_((W_)&Main_zdwfac_info);
FE_
}

it looks complicated, but what it effectivly does is pop the arguments off the 
stack, run the code but with an explicit jump rather than recursion and push 
the result back onto the stack. other than the stack stuff at the beginnig and 
end, we would expect this to get compiled to roughly the same assembly as the 
jhc version with a nice tight inner loop and just some stack futzing 
boilerplate.

and now the generated assembly.

Main_zdwfac_info:
.text
        .align 8
        .text
        movq    (%rbp), %rdx
        cmpq    $1, %rdx
        jne     .L2
        movq    8(%rbp), %r13
        leaq    16(%rbp), %rbp
        movq    (%rbp), %rax
.L4:
        jmp     *%rax
.L2:
        movq    %rdx, %rax
        imulq   8(%rbp), %rax
        movq    %rax, 8(%rbp)
        leaq    -1(%rdx), %rax
        movq    %rax, (%rbp)
        movl    $Main_zdwfac_info, %eax
        jmp     .L4


ack! lets count what happens on each iteration of the loop: we have 5 (!) 
memory acceses and two jumps, one of them being indirect! in fact, each time 
through the loop, it loads the same values into the same registers.


this is really bad compared to the jhc assembly. the basic issue is an 
interaction between ghc's use of global registers, a stack, and indirect calls.

an indirect jump is very expensive. modern processors are pipelined, having 
many instructions in the queue at once, looking ahead and beginning to evaluate 
what is coming up. when an indirect jump is encountered, the CPU has no choice 
but to flush the whole pipeline because it has no idea where it goes, even with 
conditional branches cpus can predict with good accuracy which branch will be 
taken and at worst, it discards the pipeline, but with an indirect jump, there 
is no chance of it having a full pipeline.


However the major issue is the following.  %rbp is the global stack pointer 
pointing to the STG stack, since it is global it can be modified from anywhere, 
since gcc can't know if the function it jumps to modified said register it has 
no choice but to load all values relative to it every time through the loop 
because it may have changed.

furthermore gotos and labels are very problematic for gcc to optimize around.
for various tiresome reasons gcc cannot perform (most) code motion 
optimizations across explicit labels and gotos, especially when they deal with 
the global register variables and memory stores and loads. since these are 
arguably some of the most important optimizations, this is quite bad for the 
generated assembly.

loop:

if () goto loop;

is not equivalent to a do-while loop, loop invarients cannot be hoisted out of 
the above for instance (except in some cases... it is all quite tricky and we 
want gcc to have as much freedom as possible).

all in all, this makes the code generated by gcc compiling something generated 
by ghc not very good at all.

there are a couple of things we can do to mitigate these problems:

get rid of indirect jumps whenever possible.

use C control constructs rather than gotos.


A couple simple rules seem to help greatly.

* turn anything of the form JMP_((W_)&self) where self is oneself into a goto 
that gotos a label at the beginning of the function.

* do simple pattern matthing on the basic blocks to recognize where C control 
constructs can be placed.

for instance turn

if (x) { goto  y; }
blah..
baz..
JMP_(foo)

into

if (x) { goto  y; } else {
blah..
baz..
JMP_(foo)
}

extending the else to after the next jump or goto.


* getting stack dereferences out of your loops.


manually performing the first two optimizations mentioned above we get:

EI_(Main_zdwfac_info);
FN_(Main_zdwfac_entry) {
W_ _s27g;
W_ _s27i;
W_ _s27l;
FB_
fac_entry:
_s27g = *Sp;
if (_s27g != 0x1UL) { goto _c282; } else { R1.p = (P_)(Sp[1]); Sp=Sp+2; 
JMP_(*Sp); }
_c282:
_s27l = _s27g * (Sp[1]);
_s27i = _s27g - 0x1UL;
Sp[1] = _s27l;
*Sp = _s27i;
goto fac_entry;
FE_
}

and this produces the assembly:

Main_zdwfac_info:
.text
        .align 8
        .text
        movq    %rbp, %rdx
        movq    (%rbp), %rcx
        cmpq    $1, %rcx
        je      .L3
.L6:
        movq    %rcx, %rax
        imulq   8(%rdx), %rax
        movq    %rax, 8(%rdx)
        leaq    -1(%rcx), %rax
        movq    %rax, (%rbp)
.L4:
        movq    %rbp, %rdx
        movq    %rax, %rcx
        cmpq    $1, %rax
        jne     .L6
.L3:
        movq    8(%rdx), %r13
        leaq    16(%rdx), %rbp
        jmp     *(%rbp)

we still have some unnecesarry memory accesses, but the indirect jump and the 
spurious jump are gone and we have less instructions in the main loop making 
this code noticably faster.

in order to get rid of the unessesary memory accesess, we need to either

1. fully convert it to use C control constructs, so gcc will do it for us.
(code motion and loop invarient being inhibited again by gotos)

or

2. do it ourselves by analyzing when the consumer of what we are putting on the 
stack is ourselves.

the first is greatly preferable but not always possible.


These should be straightforward to implement in the C code generator. it also 
suggests we might want to try to use the native C calling convention on leaf 
nodes that deal with unboxed values (so we get register passing and return 
values for free) or taking groups of mutually recursive functions and turning 
them all into one function with explicit jumps between them.



to show they are actually optimizing the same function, here are both of their 
core representaions, other than syntax, they are identical:

jhc core: (uses unicode, utf8 formatted)

   address@hidden → int → int = λx9216∷int.λx9222∷int.(case x9216 of
            1 → x9222 ;
            x9238∷case <(int)-(int,int) x9216 1∷int> of
                    x9252∷int → case <(int)*(int,int) x9216 x9222∷int> of
                        x34∷int → case address@hidden x9252 x34 of
                            x9208∷int → x9208;;;;;)

ghc core:

  %rec
  {zdwfac :: GHCziPrim.Intzh -> GHCziPrim.Intzh -> GHCziPrim.Intzh =
     \ (ww::GHCziPrim.Intzh) (ww1::GHCziPrim.Intzh) ->
         %case (GHCziPrim.Intzh) ww %of (ds::GHCziPrim.Intzh)
           {%_ ->
              Main.zdwfac (GHCziPrim.zmzh ds (1::GHCziPrim.Intzh))
              (GHCziPrim.ztzh ds ww1);
            (1::GHCziPrim.Intzh) ->
              ww1}};



some random notes:

the 3x-7x factor was tested on an i386, on x86_64 the margin is much much 
greater for reasons that are still unclear.

while testing this I noticed that jhc and ghc have dramatically different 
results on x86_64 pretty much across the board, if their programs take 
comparable amounts of time on i386, the jhc one will run twice as fast as the 
ghc one on x86_64. I think ghc must be tickling the x86_64 in a particularly 
bad way for this dramatic of an effect to be observed. I will poke around the 
generated assembly of ghc some more and see if I can find the culprit.

(this is also online at:  http://repetae.net/jhc-vs-ghc-assembly.txt )

        John

--
John Meacham - ⑆repetae.net⑆john⑈
_______________________________________________
Glasgow-haskell-users mailing list
address@hidden
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users

reply via email to

[Prev in Thread] Current Thread [Next in Thread]