Re: [Gcl-devel] FW: jhc vs ghc and the surprising result involving ghc g

gcl-devel
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Gcl-devel] FW: jhc vs ghc and the surprising result involving ghc g

From:	Camm Maguire
Subject:	Re: [Gcl-devel] FW: jhc vs ghc and the surprising result involving ghc generatedassembly.
Date:	28 Oct 2005 00:49:42 -0400
User-agent:	Gnus/5.09 (Gnus v5.9.0) Emacs/21.2
Greetings, Mike and thanks so much for this very relevant information!

We should definitely keep an eye on this.  If I recall, I've done a
little testing myself in this regard and couldn't find much penalty in
our copious use of labels and gotos.

Tried the fac example for example:

=============================================================================
(defun fac (n r)

  (case n
        (1 r)
        (otherwise (fac (1- n) (* n r)))))

FAC

>(proclaim '(ftype (function (fixnum fixnum) fixnum) fac))

NIL

>(let ((compiler::*opt-three* "-O3")) (disassemble 'fac ))

;; Compiling gazonk2.lsp.
;; End of Pass 1.  
;; End of Pass 2.  
;; OPTIMIZE levels: Safety=0 (No runtime error checking), Space=0, Speed=3, 
(Debug quality ignored)
;; Finished compiling gazonk2.o.

#include "gazonk2.h"
void init_code(){do_init(VV);}
/*      local entry for function FAC    */

static fixnum LI1__FAC___t1_camm_debian_gcl_tmp_tmp_foo_gazonk2(V3,V4)

register fixnum V3;register fixnum V4;
{        VMB1 VMS1 VMV1
        goto TTL;
TTL:;switch(V3){
        case 1:
        goto T2;
T2:;
        {fixnum V5 = V4;
        VMR1(V5)}
        default:
        goto T3;
T3:;
        {fixnum V6;
        V6= (fixnum)(V3)-((fixnum)1);
        V4= (fixnum)(V3)*(V4);
        V3= V6;}
        goto TTL;
        {fixnum V7 = fix(Cnil);
        VMR1(V7)}}
        {fixnum V8 = fix(Cnil);
        VMR1(V8)}
}
       
#(
#((system::%init . #((system::mfsfun (lisp::quote common-lisp-user::fac) 0 
20738))))
)

static fixnum LI1__FAC___t1_camm_debian_gcl_tmp_tmp_foo_gazonk2();
#define VMB1
#define VMS1
#define VMV1
#define VMR1(VMT1) return(VMT1);
#define VM1 0
static char * VVi[1]={
#define Cdata VV[0]
(char *)(LI1__FAC___t1_camm_debian_gcl_tmp_tmp_foo_gazonk2)
};
#define VV ((object *)VVi)

gazonk2.o:     file format elf32-i386

Disassembly of section .text:

00000000 <init_code>:
init_code():
/fix/t1/camm/debian/gcl/tmp/tmp/foo/gazonk2.c:5686
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   83 ec 14                sub    $0x14,%esp
   6:   68 00 00 00 00          push   $0x0
   b:   e8 fc ff ff ff          call   c <init_code+0xc>
  10:   83 c4 10                add    $0x10,%esp
  13:   c9                      leave  
  14:   c3                      ret    
  15:   8d 74 26 00             lea    0x0(%esi),%esi
  19:   8d bc 27 00 00 00 00    lea    0x0(%edi),%edi

00000020 <LI1__FAC___t1_camm_debian_gcl_tmp_tmp_foo_gazonk2>:
LI1__FAC___t1_camm_debian_gcl_tmp_tmp_foo_gazonk2():
/fix/t1/camm/debian/gcl/tmp/tmp/foo/gazonk2.c:5692
  20:   55                      push   %ebp
  21:   89 e5                   mov    %esp,%ebp
  23:   8b 55 08                mov    0x8(%ebp),%edx
  26:   8b 45 0c                mov    0xc(%ebp),%eax
/fix/t1/camm/debian/gcl/tmp/tmp/foo/gazonk2.c:5694
  29:   83 fa 01                cmp    $0x1,%edx
  2c:   74 0b                   je     39 
<LI1__FAC___t1_camm_debian_gcl_tmp_tmp_foo_gazonk2+0x19>
  2e:   89 f6                   mov    %esi,%esi
/fix/t1/camm/debian/gcl/tmp/tmp/foo/gazonk2.c:5705
  30:   0f af c2                imul   %edx,%eax
  33:   4a                      dec    %edx
/fix/t1/camm/debian/gcl/tmp/tmp/foo/gazonk2.c:5694
  34:   83 fa 01                cmp    $0x1,%edx
  37:   75 f7                   jne    30 
<LI1__FAC___t1_camm_debian_gcl_tmp_tmp_foo_gazonk2+0x10>
/fix/t1/camm/debian/gcl/tmp/tmp/foo/gazonk2.c:5712
  39:   5d                      pop    %ebp
  3a:   c3                      ret    
NIL

>=============================================================================

This sample is surely not conclusive, but you see we are getting the
full jhc optimization even though our compiler rewrites the tail call
to a goto.  Haven't yet seen how critical the switch statement is
here to gcc's output, but in more complex cases switch is clearly the
way to go, and fortunately gcl's compiler writes integer case
statements as explicit switch statements.  Needs to be extended in
straightforward fashion to character cases too.  On my todo list.

Also a bit interesting that gcc duplicated the comparison at the end
of the block, conceivably as an optimization to avoid the jump to top
if possible.

Take care,

"Mike Thomas" <address@hidden> writes:

> Camm,
> 
> FYI regarding code generation and helping gcc to do a good job on code 
> generated by GCL. It remains to be seen what further debate on this issue 
> turns up but the basics principle of emitting higher level C flow control 
> constructs rather than gotos and labels is probably worth noting.
> 
> Cheers
> 
> Mike Thomas
> 
> -----Original Message-----
> From: address@hidden [mailto:address@hidden On Behalf Of John Meacham
> Sent: Wednesday, 26 October 2005 7:44 PM
> To: address@hidden
> Subject: jhc vs ghc and the surprising result involving ghc generatedassembly.
> 
> 
> (I apologize in advance if this message seems self congradulatory, but after 
> a long time of being disheartened by jhc only having marginal gains over ghc, 
> I am finally seeing some substantial benefits, many of which are the result 
> of optimizations that can actually be ported back to ghc)
> 
> 
> So, what started out as a simple pat on my own back for getting strictness 
> and CPR finally working turned into an adventure in assembly language.  An 
> incidental comparason between jhc and ghc's output completly surprised me, 
> for tight mathematical inner loops jhc is producing code that runs 3-7x 
> faster than ghc.
> 
> This surprised me because I expected them to be identical, they determine the 
> same strictness and produce the same worker/wrapper split, but due to a quirk 
> of how ghc generates C code, it ends up producing an ultimatly slower result.
> fortunatly, I think the problem is easy to mitigate. (and jhc loses its lead 
> once again :) )
> 
> (all the following snippets are unedited/unreformatted output of the programs 
> specified except for the removal of some profiling tags)
> 
> 
> the motivating example is the plain old basic accumulating factorial function:
> 
> fac :: Int -> Int -> Int
> fac 1 r = r
> fac n r = fac (n - 1) (n*r)
> 
> the strictness info is correctly infered that fac is strict in all its 
> arguments and returns a CAF so it is translated into a nice and speedy 
> version that takes unboxed arguments and returns unboxed results.
> 
> here is the C code that jhc generates. (As an aside, I am very proud of how 
> readable and how much structure the jhc generated C code preserves of the 
> original haskell. it's a small thing, perhaps only implementors appreciate 
> it, but I am glad I spent the time needed to do so.)
> 
> /* address@hidden */
> static int
> fWXAXDfMainXDfac(int v94, int v95)
> {
>         int v96;
>         int v97;
>         int v98;
>         int v99;
>         switch (v94) {
>         case 1:
>             return v95;
>             break;
>         default:
>             v96 = v94;
>             v97 = (int)(v94 - 1);
>             v98 = (int)(v94 * v95);
>             v99 = fWXAXDfMainXDfac(v97, v98);
>             return v99;
>         break;
>         }
> }
> 
> notice that besides being a bit verbose and using a tailcall, this is exactly 
> what A C programmer would write. In fact, the generated assembly is quite 
> nice and I think perhaps provably optimal even compared to hand-coded 
> assembly :)
> 
> jhc assembly output:
> 
> fWXAXDfMainXDfac:
> .L108:
>       cmpl    $1, %edi
>       je      .L110
>       imull   %edi, %esi
>       decl    %edi
>       jmp     .L108
> .L110:
>       movl    %esi, %eax
>       ret
> 
> 
> notice, a tiny inner loop, 4 instructions and one conditional jump. no memory 
> acceses whatsoever.
> 
> 
> now, lets look at what ghc does with the same function:
> 
> the generated C code:
> 
> EI_(Main_zdwfac_info);
> FN_(Main_zdwfac_entry) {
> W_ _s27g;
> W_ _s27i;
> W_ _s27l;
> FB_
> _s27g = *Sp;
> if (_s27g != 0x1UL) goto _c282;
> R1.p = (P_)(Sp[1]);
> Sp=Sp+2;
> JMP_(*Sp);
> _c282:
> _s27l = _s27g * (Sp[1]);
> _s27i = _s27g - 0x1UL;
> Sp[1] = _s27l;
> *Sp = _s27i;
> JMP_((W_)&Main_zdwfac_info);
> FE_
> }
> 
> it looks complicated, but what it effectivly does is pop the arguments off 
> the stack, run the code but with an explicit jump rather than recursion and 
> push the result back onto the stack. other than the stack stuff at the 
> beginnig and end, we would expect this to get compiled to roughly the same 
> assembly as the jhc version with a nice tight inner loop and just some stack 
> futzing boilerplate.
> 
> and now the generated assembly.
> 
> Main_zdwfac_info:
> .text
>       .align 8
>       .text
>       movq    (%rbp), %rdx
>       cmpq    $1, %rdx
>       jne     .L2
>       movq    8(%rbp), %r13
>       leaq    16(%rbp), %rbp
>       movq    (%rbp), %rax
> .L4:
>       jmp     *%rax
> .L2:
>       movq    %rdx, %rax
>       imulq   8(%rbp), %rax
>       movq    %rax, 8(%rbp)
>       leaq    -1(%rdx), %rax
>       movq    %rax, (%rbp)
>       movl    $Main_zdwfac_info, %eax
>       jmp     .L4
> 
> 
> ack! lets count what happens on each iteration of the loop: we have 5 (!) 
> memory acceses and two jumps, one of them being indirect! in fact, each time 
> through the loop, it loads the same values into the same registers.
> 
> 
> this is really bad compared to the jhc assembly. the basic issue is an 
> interaction between ghc's use of global registers, a stack, and indirect 
> calls.
> 
> an indirect jump is very expensive. modern processors are pipelined, having 
> many instructions in the queue at once, looking ahead and beginning to 
> evaluate what is coming up. when an indirect jump is encountered, the CPU has 
> no choice but to flush the whole pipeline because it has no idea where it 
> goes, even with conditional branches cpus can predict with good accuracy 
> which branch will be taken and at worst, it discards the pipeline, but with 
> an indirect jump, there is no chance of it having a full pipeline.
> 
> 
> However the major issue is the following.  %rbp is the global stack pointer 
> pointing to the STG stack, since it is global it can be modified from 
> anywhere, since gcc can't know if the function it jumps to modified said 
> register it has no choice but to load all values relative to it every time 
> through the loop because it may have changed.
> 
> furthermore gotos and labels are very problematic for gcc to optimize around.
> for various tiresome reasons gcc cannot perform (most) code motion 
> optimizations across explicit labels and gotos, especially when they deal 
> with the global register variables and memory stores and loads. since these 
> are arguably some of the most important optimizations, this is quite bad for 
> the generated assembly.
> 
> loop:
> 
> if () goto loop;
> 
> is not equivalent to a do-while loop, loop invarients cannot be hoisted out 
> of the above for instance (except in some cases... it is all quite tricky and 
> we want gcc to have as much freedom as possible).
> 
> all in all, this makes the code generated by gcc compiling something 
> generated by ghc not very good at all.
> 
> there are a couple of things we can do to mitigate these problems:
> 
> get rid of indirect jumps whenever possible.
> 
> use C control constructs rather than gotos.
> 
> 
> A couple simple rules seem to help greatly.
> 
> * turn anything of the form JMP_((W_)&self) where self is oneself into a goto 
> that gotos a label at the beginning of the function.
> 
> * do simple pattern matthing on the basic blocks to recognize where C control 
> constructs can be placed.
> 
> for instance turn
> 
> if (x) { goto  y; }
> blah..
> baz..
> JMP_(foo)
> 
> into
> 
> if (x) { goto  y; } else {
> blah..
> baz..
> JMP_(foo)
> }
> 
> extending the else to after the next jump or goto.
> 
> 
> * getting stack dereferences out of your loops.
> 
> 
> manually performing the first two optimizations mentioned above we get:
> 
> EI_(Main_zdwfac_info);
> FN_(Main_zdwfac_entry) {
> W_ _s27g;
> W_ _s27i;
> W_ _s27l;
> FB_
> fac_entry:
> _s27g = *Sp;
> if (_s27g != 0x1UL) { goto _c282; } else { R1.p = (P_)(Sp[1]); Sp=Sp+2; 
> JMP_(*Sp); }
> _c282:
> _s27l = _s27g * (Sp[1]);
> _s27i = _s27g - 0x1UL;
> Sp[1] = _s27l;
> *Sp = _s27i;
> goto fac_entry;
> FE_
> }
> 
> and this produces the assembly:
> 
> Main_zdwfac_info:
> .text
>       .align 8
>       .text
>       movq    %rbp, %rdx
>       movq    (%rbp), %rcx
>       cmpq    $1, %rcx
>       je      .L3
> .L6:
>       movq    %rcx, %rax
>       imulq   8(%rdx), %rax
>       movq    %rax, 8(%rdx)
>       leaq    -1(%rcx), %rax
>       movq    %rax, (%rbp)
> .L4:
>       movq    %rbp, %rdx
>       movq    %rax, %rcx
>       cmpq    $1, %rax
>       jne     .L6
> .L3:
>       movq    8(%rdx), %r13
>       leaq    16(%rdx), %rbp
>       jmp     *(%rbp)
> 
> we still have some unnecesarry memory accesses, but the indirect jump and the 
> spurious jump are gone and we have less instructions in the main loop making 
> this code noticably faster.
> 
> in order to get rid of the unessesary memory accesess, we need to either
> 
> 1. fully convert it to use C control constructs, so gcc will do it for us.
> (code motion and loop invarient being inhibited again by gotos)
> 
> or
> 
> 2. do it ourselves by analyzing when the consumer of what we are putting on 
> the stack is ourselves.
> 
> the first is greatly preferable but not always possible.
> 
> 
> These should be straightforward to implement in the C code generator. it also 
> suggests we might want to try to use the native C calling convention on leaf 
> nodes that deal with unboxed values (so we get register passing and return 
> values for free) or taking groups of mutually recursive functions and turning 
> them all into one function with explicit jumps between them.
> 
> 
> 
> to show they are actually optimizing the same function, here are both of 
> their core representaions, other than syntax, they are identical:
> 
> jhc core: (uses unicode, utf8 formatted)
> 
>    address@hidden → int → int = λx9216∷int.λx9222∷int.(case x9216 of
>             1 → x9222 ;
>             x9238∷case <(int)-(int,int) x9216 1∷int> of
>                     x9252∷int → case <(int)*(int,int) x9216 x9222∷int> of
>                         x34∷int → case address@hidden x9252 x34 of
>                             x9208∷int → x9208;;;;;)
> 
> ghc core:
> 
>   %rec
>   {zdwfac :: GHCziPrim.Intzh -> GHCziPrim.Intzh -> GHCziPrim.Intzh =
>      \ (ww::GHCziPrim.Intzh) (ww1::GHCziPrim.Intzh) ->
>        %case (GHCziPrim.Intzh) ww %of (ds::GHCziPrim.Intzh)
>          {%_ ->
>             Main.zdwfac (GHCziPrim.zmzh ds (1::GHCziPrim.Intzh))
>             (GHCziPrim.ztzh ds ww1);
>           (1::GHCziPrim.Intzh) ->
>             ww1}};
> 
> 
> 
> some random notes:
> 
> the 3x-7x factor was tested on an i386, on x86_64 the margin is much much 
> greater for reasons that are still unclear.
> 
> while testing this I noticed that jhc and ghc have dramatically different 
> results on x86_64 pretty much across the board, if their programs take 
> comparable amounts of time on i386, the jhc one will run twice as fast as the 
> ghc one on x86_64. I think ghc must be tickling the x86_64 in a particularly 
> bad way for this dramatic of an effect to be observed. I will poke around the 
> generated assembly of ghc some more and see if I can find the culprit.
> 
> (this is also online at:  http://repetae.net/jhc-vs-ghc-assembly.txt )
> 
>         John
> 
> --
> John Meacham - ⑆repetae.net⑆john⑈
> _______________________________________________
> Glasgow-haskell-users mailing list
> address@hidden
> http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
> _______________________________________________
> Gcl-devel mailing list
> address@hidden
> http://lists.gnu.org/mailman/listinfo/gcl-devel

-- 
Camm Maguire                                            address@hidden
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah
[Prev in Thread]
Current Thread
[Next in Thread]
[Gcl-devel] FW: jhc vs ghc and the surprising result involving ghc generatedassembly., Mike Thomas, 2005/10/26
- Re: [Gcl-devel] FW: jhc vs ghc and the surprising result involving ghc generatedassembly., Camm Maguire <=
Prev by Date: [Gcl-devel] Re: Odd idea...
Next by Date: Re: [Gcl-devel] trying to make LISA working with GCL
Previous by thread: [Gcl-devel] FW: jhc vs ghc and the surprising result involving ghc generatedassembly.
Next by thread: [Gcl-devel] Re: too much static?
Index(es):
- Date
- Thread