gcl-devel
Re: [Gcl-devel] Boyer benchmark results


From: Vadim V. Zhytnikov
Subject: Re: [Gcl-devel] Boyer benchmark results
Date: Fri, 25 Jun 2004 07:45:29 +0400
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; ru-RU; rv:1.6) Gecko/20040407

Michael Koehne writes:
Moin Vadim V. Zhytnikov,


It looks like Pentium 4 (I mean 2.4GHz Xeon) doesn't
like some CMUCL tests.


  P4 (i786 generation) is crippled by design :(

  - small L1, no L3 data cache (Xeon is better in cache)
  - trace cache can only feed 3 micro ops per clock cycle, but P4 has
    7 hot and expensive execution units, which could process 9 micro
    ops per clock cycle.
  - decoder is crippled
    - only one decode per clock cycle (not the complex + simple
      grouping of 4+1+1 in PII&PIII-686, or U+V in Pentium-586)
    - on average 21 clock cycles to decode 64 bytes if the code is not
      in the trace cache, on top of a normal L1 cache miss ;(
  - wrong choice of execution units
    - see above - only 3 micro ops (of 9) are fed to 7 units
    - shift/rotate is slow compared to the old barrel shifter
    - partial registers are accessed using the slow shift/rotate unit
  - The main compiler claiming to do P4 optimisations is non-free and
    unable to compile most free software.

  As a result, instructions take many more clock cycles, and the CPU
  often stalls on an L1 or trace cache miss. PIII or PII optimised code
  is guaranteed to stall a P4 CPU most of the time.
  But never mind - customers buy MHz ;)
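The figures quoted above can be turned into two rough ratios. This is only back-of-the-envelope arithmetic on the numbers given in the message (3 of 9 micro ops per cycle, 21 cycles per 64 bytes), not a measurement; the variable names are made up for illustration.

```python
# Numbers taken from the message above; names are illustrative only.
fed_per_cycle = 3      # micro ops the trace cache can deliver per clock
peak_per_cycle = 9     # micro ops the execution units could accept per clock
decode_bytes = 64      # bytes decoded on a trace cache miss
decode_cycles = 21     # average clocks quoted for decoding them

# Fraction of the execution units' peak capacity that can be fed at all.
utilisation = fed_per_cycle / peak_per_cycle

# Decode throughput when the code is not in the trace cache.
bytes_per_cycle = decode_bytes / decode_cycles

print(f"peak utilisation: {utilisation:.0%}")          # 33%
print(f"decode rate: {bytes_per_cycle:.2f} bytes/clock")  # 3.05
```

So, taking those figures at face value, the execution units can be fed to at most about a third of their capacity.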


Yes, this is why I don't have a P4 on my desk and I never will.
I have a small set of tests for Standard Lisp+Reduce.
It is not as serious and comprehensive as the Gabriel test suite, but
I collect results purely for curiosity's sake and have figures for
very different hardware, starting from VAX mainframes and 386
machines and up to modern ones.  Mostly the test results
scale quite well, with only one exception - on P4 some
tests look OK but others are very slow - 2-3 times
slower than one would expect.
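One way to spot such outliers is to compare per-test timing ratios against the typical ratio for the machine. The sketch below assumes timings are kept as name-to-seconds tables. The function name, the threshold of 2 and all the numbers are hypothetical and chosen only to illustrate the idea; they are not real benchmark results.

```python
from statistics import median

def flag_slow_tests(reference, target, threshold=2.0):
    """Return tests on `target` that are at least `threshold` times slower
    than the machine's typical (median) scaling against `reference`."""
    # Per-test slowdown (or speedup) of the target relative to the reference.
    ratios = {name: target[name] / reference[name] for name in reference}
    typical = median(ratios.values())
    # Keep tests whose ratio is far above the typical one.
    return {name: r / typical for name, r in ratios.items()
            if r / typical >= threshold}

# Hypothetical timings in seconds, invented for this example.
reference = {"t1": 10.0, "t2": 20.0, "t3": 5.0, "t4": 8.0, "t5": 12.0}
target    = {"t1": 5.0,  "t2": 10.0, "t3": 2.5, "t4": 10.0, "t5": 6.0}

print(flag_slow_tests(reference, target))  # {'t4': 2.5}
```

With these made-up numbers the target is twice as fast on most tests, but t4 comes out 2.5 times slower than that scaling predicts.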


BTW, on Athlon I've tried -march=i586 vs -march=athlon
but found no difference.


  the Athlon does a good job of eating shit ;( meaning - it has 3
  decoders for complex instructions, which don't care about 4+1+1 or U+V
  pairing and feed as many micro ops to the execution units as they
  can handle with register renaming and out of order execution - the
  3rd floating point unit is also much better than those MMX gaming
  gimmicks the P4 has, as it's automatically used in normal programs.
  There is therefore only a small difference between differently
  optimised code -

  because AMD does a good job of executing code that has been
  optimised for the old wrong 486, 586 and 686 platforms ;)

  486 optimisations :
  - small loops
  - avoid 32 bit registers
  - working core <8kb cache
  - use simple instructions
  586 optimisations (for Pentium and Pentium MMX)
  - U+V pairing of one complex and one simple instruction, and the simple
    one doesn't cost a clock cycle
  686 optimisations (for PII & PIII)
  - 4+1+1 pairing of one complex and two simple instructions, so the
    two simple 1 micro op instructions cost no clock cycle. 686 is able
    to run at 5/6% utilisation with 586 optimised code.
  786 optimisations (for P4)
  - 3 micro ops per clock is full speed (sorry - P4 is that slow)
  - avoid a cache miss (impossible on a multi tasking system)
  AMD optimisations
  - don't care about it
  - AMD has 3 full decoders to do the optimisation in hardware.
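The effect of the decoder layouts described above can be shown with a toy model. It assumes that in a 4+1+1 decoder the first slot takes any instruction of up to 4 micro ops while the other two slots take only 1 micro op instructions, and that decoding is strictly in order. The function names and the instruction streams are hypothetical, and real decoders have many more constraints than this.

```python
def cycles_4_1_1(uops):
    """Clocks to decode a stream under a toy 4+1+1 template.
    `uops` lists the micro op count (1..4) of each instruction in order."""
    if any(u < 1 or u > 4 for u in uops):
        raise ValueError("toy model only handles 1..4 micro ops")
    cycles, i, n = 0, 0, len(uops)
    while i < n:
        cycles += 1
        i += 1                      # slot 0 accepts any instruction
        for _ in range(2):          # slots 1 and 2 accept 1-uop only
            if i < n and uops[i] == 1:
                i += 1
            else:
                break               # in-order: a complex one must wait
    return cycles

def cycles_full(uops, width):
    """Clocks for `width` full decoders that accept any instruction
    (width=1 models the single decoder, width=3 the three full ones)."""
    return -(-len(uops) // width)   # ceiling division

paired = [2, 1, 1] * 4    # stream scheduled for 4+1+1
unpaired = [2] * 6        # only complex instructions, no pairing possible

for name, stream in (("paired", paired), ("unpaired", unpaired)):
    print(name, cycles_4_1_1(stream), cycles_full(stream, 1),
          cycles_full(stream, 3))
```

In this model the 4+1+1 decoder only matches three full decoders when the code is scheduled to fit the template, which is the point made above: with full decoders, the scheduling matters much less.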

  Have you tried -march=i686 to use 4+1+1 instead of U+V pairing?

Bye Michael

I'll try.

--
     Vadim V. Zhytnikov

     <address@hidden>



