Hi Vadim V. Zhytnikov,
It looks like Pentium 4 (I mean 2.4GHz Xeon) doesn't
like some CMUCL tests.
P4 (i786 generation) is crippled by design :(
- small L1 and no L3 data cache (the Xeon is better on cache)
- trace cache can only feed 3 micro ops per clock cycle, but the P4 has
7 hot and expensive execution units, which could process 9 micro
ops per clock cycle.
- decoder is crippled
  - only one decode per clock cycle (not the complex + simple
    grouping of 4+1+1 in the PII & PIII 686, or U+V in the Pentium 586)
  - on average 21 clock cycles to decode 64 bytes if the code is not
    in the trace cache, on top of a normal L1 cache miss ;(
- wrong choice of execution units
  - see above - only 3 micro ops (of 9) are fed to 7 units
  - shift/rotate is slow compared to the old barrel shifter
  - partial registers are accessed using the slow shift/rotate unit
- The main compiler claiming to do P4 optimisations is non-free and
unable to compile most free software.
As a result, instructions take many more clock cycles, and the CPU often
stalls on an L1 or trace cache miss. PII- or PIII-optimised code is guaranteed
to stall a P4 CPU most of the time.
But never mind - customers buy MHz ;)
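To put rough numbers on the figures above, here is a small Python sketch. It only restates the arithmetic from the list (3 micro ops fed per clock against the 9 the units could process, and about 21 clock cycles per 64 bytes decoded). The function names are made up for illustration, and this is in no way a model of the real pipeline.

```python
from fractions import Fraction

# Figures taken from the list above; everything else is illustrative.
TRACE_CACHE_UOPS_PER_CYCLE = 3    # micro ops the trace cache can feed per clock
EXECUTION_UOPS_PER_CYCLE = 9      # micro ops the execution units could process per clock
DECODE_CYCLES_PER_BLOCK = 21      # average clock cycles to decode one block
DECODE_BLOCK_BYTES = 64           # block size in bytes


def peak_utilisation(fed=TRACE_CACHE_UOPS_PER_CYCLE,
                     capacity=EXECUTION_UOPS_PER_CYCLE):
    """Best-case fraction of execution capacity that can be kept busy."""
    return Fraction(fed, capacity)


def decode_cycles(code_bytes,
                  cycles_per_block=DECODE_CYCLES_PER_BLOCK,
                  block_bytes=DECODE_BLOCK_BYTES):
    """Average clock cycles spent decoding code that is not in the trace cache."""
    return code_bytes * cycles_per_block / block_bytes


print(peak_utilisation())    # best case: one third of the capacity
print(decode_cycles(1024))   # cycles to decode 1 KB of uncached code
```

So even in the best case only a third of the execution capacity is usable, and every kilobyte of code that falls out of the trace cache costs a few hundred clock cycles of decoding.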
BTW, on Athlon I've tried -march=i586 vs -march=athlon
but found no difference.
The Athlon does a good job of eating shit ;( Meaning: it has 3
decoders for complex instructions, which don't care about 4+1+1 or U+V
pairing and feed as many micro ops to the execution units as they
can handle with register renaming and out-of-order execution. The
3rd floating point unit is also much better than all those MMX gaming
gimmicks the P4 has, as it is used automatically in normal programs.
So there is only a small difference between differently optimised code,
because AMD does a good job of executing code that was
optimised for the old, wrong 486, 586 and 686 platforms ;)
486 optimisations
- small loops
- avoid 32 bit registers
- working core < 8 KB (fits in the cache)
- use simple instructions
586 optimisations (for Pentium and Pentium MMX)
- U+V pairing of one complex and one simple instruction, so the simple
one doesn't cost a clock cycle
686 optimisations (for PII & PIII)
- 4+1+1 pairing of one complex and two simple instructions, so the
two simple 1-micro-op instructions cost no clock cycle. The 686 is able
to run at 5/6 utilisation with 586-optimised code.
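A toy model of the 4+1+1 rule, as a hedged Python sketch: each clock the first decoder takes one instruction of up to 4 micro ops, and the other two decoders take one single-micro-op instruction each, in program order. The function names, and the assumption that a "complex" instruction is exactly 4 micro ops, are mine and purely for illustration; the real decoders have more restrictions than this. Under those assumptions, 586-style complex+simple pairs fill 5 of the 6 decode slots.

```python
from fractions import Fraction


def decode_cycles(uops, template=(4, 1, 1)):
    """Count clock cycles a greedy in-order decoder group needs.

    uops     -- micro ops per instruction, in program order
    template -- max micro ops each decoder accepts (4+1+1 by default)
    """
    if any(n < 1 or n > template[0] for n in uops):
        raise ValueError("instruction does not fit the first decoder in this toy model")
    cycles = 0
    i = 0
    while i < len(uops):
        cycles += 1
        for limit in template:
            # A decoder takes the next instruction only if it fits;
            # otherwise the rest of the group stays idle this cycle.
            if i < len(uops) and uops[i] <= limit:
                i += 1
            else:
                break
    return cycles


def utilisation(uops, template=(4, 1, 1)):
    """Fraction of the available decode slots that were actually filled."""
    if not uops:
        return Fraction(0)
    return Fraction(sum(uops), decode_cycles(uops, template) * sum(template))


pairs_586 = [4, 1] * 10        # complex + simple, as tuned for U+V pairing
triples_686 = [4, 1, 1] * 10   # complex + two simple, as tuned for 4+1+1

print(utilisation(pairs_586))
print(utilisation(triples_686))
```

The same function also shows why ordering matters: a simple instruction followed by a complex one (`[1, 4]`) takes two cycles, because the complex instruction has to wait for the first decoder.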
786 optimisations (for P4)
- 3 micro ops per clock is full speed (sorry - the P4 is that slow)
- avoid cache misses (impossible on a multitasking system)
AMD optimisations
- don't bother
- AMD has 3 full decoders to do the optimisation in hardware.
Have you tried -march=i686 to use 4+1+1 instead of U+V pairing?
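In case it helps with trying that, here is a hedged Python sketch that builds one gcc command line per -march value, so the resulting binaries can be timed side by side. The source file name, the output naming and the use of -O2 are assumptions for illustration. The -march values i586, i686, athlon and pentium4 are real GCC x86 options; a 64-bit host would additionally need -m32.

```python
import subprocess
import time


def march_commands(source, marches=("i586", "i686", "athlon", "pentium4")):
    """Return one gcc argv list per -march value (nothing is executed here)."""
    return [
        ["gcc", "-O2", f"-march={m}", source, "-o", f"bench-{m}"]
        for m in marches
    ]


def best_time(argv, runs=5):
    """Run a command several times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True)
        best = min(best, time.perf_counter() - start)
    return best
```

With a hypothetical test file bench.c, running each entry of `march_commands("bench.c")` through `subprocess.run(cmd, check=True)` and then calling `best_time(["./bench-i686"])` and so on for each binary would give comparable numbers. Taking the fastest of several runs keeps other tasks on the machine from skewing the result too much.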
Bye Michael