Re: [Tinycc-devel] Huge swings in cache performance

From: Christian Jullien
Subject: Re: [Tinycc-devel] Huge swings in cache performance
Date: Wed, 21 Dec 2016 08:06:50 +0100

Recent performance boost (for me).

My OpenLisp Lisp has 3 execution modes (see

M1 - Interpreter
M2 - Compiler to a VM which is then interpreted
M3 - Compiler to C.

Not a surprise, tcc is generally the slowest of all C compilers I know,
somewhere between 1.5x and 2.0x slower, which is still great for the price!!
While M2 is supposed to be a little bit faster than M1 on all systems (the
purpose of a compiler), tcc held the world record at 20x to 30x SLOWER than
M3 is fair compared to VC++ or gcc (~1.8x slower).

With recent changes, my M2 benchmark goes from ~10s to ~1.9s, more than 5x
faster!! It now compares to other C compilers (M2 is faster than M1 - ~2.5


-----Original Message-----
From: Tinycc-devel [mailto:address@hidden]
On Behalf Of KHMan
Sent: Wednesday, 21 December 2016 06:26
To: address@hidden
Subject: Re: [Tinycc-devel] Huge swings in cache performance

On 12/21/2016 12:02 PM, David Mertens wrote:
> I forgot to mention: the function in question is a simple random 
> number generator. It only contains 32-bit integer math operations, and 
> does not contain any loops. For this benchmark, the looping occurs at 
> the Perl level, so alignment optimizations for looping would not be 
> important here. (This lets me compare many different Perl-to-C 
> function invocation approaches to assess their speed.)

Curiouser and curiouser. The icache misses really bother me. Is it just the
L1 caches that blew up, or are the L2 caches naughty as well? Would love to
hear more of your progress on this list...

> On Tue, Dec 20, 2016 at 10:44 PM, David Mertens wrote:
>     Discussion about alignment and execution speed for the Haskell
>     compiler: https://ghc.haskell.org/trac/ghc/ticket/8279
>
>     This discussion mentions why things should be aligned, and
>     gives some multi-byte no-ops that can be used for padding for
>     aligned loops.
>     http://stackoverflow.com/questions/18113995/performance-optimisations-of-x86-64-assembly-alignment-and-branch-prediction
>
>     I came across a similar issue a few weeks ago, but I was able
>     to "fix" it by allocating more memory than I needed and then
>     relocating to an address within that allocation that was
>     aligned to the start of a page. This seemed to fix the problem
>     back then, but this new flavor of alignment woes is impervious
>     to such a trick.
>
>     David
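
[Editor's sketch] The over-allocate-and-align trick described above could look roughly like the following. This is a minimal illustration, not code from C::Blocks or tcc: the helper names `align_up` and `alloc_aligned_within` are hypothetical, and the alignment (4096 for a page, 64 for a cache line) is passed in by the caller rather than queried from the system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round addr up to the next multiple of alignment.
 * alignment must be a nonzero power of two (e.g. 64 or 4096). */
static uintptr_t align_up(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical helper: allocate size + alignment bytes, then return the
 * first suitably aligned address inside that block. The original pointer
 * is stored in *raw_out; that is the one to hand to free(). */
static void *alloc_aligned_within(size_t size, size_t alignment, void **raw_out)
{
    void *raw = malloc(size + alignment);
    *raw_out = raw;
    if (raw == NULL)
        return NULL;
    return (void *)align_up((uintptr_t)raw, alignment);
}
```

The extra `alignment` bytes guarantee that at least `size` bytes remain after the pointer is rounded up. Note that this only controls where the code starts; it says nothing about whether the memory is executable, which has to be arranged separately.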
>     On Tue, Dec 20, 2016 at 10:29 PM, KHMan wrote:
>         On 12/20/2016 10:17 PM, David Mertens wrote:
>             Hello Kein-Hong,
>
>             I'm not convinced this is entirely an unpredictable
>             hardware issue. The reason is that I can easily create
>             similar functionality with gcc (the usual Perl XS module,
>             the normal means for writing a C-based extension) and it
>             does not show these kinds of cache swings. I think there
>             is something gcc does while producing its machine code
>             that makes it less susceptible to cache misses. (Well,
>             there are lots of things it does, I'm sure.) I'm hoping
>             there's one or two simple things that gcc does which tcc
>             misses and could implement.
>
>             Was the behavior observed with Lua noted when working
>             with JIT?
>
>         I couldn't find the old posting but it was along the lines
>         of benchmark variability due to memory layout, see
>         "Mytkowicz memory layout". IIRC, the discussion was about
>         a small benchmark Lua script running the interpreter, in
>         one posting, changing an environment variable changed the
>         program's total running time significantly, IIRC it was in
>         the 20-50% range. The timings were done casually and
>         nobody did detailed follow-up research.
>         ... which of course are the same executables and is
>         different from your case. Long day and all. But tcc is not
>         much of an optimizing compiler, if the change caused
>         register spilling in an inner loop it would hammer memory
>         access and account for at least some of the effects...
>             On Tue, Dec 20, 2016 at 9:05 AM, KHMan wrote:
>                  On 12/20/2016 9:16 PM, David Mertens wrote:
>                      Hello everyone,
>
>                      Reminder/Background: C::Blocks is my Perl wrapper
>                      around my fork of tcc with extended symbol table
>                      support.
>
>                      I've begun writing benchmarks to seriously test
>                      how C::Blocks compares with other JIT and JIT-ish
>                      options for Perl. I've noticed a couple of
>                      situations in which slight modifications to the
>                      code cause a huge drop in performance. One
>                      benchmark went from 370ms to 5,000ms (i.e. 5 sec).
>
>                      The change to the code was so slight that I
>                      immediately suspected cache misses as the culprit.
>                      Running with linux's "perf" command gave proof of
>                      that (hopefully this formats properly with
>                      fixed-width characters):
>
>                                      Fast    Slow  Significant
>                      time (ms)        370    5022  **
>                      instructions    3.5B    3.5B
>                      branches        640M    650M
>                      branch-miss     687k    671k
>                      dcache-miss     974k     71M  **
>                      icache-miss     3.2M     83M  **
>
>                      By dcache-miss I refer to what perf calls "L1
>                      dcache load miss", and by icache-miss I refer to
>                      what perf calls "L1 icache load miss".
>
>                      I'm a bit confused on what would cause this sort
>                      of persistent cache miss behavior. In particular,
>                      I've tried working with highly distinct strategies
>                      for managing executable memory, including ensuring
>                      page alignment (wrong: it should be line-width
>                      alignment of 64 bytes). This fixed a similar issue
>                      previously observed, but didn't seem to improve
>                      the situation here. I used malloc instead of
>                      Perl's built-in memory allocator. I created a pool
>                      for executable memory so that multiple chunks of
>                      executable code would all be written to the same
>                      page in memory. EVEN THIS did not fix this issue,
>                      which really surprised me since I would have
>                      thought adjacent memory would hash to different
>                      caches.
>
>                      I believe that what I've found is an issue with
>                      tcc, but I haven't golfed it down to a simple
>                      libtcc-consuming example. I can do that, but
>                      wanted to see if anybody could think of an obvious
>                      cause, and fix, without going to such lengths. If
>                      not, I will see if I can write a small
>                      reproducible example.
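
[Editor's sketch] The expectation that adjacent memory lands in different parts of the cache can be illustrated with the usual set-index arithmetic for a set-associative cache. The geometry below (64-byte lines, 64 sets, as in a 32 KiB 8-way L1) is an assumption chosen for illustration; nothing in the thread states the geometry of the machine that was measured, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed L1 geometry: 32 KiB, 8-way, 64-byte lines -> 64 sets.
 * These numbers are illustrative, not measured. */
#define LINE_SIZE 64u
#define NUM_SETS  64u

/* Index of the cache set an address maps to under the assumed geometry. */
static unsigned cache_set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

/* True if two addresses fall within the same cache line. */
static int same_cache_line(uintptr_t a, uintptr_t b)
{
    return a / LINE_SIZE == b / LINE_SIZE;
}
```

Under these assumptions, consecutive 64-byte lines map to consecutive sets, while addresses exactly LINE_SIZE * NUM_SETS = 4096 bytes apart map to the same set and compete for its 8 ways. Code packed into one page therefore should not conflict with itself, which is consistent with the surprise expressed above.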
>                  This kind of behaviour was discussed on the Lua list
>                  not long ago. IIRC, for example, changing environment
>                  variables changed the way a program is loaded, and
>                  the timing changed. Probably cache behaviour. It's
>                  like, what can we really benchmark anymore?
>
>                  When modern GHz parts have cache misses and need to
>                  access main memory, they cause such train wrecks that
>                  everybody seems to be moving or have already moved to
>                  neural network-based (perceptron *cough*) branch
>                  prediction.
>
>                  So well, how do we scientifically or meaningfully
>                  benchmark these days, that is the question...
>                  (especially for folks in academia needing to justify
>                  benchmark results...)
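
[Editor's sketch] One common, partial answer to the benchmarking question above is to repeat a measurement several times and report a robust summary such as the median, so that a single unlucky run does not dominate. The helper below is a minimal sketch of that idea, not something proposed in the thread, and the names are hypothetical. It does not address layout-dependent effects, which persist across repeated runs of the same binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n timing samples (n > 0). Sorts the array in place. */
static double median(double *samples, size_t n)
{
    assert(n > 0);
    qsort(samples, n, sizeof samples[0], cmp_double);
    if (n % 2 == 1)
        return samples[n / 2];
    return (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
}
```

Reporting the minimum alongside the median is also common, since the minimum approximates the run with the least interference.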

Kein-Hong Man (esq.)
Selangor, Malaysia
