Re: [Tinycc-devel] Huge swings in cache performance

From: David Mertens
Subject: Re: [Tinycc-devel] Huge swings in cache performance
Date: Tue, 20 Dec 2016 23:02:08 -0500

I forgot to mention: the function in question is a simple random number generator. It only contains 32-bit integer math operations, and does not contain any loops. For this benchmark, the looping occurs at the Perl level, so alignment optimizations for looping would not be important here. (This lets me compare many different Perl-to-C function invocation approaches to assess their speed.)
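For readers who want a concrete picture, here is a minimal sketch of a function of that shape: 32-bit integer math only, no loops. It is Marsaglia's xorshift32, chosen purely as an illustration; it is an assumption, not the generator used in the benchmark, and the name `xorshift32` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in only (NOT the benchmark's actual generator):
 * Marsaglia's xorshift32. Three shift-and-xor steps on a 32-bit
 * state, no loops. The state must be seeded with a nonzero value,
 * because zero maps to zero forever. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    assert(x != 0);   /* debug-only sanity check; compiled out with NDEBUG */
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;       /* the caller owns the state between calls */
    return x;
}
```

Called once per iteration from the Perl side, a function like this does so little work per call that code placement and call overhead can dominate the timing.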

On Tue, Dec 20, 2016 at 10:44 PM, David Mertens <address@hidden> wrote:
Discussion about alignment and execution speed for the Haskell compiler: https://ghc.haskell.org/trac/ghc/ticket/8279

This discussion mentions why things should be aligned, and gives some multi-byte no-ops that can be used for padding for aligned loops. http://stackoverflow.com/questions/18113995/performance-optimisations-of-x86-64-assembly-alignment-and-branch-prediction
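As a rough sketch of that padding idea: the table below holds the multi-byte NOP encodings recommended in Intel's instruction set reference, and a hypothetical helper, `emit_nop_padding`, fills the gap up to the next alignment boundary using the longest NOPs first. The helper name and the buffer interface are assumptions for illustration; nothing here is taken from tcc.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Recommended x86 multi-byte NOP encodings, 1 to 9 bytes long.
 * Row n-1 holds the n-byte NOP; unused trailing bytes are zero. */
static const uint8_t NOPS[9][9] = {
    {0x90},
    {0x66, 0x90},
    {0x0F, 0x1F, 0x00},
    {0x0F, 0x1F, 0x40, 0x00},
    {0x0F, 0x1F, 0x44, 0x00, 0x00},
    {0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00},
    {0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00},
    {0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
    {0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
};

/* Hypothetical helper: write NOPs into buf starting at offset until
 * the offset is a multiple of align (align must be nonzero). The
 * caller must ensure buf has room. Returns the number of bytes
 * written. */
static size_t emit_nop_padding(uint8_t *buf, size_t offset, size_t align)
{
    size_t pad = (align - (offset % align)) % align;
    size_t remaining = pad;
    uint8_t *out = buf + offset;

    while (remaining > 0) {
        /* Use the longest available NOP so fewer instructions decode. */
        size_t n = remaining > 9 ? 9 : remaining;
        memcpy(out, NOPS[n - 1], n);
        out += n;
        remaining -= n;
    }
    return pad;
}
```

A code generator would call something like this just before emitting a loop head or function entry that it wants on a 16- or 64-byte boundary.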

I came across a similar issue a few weeks ago, but I was able to "fix" it by allocating more memory than I needed and then relocating to an address within that allocation that was aligned to the start of a page. This seemed to fix the problem back then, but this new flavor of alignment woes is impervious to such a trick.
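The over-allocate-and-align trick can be sketched roughly as follows. This is an illustration under assumptions, not the C::Blocks code: `align_up` and `alloc_page_aligned` are hypothetical names, and the memory returned by `malloc` would still need to be made executable (for example with `mprotect`) before code could run from it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Round addr up to the next multiple of align.
 * align must be a nonzero power of two. */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/* Hypothetical helper: allocate size bytes plus enough slack to reach
 * a page boundary, and return the first page-aligned address inside
 * the allocation. The raw malloc pointer is stored in *raw_out; that
 * is the pointer to pass to free(), not the aligned one. */
static void *alloc_page_aligned(size_t size, void **raw_out)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return NULL;

    void *raw = malloc(size + (size_t)page - 1);
    if (raw == NULL)
        return NULL;

    *raw_out = raw;
    return (void *)align_up((uintptr_t)raw, (uintptr_t)page);
}
```

Passing 64 instead of the page size to `align_up` would give cache-line alignment. On POSIX systems, `posix_memalign` does the same job without the manual bookkeeping.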


On Tue, Dec 20, 2016 at 10:29 PM, KHMan <address@hidden> wrote:
On 12/20/2016 10:17 PM, David Mertens wrote:
Hello Kein-Hong,

I'm not convinced this is entirely an unpredictable hardware
issue. The reason is that I can easily create similar
functionality with gcc (the usual Perl XS module, the normal means
for writing a C-based extension) and it does not show these kinds
of cache swings. I think there is something gcc does while
producing its machine code that makes it less susceptible to cache
misses. (Well, there are lots of things it does, I'm sure.) I'm
hoping there's one or two simple things that gcc does which tcc
misses and could implement.

Was the behavior observed with Lua noted when working with JIT?

I couldn't find the old posting, but it was along the lines of benchmark variability due to memory layout; see "Mytkowicz memory layout". IIRC, the discussion was about a small benchmark Lua script run by the interpreter. In one posting, changing an environment variable changed the program's total running time significantly, in the 20-50% range IIRC. The timings were done casually and nobody did detailed follow-up research.

... which of course involved the same executables and is different from your case. Long day and all. But tcc is not much of an optimizing compiler; if the change caused register spilling in an inner loop, it would hammer memory access and account for at least some of the effects...

On Tue, Dec 20, 2016 at 9:05 AM, KHMan wrote:

    On 12/20/2016 9:16 PM, David Mertens wrote:

        Hello everyone,

        Reminder/Background: C::Blocks is my Perl wrapper around my
        fork of tcc with extended symbol table support.

        I've begun writing benchmarks to seriously test how C::Blocks
        compares with other JIT and JIT-ish options for Perl. I've
        found a couple of situations in which slight modifications to
        the code cause a huge drop in performance. One benchmark went
        from 370ms to 5,000ms (i.e. 5 sec).

        The change to the code was so slight that I immediately
        suspected cache misses as the culprit. Running with Linux's
        "perf" gave proof of that (hopefully this formats properly
        with fixed-width characters):

                      Fast    Slow  Significant
        time (ms)      370    5022    **
        instructions  3.5B    3.5B
        branches      640M    650M
        branch-miss   687k    671k
        dcache-miss   974k     71M    **
        icache-miss   3.2M     83M    **

        By dcache-miss I refer to what perf calls "L1 dcache load
        miss", and by icache-miss I refer to what perf calls "L1
        icache load miss".

        I'm a bit confused about what would cause this sort of
        persistent cache miss behavior. In particular, I've tried
        working with highly distinct strategies for managing
        executable memory, including ensuring page alignment (wrong:
        it should be line-width of 64 bytes). This fixed a similar
        issue previously observed, but didn't seem to improve the
        situation here. I used malloc instead of Perl's built-in
        memory allocator. I created a pool for executable memory so
        that multiple chunks of executable code would all be written
        to the same page in memory. EVEN THIS did not fix this issue,
        which really surprised me since I would have thought adjacent
        memory would hash to different caches.

        I believe that what I've found is an issue with tcc, but I
        haven't golfed it down to a simple libtcc-consuming example.
        I can do that, but wanted to see if anybody could think of an
        obvious cause, and fix, without going to such lengths. If
        not, I will see if I can write a small reproducible example.

    This kind of behaviour was discussed on the Lua list not long
    ago. IIRC, for example changing environment variables changed
    the way a program is loaded, and the timing changed. Probably
    cache behaviour. It's like, what can we really benchmark anymore?

    When modern GHz parts have cache misses and need to access
    main memory, they cause such train wrecks that everybody seems
    to be moving or have already moved to neural network-based
    (perceptron *cough*) branch prediction.

    So well, how do we scientifically or meaningfully benchmark
    these days, that is the question... (especially for folks in
    academia needing to justify benchmark results...)

Kein-Hong Man (esq.)
Selangor, Malaysia

 "Debugging is twice as hard as writing the code in the first place.
  Therefore, if you write the code as cleverly as possible, you are,
  by definition, not smart enough to debug it." -- Brian Kernighan