Re: [Tinycc-devel] Huge swings in cache performance

From: KHMan
Subject: Re: [Tinycc-devel] Huge swings in cache performance
Date: Wed, 21 Dec 2016 11:29:53 +0800
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0

On 12/20/2016 10:17 PM, David Mertens wrote:
Hello Kein-Hong,

I'm not convinced this is entirely an unpredictable hardware
issue. The reason is that I can easily create similar
functionality with gcc (the usual Perl XS module, the normal means
for writing a C-based extension) and it does not show these kinds
of cache swings. I think there is something gcc does while
producing its machine code that makes it less susceptible to cache
misses. (Well, there are lots of things it does, I'm sure.) I'm
hoping there's one or two simple things that gcc does which tcc
misses and could implement.

Was the behavior observed with Lua noted when working with JIT?

I couldn't find the old posting, but it was along the lines of benchmark variability due to memory layout; see "Mytkowicz memory layout". IIRC, the discussion was about a small benchmark Lua script run under the interpreter. In one posting, changing an environment variable changed the program's total running time significantly, somewhere in the 20-50% range if I remember correctly. The timings were done casually and nobody did detailed follow-up research.

... which of course involved the same executable each time, so it is different from your case. Long day and all. But tcc is not much of an optimizing compiler; if the change caused register spilling in an inner loop, it would hammer memory access and account for at least some of the effects...
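
On the alignment point in the quoted message below (cache-line rather than page alignment), here is a minimal sketch of a bump-allocating pool that hands out chunks on 64-byte boundaries. The names code_pool, pool_init, pool_alloc and round_up_line are made up for illustration and are not part of libtcc or C::Blocks, and the 64-byte line size is an assumption that should be checked against the actual CPU. The sketch uses plain aligned_alloc, so the memory is not executable; a real JIT pool would presumably get its pages from mmap/mprotect instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed L1 cache line size in bytes; must be a power of two. */
#define CACHE_LINE ((size_t)64)

/* Hypothetical pool type: one contiguous buffer, carved up front to back. */
typedef struct {
    unsigned char *base; /* start of the buffer, cache-line aligned */
    size_t size;         /* total bytes, a multiple of CACHE_LINE */
    size_t used;         /* bytes handed out so far */
} code_pool;

/* Round n up to the next multiple of CACHE_LINE. */
static size_t round_up_line(size_t n)
{
    return (n + CACHE_LINE - 1) & ~(CACHE_LINE - 1);
}

/* Allocate the backing buffer. Returns 0 on success, -1 on failure. */
static int pool_init(code_pool *p, size_t size)
{
    assert((CACHE_LINE & (CACHE_LINE - 1)) == 0);
    size = round_up_line(size);
    p->base = aligned_alloc(CACHE_LINE, size);
    p->size = size;
    p->used = 0;
    return p->base ? 0 : -1;
}

/* Hand out n bytes starting on a cache-line boundary, or NULL if full. */
static void *pool_alloc(code_pool *p, size_t n)
{
    size_t need = round_up_line(n);
    if (need > p->size - p->used)
        return NULL;
    void *chunk = p->base + p->used;
    p->used += need;
    return chunk;
}
```

Each chunk starts on its own line, so two small compiled blocks never share a cache line, at the cost of some padding per chunk.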

On Tue, Dec 20, 2016 at 9:05 AM, KHMan wrote:

    On 12/20/2016 9:16 PM, David Mertens wrote:

        Hello everyone,

        Reminder/Background: C::Blocks is my Perl wrapper around
        my fork
        of tcc with extended symbol table support.

        I've begun writing benchmarks to seriously test how C::Blocks
        compares with other JIT and JIT-ish options for Perl. I've
        found a couple of situations in which slight modifications to
        the code
        cause a huge drop in performance. One benchmark went from
        370ms to
        5,000ms (i.e. 5 sec).

        The change to the code was so slight that I immediately
        suspected cache misses as the culprit. Running with linux's
        "perf" gave proof of that (hopefully this formats properly
        with fixed-width characters):

                        Fast    Slow  Significant
        time (ms)      370    5022    **
        instructions  3.5B    3.5B
        branches      640M    650M
        branch-miss   687k    671k
        dcache-miss   974k     71M    **
        icache-miss   3.2M     83M    **

        By dcache-miss I refer to what perf calls "L1 dcache load
        miss", and by icache-miss I refer to what perf calls "L1 icache
        load miss".

        I'm a bit confused on what would cause this sort of persistent
        cache miss behavior. In particular, I've tried working
        with highly
        distinct strategies for managing executable memory, including
        ensuring page alignment (wrong: it should be line-width
        of 64 bytes). This fixed a similar issue previously
        observed, but
        didn't seem to improve the situation here. I used malloc
        instead of Perl's built-in memory allocator. I created a pool for
        executable memory so that multiple chunks of executable
        code would
        all be written to the same page in memory. EVEN THIS did
        not fix
        this issue, which really surprised me since I would have
        thought adjacent memory would hash to different caches.

        I believe that what I've found is an issue with tcc, but I
        haven't golfed it down to a simple libtcc-consuming example. I can do
        that, but wanted to see if anybody could think of an obvious
        cause, and fix, without going to such lengths. If not, I
        will see
        if I can write a small reproducible example.

    This kind of behaviour was discussed on the Lua list not long
    ago. IIRC, for example changing environment variables changed
    the way a program is loaded, and the timing changed. Probably
    cache behaviour. It's like, what can we really benchmark anymore?

    When modern GHz parts have cache misses and need to access
    main memory, they cause such train wrecks that everybody seems
    to be moving or have already moved to neural network-based
    (perceptron *cough*) branch prediction.

    So well, how do we scientifically or meaningfully benchmark
    these days, that is the question... (especially for folks in
    academia needing to justify benchmark results...)

Kein-Hong Man (esq.)
Selangor, Malaysia
