Re: [Tinycc-devel] Huge swings in cache performance

From: KHMan
Subject: Re: [Tinycc-devel] Huge swings in cache performance
Date: Wed, 21 Dec 2016 13:26:13 +0800
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0

On 12/21/2016 12:02 PM, David Mertens wrote:
I forgot to mention: the function in question is a simple random
number generator. It only contains 32-bit integer math operations,
and does not contain any loops. For this benchmark, the looping
occurs at the Perl level, so alignment optimizations for looping
would not be important here. (This lets me compare many different
Perl-to-C function invocation approaches to assess their speed.)

Curiouser and curiouser. The icache misses really bother me. Is it just the L1 caches that blew up, or are the L2 caches naughty as well? Would love to hear more of your progress on this list...

On Tue, Dec 20, 2016 at 10:44 PM, David Mertens wrote:

    Discussion about alignment and execution speed for the Haskell
    compiler: https://ghc.haskell.org/trac/ghc/ticket/8279

    This discussion mentions why things should be aligned, and
    gives some multi-byte no-ops that can be used for padding for
    aligned loops.

    I came across a similar issue a few weeks ago, but I was able
    to "fix" it by allocating more memory than I needed and then
    relocating to an address within that allocation that was
    aligned to the start of a page. This seemed to fix the problem
    back then, but this new flavor of alignment woes is impervious
    to such a trick.


    On Tue, Dec 20, 2016 at 10:29 PM, KHMan <address@hidden> wrote:

        On 12/20/2016 10:17 PM, David Mertens wrote:

            Hello Kein-Hong,

            I'm not convinced this is entirely an unpredictable
            issue. The reason is that I can easily create similar
            functionality with gcc (the usual Perl XS module, the
            normal means
            for writing a C-based extension) and it does not show
            these kinds
            of cache swings. I think there is something gcc does while
            producing its machine code that makes it less
            susceptible to cache
            misses. (Well, there are lots of things it does, I'm
            sure.) I'm
            hoping there's one or two simple things that gcc does
            which tcc
            misses and could implement.

            Was the behavior observed with Lua noted when working
            with JIT?

        I couldn't find the old posting, but it was along the lines
        of benchmark variability due to memory layout; see
        "Mytkowicz memory layout". IIRC, the discussion was about a
        small benchmark Lua script running on the interpreter: in
        one posting, changing an environment variable changed the
        program's total running time significantly, in the 20-50%
        range. The timings were done casually and nobody did
        detailed follow-up research.

        ... which of course involved the same executable, so it is
        different from your case. Long day and all. But tcc is not
        much of an optimizing compiler; if the change caused
        register spilling in an inner loop, it would hammer memory
        access and account for at least some of the effects...

            On Tue, Dec 20, 2016 at 9:05 AM, KHMan wrote:

                 On 12/20/2016 9:16 PM, David Mertens wrote:

                     Hello everyone,

            Reminder/Background: C::Blocks is my Perl wrapper
            around my fork of tcc with extended symbol table
            support.

            I've begun writing benchmarks to seriously test how
            C::Blocks compares with other JIT and JIT-ish options
            for Perl. I've hit a couple of situations in which
            slight modifications to the code cause a huge drop in
            performance. One benchmark went from 370ms to 5,000ms
            (i.e. 5 sec).

            The change to the code was so slight that I suspected
            cache misses as the culprit. Running with linux's
            "perf" gave proof of that (hopefully this formats
            properly with fixed-width characters):

                                     Fast    Slow  Significant
                     time (ms)      370    5022    **
                     instructions  3.5B    3.5B
                     branches      640M    650M
                     branch-miss   687k    671k
                     dcache-miss   974k     71M    **
                     icache-miss   3.2M     83M    **

            By dcache-miss I refer to what perf calls "L1 dcache
            load miss", and by icache-miss I refer to what perf
            calls "L1 icache load miss".

            I'm a bit confused about what would cause this sort of
            persistent cache miss behavior. In particular, I've
            tried highly distinct strategies for managing
            executable memory, including ensuring page alignment
            (wrong: it should be an alignment of 64 bytes). This
            fixed a similar issue I had observed before, but
            didn't seem to improve the situation here. I used
            malloc instead of Perl's built-in memory allocator. I
            created a pool for executable memory so that multiple
            chunks of code would all be written to the same page
            in memory. EVEN THIS did not fix the issue, which
            really surprised me since I would have thought
            adjacent memory would hash to different cache sets.

            I believe that what I've found is an issue with tcc,
            but I haven't yet golfed it down to a simple
            libtcc-consuming example. I can do that, but wanted to
            see if anybody could think of an obvious cause, and
            fix, without going to such lengths. If not, I will see
            if I can write a small reproducible example.

                 This kind of behaviour was discussed on the Lua
            list not long
                 ago. IIRC, for example changing environment
            variables changed
                 the way a program is loaded, and the timing
            changed. Probably
                 cache behaviour. It's like, what can we really
            benchmark anymore?

                 When modern GHz parts have cache misses and need
            to access
                 main memory, they cause such train wrecks that
            everybody seems
                 to be moving or have already moved to neural
                 (perceptron *cough*) branch prediction.

                 So well, how do we benchmark scientifically or
                 meaningfully these days, that is the question...
                 (especially for folks in academia needing to
                 justify benchmark results...)

Kein-Hong Man (esq.)
Selangor, Malaysia
