
Re: [Tinycc-devel] Huge swings in cache performance


From: KHMan
Subject: Re: [Tinycc-devel] Huge swings in cache performance
Date: Fri, 06 Jan 2017 13:47:59 +0800
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0

On 1/6/2017 11:44 AM, David Mertens wrote:
Spot on, grischka!

Initial experiments indicate that changing the offset alignment
from 16 to 512 bytes (i.e. 15 to 511) solves this problem. I'll
try a few more experiments to be sure, though. I suspect that this
number should be tuned to the underlying architecture, but that's
not central to the discussion.
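In other words, each section offset gets rounded up to a larger
power-of-two boundary. A minimal standalone illustration of the effect
(the ALIGN_UP helper and the sample offset below are just for
demonstration, not actual tccrun.c code):

    #include <stdio.h>

    /* Round x up to the next multiple of align (align must be a power of two). */
    #define ALIGN_UP(x, align) (((x) + (align) - 1) & ~((unsigned long)(align) - 1))

    int main(void)
    {
        unsigned long offset = 200;  /* an arbitrary running section offset */
        printf("16-byte alignment:  %lu\n", ALIGN_UP(offset, 16));   /* -> 208 */
        printf("512-byte alignment: %lu\n", ALIGN_UP(offset, 512));  /* -> 512 */
        return 0;
    }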

One thing that can trash the I$ and D$ this way is a smart prefetcher... Any newish CPU could be susceptible.

David

On Thu, Jan 5, 2017 at 6:46 PM, grischka wrote:

    You might try larger "section alignment" for -run:

    in tccrun.c:208 instead of
             offset = (offset + 15) & ~15;
    for example
             offset = (offset + 63) & ~63;

    This would add more space between your "foo" data variable and
    the instructions in memory.
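
    The general pattern is (offset + (N-1)) & ~(N-1), which rounds
    offset up to the next multiple of N provided N is a power of two.
    A tiny standalone check of that identity (illustration only, not
    tccrun.c code):

        #include <assert.h>
        #include <stdio.h>

        int main(void)
        {
            unsigned long offset, n;
            /* The mask trick must agree with ordinary round-up division
               for power-of-two alignments. */
            for (n = 16; n <= 512; n <<= 1)
                for (offset = 0; offset < 4096; offset++)
                    assert(((offset + n - 1) & ~(n - 1))
                           == ((offset + n - 1) / n) * n);
            printf("round-up identity holds for alignments 16..512\n");
            return 0;
        }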

    --- grischka


    Harry van Haaren wrote:

        On Thu, Jan 5, 2017 at 2:12 PM, avih <address@hidden> wrote:

            I can reproduce x30 variations on Windows with tcc64 (built
            either using gcc (mingw) or using tcc64 itself), but for me
            -DNOPS=2 or 5 or 9 are fast, and the others (up to 9) are
            slow. I didn't check further.

            I also removed the #include <stdio.h> since it's not where
            tcc typically is, and it's not required as far as I can
            tell, and also removed the -B thingy (the tcc binary is in
            the distribution dir on Windows and its default -B location
            doesn't include anything other than tcc files/libs/includes).

        Same here, removed the stdio include and the -B. flag, tcc
        version 0.9.26 (x86-64 Linux), recent desktop CPU.
        Results (below): even NOPS are bad, odd NOPS are good up to 8,
        then it becomes unpredictable.

        Hope that helps, -Harry

        PS: My first post to the TCC list - awesome project - thanks all! :)


        time tcc -DNOPS=0 -run test.c
        real    0m1.015s

        time tcc -DNOPS=1 -run test.c
        real    0m0.043s

        time tcc -DNOPS=2 -run test.c
        real    0m1.215s

        time tcc -DNOPS=3 -run test.c
        real    0m0.037s

        time tcc -DNOPS=4 -run test.c
        real    0m1.008s

        time tcc -DNOPS=5 -run test.c
        real    0m0.051s

        time tcc -DNOPS=6 -run test.c
        real    0m1.010s

        time tcc -DNOPS=7 -run test.c
        real    0m0.036s

        time tcc -DNOPS=8 -run test.c
        real    0m1.014s

        time tcc -DNOPS=9 -run test.c
        real    0m1.112s

        time tcc -DNOPS=10 -run test.c
        real    0m0.041s

        time tcc -DNOPS=11 -run test.c
        real    0m1.161s

        time tcc -DNOPS=12 -run test.c
        real    0m0.039s

        time tcc -DNOPS=13 -run test.c
        real    0m1.482s

        time tcc -DNOPS=14 -run test.c
        real    0m1.009s

        time tcc -DNOPS=15 -run test.c
        real    0m1.506s

        time tcc -DNOPS=16 -run test.c
        real    0m1.005s




            On Thursday, January 5, 2017 3:25 PM, David Mertens
            <address@hidden> wrote:


            Hello everyone,

            I have now written a very simple C program which gives
            highly erratic timing behavior when run under tcc -run. I
            have added this file to the gist; look for
            cache-test-simple.c here:
            https://gist.github.com/run4flat/fcbb6480275b1b9dcaa7a8d3a8084638
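
            (The gist has the actual file. Purely as a guess at its
            shape from the description in this thread - the no-op used,
            the names, and the iteration count below are all
            hypothetical - it is presumably along these lines:)

                /* Hypothetical sketch; see the gist for the real cache-test-simple.c. */
                int global = 0;              /* global modified by the function */

                void foo(void)
                {
                #if NOPS >= 1
                    __asm__("nop");          /* one guess at a "skipped operation" */
                #endif
                #if NOPS >= 2
                    __asm__("nop");
                #endif
                    /* ...and so on, up to the -DNOPS=N value... */
                    global++;
                }

                int main(void)
                {
                    int i;
                    for (i = 0; i < 10000000; i++)  /* iteration count is a guess */
                        foo();
                    return global != 10000000;
                }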

            The simple program does not attempt to produce a shared
            object library, and so it should be runnable on any
            operating system that supports tcc -run, including Windows
            and Mac in addition to Linux. Here are some sample outputs
            on my machine:

            $ time ./tcc -B. -DNOPS=0 -run cache-test-simple.c
            real    0m0.052s
            $ time ./tcc -B. -DNOPS=1 -run cache-test-simple.c  ***
            real    0m1.413s
            $ time ./tcc -B. -DNOPS=2 -run cache-test-simple.c
            real    0m0.069s
            $ time ./tcc -B. -DNOPS=3 -run cache-test-simple.c
            real    0m0.076s
            $ time ./tcc -B. -DNOPS=4 -run cache-test-simple.c  ***
            real    0m1.158s

            The starred results are over an order of magnitude slower
            than the unstarred results.

            1) Do others see this on other operating systems with
            64-bit Intel processors?
            2) Do others see this on any operating system with 64-bit
            AMD processors?
            3) Do others see this on any operating system with any
            other architecture?

            Thanks!
            David

            On Thu, Jan 5, 2017 at 12:59 AM, David Mertens
            <address@hidden> wrote:

            Update: I *can* get this slowdown with tcc. The main
            trigger is to have a global variable that gets modified by
            the function.

            I have updated the gist:
            https://gist.github.com/run4flat/fcbb6480275b1b9dcaa7a8d3a8084638

            This program generates a single function filled with a
            collection of skipped operations (the number of operations
            is a command-line option) and finishes with a modification
            of a global variable. It compiles the function using tcc,
            then calls the function a specified number of times (the
            repeat count is specified via the command line). It can
            either generate code in memory, or it can generate a .so
            file and load that using dlopen. (If it generates code in
            memory, it prints the size of the generated code.)
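
            (For anyone who wants to poke at just the in-memory half
            without the generator, here is a minimal libtcc sketch of
            that path; the code string and symbol name are made up, and
            it assumes the libtcc API of tcc 0.9.26, linked against
            libtcc:)

                #include <libtcc.h>

                int main(void)
                {
                    const char *code =
                        "int global;\n"
                        "void foo(void) { global++; }\n";  /* stand-in for the generated code */
                    TCCState *s = tcc_new();
                    void (*foo)(void);
                    long i;

                    tcc_set_output_type(s, TCC_OUTPUT_MEMORY);
                    if (tcc_compile_string(s, code) < 0)
                        return 1;
                    if (tcc_relocate(s, TCC_RELOCATE_AUTO) < 0)   /* lay the code out in memory */
                        return 1;
                    foo = (void (*)(void))tcc_get_symbol(s, "foo");
                    if (!foo)
                        return 1;

                    for (i = 0; i < 10000000; i++)                /* time this loop */
                        foo();

                    tcc_delete(s);
                    return 0;
                }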

            Here are the interesting results on my machine, all for
            10,000,000 iterations, using compilation in memory:

            N   Code Size (Bytes)   Time (s)
            0                 128       2.52
            1                 144       2.54
            2                 176       2.57
            3                 208       0.035
            4                 224       0.058
            5                 256       2.57
            6                 272       0.060

            Switching over to a shared object file, I get these results
            (code size is the size of the .so file):

            N   Code Size (Bytes)   Time (s)
            0                2960       0.057
            1                2984       0.040
            2                3016       0.058
            3                3040       0.039
            4                3064       0.040
            5                3088       0.060
            6                3112       0.063

            As you can see, the jit-compiled code shows odd 30x speed
            drops depending on... something. The shared object file, on
            the other hand, has consistently sound performance.

            Two questions:
            1) Can anybody reproduce these effects on their Linux
            machines, especially on different architectures? (I can try
            an ARM tomorrow.)
            2) Is there something special about how tcc builds a shared
            object file that is not happening with the jit-compiled
            code?

            Thanks!
            David






--
  "Debugging is twice as hard as writing the code in the first place.
   Therefore, if you write the code as cleverly as possible, you are,
   by definition, not smart enough to debug it." -- Brian Kernighan


_______________________________________________
Tinycc-devel mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/tinycc-devel


--
Cheers,
Kein-Hong Man (esq.)
Selangor, Malaysia



