Re: [Tinycc-devel] Huge swings in cache performance

On Thu, Jan 5, 2017 at 2:12 PM, avih <address@hidden> wrote:
>
> I can reproduce 30x variations on Windows with tcc64 (built with either gcc (mingw) or tcc64 itself), but for me -DNOPS=2, 5, and 9 are fast, and the other values (up to 9) are slow. I didn't check further.
>
> I also removed the #include <stdio.h> since it's not where tcc typically is, and it's not required as far as I can tell, and also removed the -B thingy (the tcc binary is in the distribution dir on windows and its default -B location doesn't include anything other than tcc files/libs/includes).

Same here. I removed the stdio include and the -B. flag; this is tcc version 0.9.26 (x86-64 Linux) on a recent desktop CPU.

Results are below: even NOPS values are slow and odd NOPS values are fast up to 8; after that it becomes unpredictable.

Hope that helps, -Harry

PS: My first post to TCC list - awesome project - thanks all! :)

time tcc -DNOPS=0 -run test.c
real    0m1.015s

time tcc -DNOPS=1 -run test.c
real    0m0.043s

time tcc -DNOPS=2 -run test.c
real    0m1.215s

time tcc -DNOPS=3 -run test.c
real    0m0.037s

time tcc -DNOPS=4 -run test.c
real    0m1.008s

time tcc -DNOPS=5 -run test.c
real    0m0.051s

time tcc -DNOPS=6 -run test.c
real    0m1.010s

time tcc -DNOPS=7 -run test.c
real    0m0.036s

time tcc -DNOPS=8 -run test.c
real    0m1.014s

time tcc -DNOPS=9 -run test.c
real    0m1.112s

time tcc -DNOPS=10 -run test.c
real    0m0.041s

time tcc -DNOPS=11 -run test.c
real    0m1.161s

time tcc -DNOPS=12 -run test.c
real    0m0.039s

time tcc -DNOPS=13 -run test.c
real    0m1.482s

time tcc -DNOPS=14 -run test.c
real    0m1.009s

time tcc -DNOPS=15 -run test.c
real    0m1.506s

time tcc -DNOPS=16 -run test.c
real    0m1.005s
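
For anyone reading without the gist to hand, here is a rough sketch of the kind of test being timed above. It is not the code from the gist: the names (counter, bump, run_calls), the use of plain nop instructions as padding, and the REP_n macros are all my own assumptions, based only on the description quoted below (some padding controlled by NOPS, followed by a write to a global variable).

```c
/* Hypothetical stand-in for the test file; NOT the code from the gist. */
#include <time.h>

#ifndef NOPS
#define NOPS 0                  /* padding count, normally set with -DNOPS=n */
#endif

/* Expand PADDING to NOPS copies of a nop instruction (0..16 supported). */
#define NOP __asm__ volatile("nop");
#define REP_0
#define REP_1  NOP
#define REP_2  REP_1 NOP
#define REP_3  REP_2 NOP
#define REP_4  REP_3 NOP
#define REP_5  REP_4 NOP
#define REP_6  REP_5 NOP
#define REP_7  REP_6 NOP
#define REP_8  REP_7 NOP
#define REP_9  REP_8 NOP
#define REP_10 REP_9 NOP
#define REP_11 REP_10 NOP
#define REP_12 REP_11 NOP
#define REP_13 REP_12 NOP
#define REP_14 REP_13 NOP
#define REP_15 REP_14 NOP
#define REP_16 REP_15 NOP
#define CAT_(a, b) a##b
#define CAT(a, b) CAT_(a, b)    /* extra level so NOPS is expanded first */
#define PADDING CAT(REP_, NOPS)

long counter = 0;               /* the global that the function modifies */

/* Padding that shifts the code layout, then a write to the global. */
void bump(void)
{
    PADDING
    counter++;
}

/* Call bump() the requested number of times; return CPU seconds used. */
double run_calls(long iterations)
{
    clock_t start = clock();
    long i;
    for (i = 0; i < iterations; i++)
        bump();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Adding a main that calls run_calls(10000000) and returns 0 gives something that can be timed as above with tcc -DNOPS=3 -run. I have not checked that this exact sketch reproduces the slow cases.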

>
>
> On Thursday, January 5, 2017 3:25 PM, David Mertens <address@hidden> wrote:
>
>
> Hello everyone,
>
> I have now written a very simple C program which gives highly erratic timing behavior when run under tcc -run. I have added this file to the gist; look for cache-test-simple.c here: https://gist.github.com/run4flat/fcbb6480275b1b9dcaa7a8d3a8084638
>
> The simple program does not attempt to produce a shared object library, and so should be runnable on any operating system that supports tcc -run, including Windows and Mac in addition to Linux. Here are some sample outputs on my machine:
>
> $ time ./tcc -B. -DNOPS=0 -run cache-test-simple.c
> real 0m0.052s
> $ time ./tcc -B. -DNOPS=1 -run cache-test-simple.c ***
> real 0m1.413s
> $ time ./tcc -B. -DNOPS=2 -run cache-test-simple.c
> real 0m0.069s
> $ time ./tcc -B. -DNOPS=3 -run cache-test-simple.c
> real 0m0.076s
> $ time ./tcc -B. -DNOPS=4 -run cache-test-simple.c ***
> real 0m1.158s
>
> The starred results are over an order of magnitude slower than the unstarred results.
>
> 1) Do others see this on other operating systems with 64-bit Intel processors?
> 2) Do others see this on any operating system with 64-bit AMD processors?
> 3) Do others see this on any operating system with any other architecture?
>
> Thanks!
> David
>
> On Thu, Jan 5, 2017 at 12:59 AM, David Mertens <address@hidden> wrote:
>
> Update: I *can* get this slowdown with tcc. The main trigger is to have a global variable that gets modified by the function.
>
> I have updated the gist: https://gist.github.com/run4flat/fcbb6480275b1b9dcaa7a8d3a8084638
>
> This program generates a single function filled with a collection of skipped operations (number of operations is a command-line option) and finished with a modification of a global variable. It compiles the function using tcc, then calls the function a specified number of times (repeat count specified via command-line). It can either generate code in-memory, or it can generate a .so file and load that using dlopen. (If it generates in-memory, it prints the size of the generated code.)
>
> Here are the interesting results on my machine, all for 10,000,000 iterations, using compilation-in-memory:
>
> N    Code Size (Bytes)    Time (s)
> 0    128                  2.52
> 1    144                  2.54
> 2    176                  2.57
> 3    208                  0.035
> 4    224                  0.058
> 5    256                  2.57
> 6    272                  0.060
>
> Switching over to a shared object file, I get these results (code size is size of the .so file):
> N    Code Size (Bytes)    Time (s)
> 0    2960                 0.057
> 1    2984                 0.040
> 2    3016                 0.058
> 3    3040                 0.039
> 4    3064                 0.040
> 5    3088                 0.060
> 6    3112                 0.063
>
> As you can see, the jit-compiled code shows odd 30x slowdowns depending on... something. The shared object file, on the other hand, performs consistently well.
>
> Two questions:
> 1) Can anybody reproduce these effects on their Linux machines, especially different architectures? (I can try an ARM tomorrow.)
> 2) Is there something special about how tcc builds a shared object file that is not happening with the jit-compiled code?
>
> Thanks!
> David
>
> --
> "Debugging is twice as hard as writing the code in the first place.
> Therefore, if you write the code as cleverly as possible, you are,
> by definition, not smart enough to debug it." -- Brian Kernighan
>
>
> _______________________________________________
> Tinycc-devel mailing list
> address@hidden
> https://lists.nongnu.org/mailman/listinfo/tinycc-devel
>
>

From:	Harry van Haaren
Subject:	Re: [Tinycc-devel] Huge swings in cache performance
Date:	Thu, 5 Jan 2017 15:59:48 +0000