From: | David Mertens |
Subject: | Re: [Tinycc-devel] Huge swings in cache performance |
Date: | Fri, 6 Jan 2017 17:14:20 -0500 |
On 1/6/2017 11:44 AM, David Mertens wrote:
Spot on, grischka!
Initial experiments indicate that changing the offset alignment
from 16 to 512 bytes (i.e. changing the constant from 15 to 511)
solves this problem. I'll try a few more experiments to be sure,
though. I suspect that this number should be tuned to the underlying
architecture, but that's not central to the discussion.
One thing that can trash I$ and D$ this way is a smart prefetcher... Any newish CPU could be susceptible.
David
On Thu, Jan 5, 2017 at 6:46 PM, grischka wrote:
You might try larger "section alignment" for -run:
in tccrun.c:208 instead of
    offset = (offset + 15) & ~15;
for example
    offset = (offset + 63) & ~63;
This would add more space between your "foo" data variable and
the instructions in memory.
--- grischka
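To make the rounding concrete, here is a minimal sketch of the same round-up-to-a-power-of-two idiom as a standalone helper. The name align_up and the sample offset 200 are illustrative assumptions, not taken from tccrun.c; only the "(offset + N-1) & ~(N-1)" expression mirrors the lines quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of align.
 * align must be a power of two, as with 16, 64 or 512 in this thread.
 * (Hypothetical helper; tccrun.c writes the expression inline.) */
static size_t align_up(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + (align - 1)) & ~(align - 1);
}
```

With an offset of 200, align_up(200, 16) gives 208, align_up(200, 64) gives 256 and align_up(200, 512) gives 512, so a larger alignment pushes the start of the next section further from the end of the previous one.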
Harry van Haaren wrote:
On Thu, Jan 5, 2017 at 2:12 PM, avih <address@hidden> wrote:
I can reproduce x30 variations on Windows with tcc64 (built either using
gcc (mingw) or using tcc64 itself), but for me -DNOPS=2 or 5 or 9 are
fast, and the others (up to 9) are slow. I didn't check further.
I also removed the #include <stdio.h> since it's not where tcc typically
is, and it's not required as far as I can tell, and also removed the -B
thingy (the tcc binary is in the distribution dir on Windows and its
default -B location doesn't include anything other than tcc
files/libs/includes).
Same here, removed the stdio include and the -B. flag; tcc version 0.9.26
(x86-64 Linux), recent desktop CPU.
Results (below): even NOPS are bad, odd NOPS are good up to 8, then it
becomes unpredictable.
Hope that helps, -Harry
PS: My first post to TCC list - awesome project - thanks all! :)
time tcc -DNOPS=0 -run test.c
real 0m1.015s
time tcc -DNOPS=1 -run test.c
real 0m0.043s
time tcc -DNOPS=2 -run test.c
real 0m1.215s
time tcc -DNOPS=3 -run test.c
real 0m0.037s
time tcc -DNOPS=4 -run test.c
real 0m1.008s
time tcc -DNOPS=5 -run test.c
real 0m0.051s
time tcc -DNOPS=6 -run test.c
real 0m1.010s
time tcc -DNOPS=7 -run test.c
real 0m0.036s
time tcc -DNOPS=8 -run test.c
real 0m1.014s
time tcc -DNOPS=9 -run test.c
real 0m1.112s
time tcc -DNOPS=10 -run test.c
real 0m0.041s
time tcc -DNOPS=11 -run test.c
real 0m1.161s
time tcc -DNOPS=12 -run test.c
real 0m0.039s
time tcc -DNOPS=13 -run test.c
real 0m1.482s
time tcc -DNOPS=14 -run test.c
real 0m1.009s
time tcc -DNOPS=15 -run test.c
real 0m1.506s
time tcc -DNOPS=16 -run test.c
real 0m1.005s
On Thursday, January 5, 2017 3:25 PM, David Mertens <address@hidden> wrote:
Hello everyone,
I have now written a very simple C program which gives highly erratic
timing behavior when run under tcc -run. I have added this file to the
gist; look for cache-test-simple.c here:
https://gist.github.com/run4flat/fcbb6480275b1b9dcaa7a8d3a8084638
The simple program does not attempt to produce a shared object library,
and so should be runnable on any operating system that supports tcc -run,
including Windows and Mac in addition to Linux. Here are some sample
outputs on my machine:
$ time ./tcc -B. -DNOPS=0 -run cache-test-simple.c
real 0m0.052s
$ time ./tcc -B. -DNOPS=1 -run cache-test-simple.c ***
real 0m1.413s
$ time ./tcc -B. -DNOPS=2 -run cache-test-simple.c
real 0m0.069s
$ time ./tcc -B. -DNOPS=3 -run cache-test-simple.c
real 0m0.076s
$ time ./tcc -B. -DNOPS=4 -run cache-test-simple.c ***
real 0m1.158s
The starred results are over an order of magnitude slower than the
unstarred results.
1) Do others see this on other operating systems with 64-bit Intel
processors?
2) Do others see this on any operating system with 64-bit AMD processors?
3) Do others see this on any operating system with any other architecture?
Thanks!
David
On Thu, Jan 5, 2017 at 12:59 AM, David Mertens <address@hidden> wrote:
Update: I *can* get this slowdown with tcc. The main trigger is to have a
global variable that gets modified by the function.
I have updated the gist:
https://gist.github.com/run4flat/fcbb6480275b1b9dcaa7a8d3a8084638
This program generates a single function filled with a collection of
skipped operations (number of operations is a command-line option) and
finished with a modification of a global variable. It compiles the
function using tcc, then calls the function a specified number of times
(repeat count specified via command-line). It can either generate code
in-memory, or it can generate a .so file and load that using dlopen.
(If it generates in-memory, it prints the size of the generated code.)
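The measurement loop described above can be sketched roughly as follows. This is not the code from the gist: bump and time_calls are hypothetical names, the skipped operations and the tcc compilation step are left out, and only the "call a function that modifies a global a given number of times and time it" part is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Global that the timed function modifies, like "foo" in the thread. */
static volatile long foo = 0;

/* Stand-in for the generated function: it only bumps the global. */
static void bump(void)
{
    foo++;
}

/* Call fn() repeat times and return the elapsed CPU time in seconds. */
static double time_calls(void (*fn)(void), long repeat)
{
    assert(fn != NULL && repeat >= 0);
    clock_t start = clock();
    for (long i = 0; i < repeat; i++)
        fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_calls(bump, 10000000) would correspond to the 10,000,000 iterations used for the results below; in the real test the function pointer would come from the jit-compiled code or from dlopen rather than from a statically compiled function.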
Here are the interesting results on my machine, all for 10,000,000
iterations, using compilation-in-memory:
N   Code Size (Bytes)   Time (s)
0   128                 2.52
1   144                 2.54
2   176                 2.57
3   208                 0.035
4   224                 0.058
5   256                 2.57
6   272                 0.060
Switching over to a shared object file, I get these results (code size is
size of the .so file):
N   Code Size (Bytes)   Time (s)
0   2960                0.057
1   2984                0.040
2   3016                0.058
3   3040                0.039
4   3064                0.040
5   3088                0.060
6   3112                0.063
As you can see, the jit-compiled code has odd jumps of 30x speed drops
depending on... something. The shared object file, on the other hand, has
consistently sound performance.
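As a rough way to quantify the contrast, the sketch below computes the ratio of the slowest to the fastest timing in each of the two tables above. The helper name spread and the array names are illustrative assumptions; the numbers are copied from the tables.

```c
#include <assert.h>
#include <stddef.h>

/* Ratio of the slowest to the fastest timing in t[0..n-1]. */
static double spread(const double *t, size_t n)
{
    assert(t != NULL && n > 0);
    double lo = t[0], hi = t[0];
    for (size_t i = 1; i < n; i++) {
        if (t[i] < lo) lo = t[i];
        if (t[i] > hi) hi = t[i];
    }
    return hi / lo;
}

/* Timings in seconds for N = 0..6, copied from the two tables. */
static const double jit_times[] = { 2.52, 2.54, 2.57, 0.035, 0.058, 2.57, 0.060 };
static const double so_times[]  = { 0.057, 0.040, 0.058, 0.039, 0.040, 0.060, 0.063 };
```

For these figures spread(jit_times, 7) comes out at roughly 73, while spread(so_times, 7) is about 1.6, so the in-memory timings vary by well over an order of magnitude and the shared-object timings stay within a factor of two.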
Two questions:
1) Can anybody reproduce these effects on their Linux machines,
especially different architectures? (I can try an ARM tomorrow.)
2) Is there something special about how tcc builds a shared object file
that is not happening with the jit-compiled code?
Thanks!
David
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." -- Brian Kernighan
_______________________________________________
Tinycc-devel mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/tinycc-devel
--
Cheers,
Kein-Hong Man (esq.)
Selangor, Malaysia