I forgot to mention: the function in question is a simple random
number generator. It only contains 32-bit integer math operations,
and does not contain any loops. For this benchmark, the looping
occurs at the Perl level, so alignment optimizations for looping
would not be important here. (This lets me compare many different
Perl-to-C function invocation approaches to assess their speed.)
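To give a concrete picture, the function is of roughly this shape (an illustrative sketch only, not the actual generator; the xorshift-style update and the names are placeholders):

    /* Illustrative sketch only: a loop-free generator built purely from
     * 32-bit integer operations, in the same spirit as the function
     * being benchmarked (the real code differs). */
    #include <stdint.h>

    uint32_t next_rand(uint32_t *state) {
        uint32_t x = *state;
        x ^= x << 13;   /* shifts and xors only, no loops */
        x ^= x >> 17;
        x ^= x << 5;
        *state = x;
        return x;
    }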
On Tue, Dec 20, 2016 at 10:44 PM, David Mertens wrote:
Discussion about alignment and execution speed for the Haskell compiler:
https://ghc.haskell.org/trac/ghc/ticket/8279

This discussion mentions why things should be aligned, and gives some multi-byte no-ops that can be used for padding for aligned loops:

http://stackoverflow.com/questions/18113995/performance-optimisations-of-x86-64-assembly-alignment-and-branch-prediction
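For quick reference, the multi-byte no-op encodings discussed there look roughly like this, written out as C byte arrays (a sketch from memory; see the links above and Intel's optimization manual for the authoritative table):

    /* Intel-recommended x86-64 multi-byte NOP encodings, usable as
     * padding in front of an alignment boundary. */
    static const unsigned char nop1[] = { 0x90 };                         /* nop                   */
    static const unsigned char nop2[] = { 0x66, 0x90 };                   /* 66-prefixed nop       */
    static const unsigned char nop3[] = { 0x0F, 0x1F, 0x00 };             /* nop DWORD [rax]       */
    static const unsigned char nop4[] = { 0x0F, 0x1F, 0x40, 0x00 };       /* nop DWORD [rax+0]     */
    static const unsigned char nop5[] = { 0x0F, 0x1F, 0x44, 0x00, 0x00 }; /* nop DWORD [rax+rax+0] */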
I came across a similar issue a few weeks ago, but I was able
to "fix" it by allocating more memory than I needed and then
relocating to an address within that allocation that was
aligned to the start of a page. This seemed to fix the problem
back then, but this new flavor of alignment woes is impervious
to such a trick.
David
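The over-allocate-and-relocate trick described above looks roughly like this (a sketch assuming 4 KB pages; the function and variable names are made up for illustration and are not C::Blocks code):

    #include <stdlib.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096  /* assumed page size for illustration */

    /* Allocate `size` bytes plus slack, then return the first
     * page-aligned address inside the allocation.  The raw pointer
     * is handed back through raw_out so it can be free()d later. */
    void *alloc_page_aligned(size_t size, void **raw_out) {
        void *raw = malloc(size + PAGE_SIZE);
        if (!raw) return NULL;
        *raw_out = raw;
        uintptr_t addr = (uintptr_t)raw;
        addr = (addr + PAGE_SIZE - 1) & ~(uintptr_t)(PAGE_SIZE - 1);
        return (void *)addr;
    }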
On Tue, Dec 20, 2016 at 10:29 PM, KHMan <address@hidden> wrote:
On 12/20/2016 10:17 PM, David Mertens wrote:
Hello Kein-Hong,

I'm not convinced this is entirely an unpredictable hardware issue. The reason is that I can easily create similar functionality with gcc (via the usual Perl XS module, the normal means of writing a C-based extension), and it does not show these kinds of cache swings. I think there is something gcc does while producing its machine code that makes it less susceptible to cache misses. (Well, there are lots of things it does, I'm sure.) I'm hoping there are one or two simple things that gcc does which tcc misses and could implement.

Was the behavior observed with Lua seen when working with the JIT?
I couldn't find the old posting, but it was along the lines of benchmark variability due to memory layout; see "Mytkowicz memory layout". IIRC, the discussion was about a small benchmark Lua script running on the interpreter. In one posting, changing an environment variable changed the program's total running time significantly, IIRC by something in the 20-50% range. The timings were done casually and nobody did detailed follow-up research.

... which of course involves the same executables, so it is different from your case. Long day and all. But tcc is not much of an optimizing compiler; if the change caused register spilling in an inner loop, it would hammer memory access and account for at least some of the effects...
On Tue, Dec 20, 2016 at 9:05 AM, KHMan wrote:
On 12/20/2016 9:16 PM, David Mertens wrote:
Hello everyone,

Reminder/Background: C::Blocks is my Perl wrapper around my fork of tcc with extended symbol table support.

I've begun writing benchmarks to seriously test how C::Blocks compares with other JIT and JIT-ish options for Perl. I've noticed a couple of situations in which slight modifications to the code cause a huge drop in performance. One benchmark went from 370ms to 5,000ms (i.e. 5 sec).

The change to the code was so slight that I immediately suspected cache misses as the culprit. Running with Linux's "perf" command gave proof of that (hopefully this formats properly with fixed-width characters):
                  Fast    Slow    Significant
    time (ms)     370     5022    **
    instructions  3.5B    3.5B
    branches      640M    650M
    branch-miss   687k    671k
    dcache-miss   974k    71M     **
    icache-miss   3.2M    83M     **
By dcache-miss I refer to what perf calls "L1 dcache load miss", and by icache-miss I refer to what perf calls "L1 icache load miss".

I'm a bit confused about what would cause this sort of persistent cache miss behavior. In particular, I've tried working with highly distinct strategies for managing executable memory, including ensuring page alignment (wrong: it should be line-width alignment of 64 bytes). This fixed a similar issue previously observed, but didn't seem to improve the situation here. I used malloc instead of Perl's built-in memory allocator. I created a pool for executable memory so that multiple chunks of executable code would all be written to the same page in memory. EVEN THIS did not fix the issue, which really surprised me, since I would have thought adjacent memory would hash to different cache sets.
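For what it's worth, the pooled-executable-memory strategy described above can be sketched roughly like this (illustrative only, assuming POSIX mmap on Linux; none of these names come from C::Blocks or tcc):

    #include <stddef.h>
    #include <sys/mman.h>

    #define POOL_SIZE (64 * 1024)  /* one shared pool for many code chunks */
    #define LINE      64           /* pad each chunk out to a cache line   */

    static unsigned char *pool;
    static size_t pool_used;

    /* Hand out cache-line-aligned slots from a single RWX mapping so
     * that several compiled chunks end up on the same pages. */
    void *pool_alloc_exec(size_t size) {
        if (!pool) {
            pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE | PROT_EXEC,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (pool == MAP_FAILED) { pool = NULL; return NULL; }
            pool_used = 0;
        }
        size_t padded = (size + LINE - 1) & ~(size_t)(LINE - 1);
        if (pool_used + padded > POOL_SIZE) return NULL;
        void *chunk = pool + pool_used;
        pool_used += padded;
        return chunk;
    }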
I believe that what I've found is an issue with tcc, but I haven't golfed it down to a simple libtcc-consuming example. I can do that, but wanted to see if anybody could think of an obvious cause, and fix, without going to such lengths. If not, I will see if I can write a small reproducible example.
This kind of behaviour was discussed on the Lua list not long ago. IIRC, for example, changing environment variables changed the way a program is loaded, and the timing changed. Probably cache behaviour. It's like, what can we really benchmark anymore?

When modern GHz parts have cache misses and need to access main memory, they cause such train wrecks that everybody seems to be moving, or to have already moved, to neural network-based (perceptron *cough*) branch prediction.

So well, how do we scientifically or meaningfully benchmark these days, that is the question... (especially for folks in academia needing to justify benchmark results...)