Re: [Tinycc-devel] Huge swings in cache performance

From: David Mertens
Subject: Re: [Tinycc-devel] Huge swings in cache performance
Date: Tue, 20 Dec 2016 09:17:27 -0500

Hello Kein-Hong,

I'm not convinced this is entirely an unpredictable hardware issue. The reason is that I can easily create similar functionality with gcc (via the usual Perl XS module, the normal means for writing a C-based extension) and it does not show these kinds of cache swings. I think there is something gcc does while producing its machine code that makes it less susceptible to cache misses. (Well, there are lots of things it does, I'm sure.) I'm hoping there are one or two simple things that gcc does which tcc misses and could implement.

Was the behavior observed with Lua noted when working with JIT?


On Tue, Dec 20, 2016 at 9:05 AM, KHMan <address@hidden> wrote:
On 12/20/2016 9:16 PM, David Mertens wrote:
Hello everyone,

Reminder/Background: C::Blocks is my Perl wrapper around my fork
of tcc with extended symbol table support.

I've begun writing benchmarks to seriously test how C::Blocks
compares with other JIT and JIT-ish options for Perl. I've noticed
a couple of situations in which slight modifications to the code
cause a huge drop in performance. One benchmark went from 370ms to
5,000ms (i.e. 5 sec).

The change to the code was so slight that I immediately suspected
cache misses as the culprit. Running with Linux's "perf" command
gave proof of that (hopefully this formats properly with
fixed-width characters):

               Fast    Slow  Significant
time (ms)       370    5022  **
instructions   3.5B    3.5B
branches       640M    650M
branch-miss    687k    671k
dcache-miss    974k     71M  **
icache-miss    3.2M     83M  **

By dcache-miss I refer to what perf calls "L1 dcache load miss",
and by icache-miss I refer to what perf calls "L1 icache load miss".

I'm a bit confused about what would cause this sort of persistent
cache miss behavior. In particular, I've tried working with highly
distinct strategies for managing executable memory, including
ensuring page alignment (wrong: it should be line-width alignment
of 64 bytes). This fixed a similar issue previously observed, but
didn't seem to improve the situation here. I used malloc instead
of Perl's built-in memory allocator. I created a pool for
executable memory so that multiple chunks of executable code would
all be written to the same page in memory. EVEN THIS did not fix
this issue, which really surprised me since I would have thought
adjacent memory would map to different cache sets.
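
For readers who want to experiment with the pooled, line-aligned layout described above, here is a minimal sketch. It assumes Linux `mmap`/`mprotect` and a 64-byte L1 line. The names `exec_pool`, `exec_pool_init`, `exec_pool_alloc`, `exec_pool_seal`, and `exec_pool_destroy` are hypothetical and are not part of tcc or C::Blocks. It only illustrates packing several code chunks into one mapping at cache-line boundaries, and makes no claim to reproduce or fix the swings reported here.

```c
#define _DEFAULT_SOURCE           /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define CACHE_LINE 64u            /* assumed L1 line size; query the CPU in real code */

/* Hypothetical bump-allocator pool for JIT-compiled code. */
typedef struct {
    unsigned char *base;          /* start of the mmap'd region (page aligned) */
    size_t size;                  /* total bytes in the region */
    size_t used;                  /* bytes handed out, always a multiple of CACHE_LINE */
} exec_pool;

/* Map a writable region; returns 0 on success, -1 on failure. */
static int exec_pool_init(exec_pool *p, size_t size)
{
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    p->base = mem;
    p->size = size;
    p->used = 0;
    return 0;
}

/* Hand out n bytes starting on a cache-line boundary, or NULL if full. */
static void *exec_pool_alloc(exec_pool *p, size_t n)
{
    size_t rounded = (n + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
    void *out;

    assert(p->used % CACHE_LINE == 0);
    if (n == 0 || rounded < n || rounded > p->size - p->used)
        return NULL;              /* empty request, overflow, or pool exhausted */
    out = p->base + p->used;
    p->used += rounded;
    return out;
}

/* After all code has been written, flip the whole pool to read+execute. */
static int exec_pool_seal(exec_pool *p)
{
    return mprotect(p->base, p->size, PROT_READ | PROT_EXEC);
}

/* Release the mapping. */
static void exec_pool_destroy(exec_pool *p)
{
    munmap(p->base, p->size);
    p->base = NULL;
    p->size = p->used = 0;
}
```

A caller would allocate one chunk per compiled block, copy or relocate the machine code into it, and call `exec_pool_seal` once before running anything. Rounding each request up to a multiple of 64 means no two chunks share a cache line, at the cost of some padding.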

I believe that what I've found is an issue with tcc, but I haven't
golfed it down to a simple libtcc-consuming example. I can do
that, but wanted to see if anybody could think of an obvious
cause, and fix, without going to such lengths. If not, I will see
if I can write a small reproducible example.

This kind of behaviour was discussed on the Lua list not long ago. IIRC, for example changing environment variables changed the way a program is loaded, and the timing changed. Probably cache behaviour. It's like, what can we really benchmark anymore?

When modern GHz parts have cache misses and need to access main memory, they cause such train wrecks that everybody seems to be moving or have already moved to neural network-based (perceptron *cough*) branch prediction.

So well, how do we scientifically or meaningfully benchmark these days, that is the question... (especially for folks in academia needing to justify benchmark results...)

Kein-Hong Man (esq.)
Selangor, Malaysia

 "Debugging is twice as hard as writing the code in the first place.
  Therefore, if you write the code as cleverly as possible, you are,
  by definition, not smart enough to debug it." -- Brian Kernighan
