Re: [Tinycc-devel] Huge swings in cache performance

On Sun, Jan 8, 2017 at 7:40 AM, David Mertens <address@hidden> wrote:

OK, done! And you were right, we only need to align on 64 bytes!

Follow-up question: since the alignment is only 64 bytes, would it be sensible to have all architectures align to this, including ARM?

David

On Sun, Jan 8, 2017 at 7:19 AM, David Mertens <address@hidden> wrote:
Thanks for the feedback, grischka.

On Sat, Jan 7, 2017 at 6:15 AM, grischka <address@hidden> wrote:
David Mertens wrote:

I just pushed a commit that sets up 512-byte alignment for x86-64
architectures. It only uses 512 bytes for x86-64; for all others it sticks
with the default of 16 bytes.

L1/L2 cache line size is 64 bytes on x86-like processors, no matter
whether run in 32 or 64 bit mode.

Yes, theoretically we should not need to align on anything more than 64 bytes. I chose 512 because I still got slowdowns for smaller alignments, including 256. But you mention...

However to make it work reliably the memory from malloc needs to be
aligned as well, like so:

offset = 0, mem = (addr_t)ptr;
+ mem += -(int)mem & SECTION_ALIGNMENT;

and the possibly additional amount needs to be requested in advance:

if (0 == mem)
- return offset;
+ return offset + SECTION_ALIGNMENT;

If I put this in place, then maybe the section alignment can be lessened. I'll have to check. FWIW, I've been doing this with my own TCC-calling code already and I've seen performance benefits. I don't see how the math would work to let me reduce SECTION_ALIGNMENT to 64 bytes, but I'll experiment and see what happens.
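
For reference, the arithmetic behind the two changes quoted above can be sketched as follows. This is only an illustration, not the tccrun.c code: it assumes SECTION_ALIGNMENT is used as a mask (alignment minus one, so 63 for 64-byte alignment), which is what the `&` in the quoted snippet suggests. RUN_ALIGN_MASK, align_padding, padded_size and alloc_aligned are hypothetical names, and uintptr_t is used where the quoted snippet casts to int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: a mask (alignment - 1), so 63 means 64-byte alignment. */
#define RUN_ALIGN_MASK 63u

/* Bytes to add to addr to reach the next aligned address
   (0 if addr is already aligned). Same idea as `-mem & mask`. */
static uintptr_t align_padding(uintptr_t addr, uintptr_t mask)
{
    assert(((mask + 1) & mask) == 0);   /* mask + 1 must be a power of two */
    return (0 - addr) & mask;
}

/* Worst-case request size: the payload plus the largest possible padding.
   This mirrors `return offset + SECTION_ALIGNMENT` in the quoted diff. */
static size_t padded_size(size_t payload, uintptr_t mask)
{
    return payload + (size_t)mask;
}

/* Allocate `payload` usable bytes starting at an aligned address.
   *raw receives the pointer that must later be passed to free(). */
static void *alloc_aligned(size_t payload, uintptr_t mask, void **raw)
{
    *raw = malloc(padded_size(payload, mask));
    if (*raw == NULL)
        return NULL;
    uintptr_t addr = (uintptr_t)*raw;
    return (void *)(addr + align_padding(addr, mask));
}
```

The padding is at most `mask` bytes, so requesting `payload + mask` always leaves room for the payload after the pointer has been rounded up. That holds for any power-of-two alignment, 64 bytes included.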

All of this is a black box to me. From what I've read, I don't think we'd need to worry about anything beyond 64 bytes, but I don't understand the underlying CPU behavior well enough to predict. The numbers I actually use will be based on real timing from testing on my machine or from feedback from others.

I ran the tests on my BeagleBone Black with
the original alignment and saw no performance issues,

Obviously ARM doesn't automatically clear the instruction cache which is
why we have the explicit __clear_cache() call for ARM further down in
set_pages_executable().
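
As a rough illustration of the cache flush mentioned above (not the actual set_pages_executable() code), the sketch below copies generated bytes into a buffer and then flushes the instruction cache for that range. It assumes a GCC or Clang compiler, where the `__builtin___clear_cache` builtin is available; install_code is a hypothetical helper name, and making the pages executable is a separate step not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy len bytes of generated code into dst, then flush the instruction
   cache for [dst, dst + len) so the CPU does not run stale instructions. */
static void install_code(char *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
#if defined(__GNUC__) || defined(__clang__)
    /* Typically compiles to nothing on x86 and to a real flush on ARM. */
    __builtin___clear_cache(dst, dst + len);
#endif
}
```

On x86 the hardware keeps the instruction cache coherent with such writes, which is consistent with the flush only being called explicitly for ARM.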

I am not sure if this quite follows the project practices. I define
SECTION_ALIGNMENT just prior to the function tcc_relocate_ex. If anybody
can think of a better place to put it, to keep useful things in one place,
please move it.

SECTION_ALIGNMENT seems too general as a name. tccelf.c is full of
section_alignments of various kinds. I'd suggest something prefixed
with RUN_xxxx to indicate that it's used only in that specific place.

Can do! I may not have time today, but I should be able to push a revised commit in the next couple of days.

David

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." -- Brian Kernighan


From:	David Mertens
Subject:	Re: [Tinycc-devel] Huge swings in cache performance
Date:	Mon, 9 Jan 2017 23:34:06 -0500