Re: [Tinycc-devel] Generating better i386 code

tinycc-devel

From:	grischka
Subject:	Re: [Tinycc-devel] Generating better i386 code
Date:	Wed, 23 Oct 2013 16:36:32 +0200
User-agent:	Thunderbird 2.0.0.23 (Windows/20090812)

Jason Hood wrote:

Greetings.

It's rather funny timing that a couple of topics have come up about
optimization and exe size, as I've just spent the past couple of weeks
improving the generated i386 code (most of which would also apply to
x86-64, but I've not done that).  Not sure what the protocol regarding
patches is, so for now you'll find it on pastebin, based on the 0.9.26
release (as one big diff, I'm afraid).

http://pastebin.com/vdQuhziY


I found some time to try this and I'm actually quite impressed by how
much better the generated code is, in both size and speed, for quite
moderate effort.  It's almost gcc -O0 level I guess.

I really like the jump optimization part.  If TCC had, say, a generic
infrastructure for moving compiled code around, it could also be
beneficial for the other targets I guess.

I'm slightly skeptical about the "register caching", i.e. how
correct it can be under all circumstances, given the hackishness
in the register handling that TCC already has.  The RESET_CACHE_IND
macro in various places is to me like a warning not to immediately
trust this. ;)  One symptom I happened to notice (on win32):

$ .\tcc.exe -O ..\tcc.c -DONE_SOURCE -DTCC_TARGET_PE -o tcc1.exe -bench
26168 idents, 65111 lines, 2198309 bytes, 0.188 s, 346335 lines/s, 11.7 MB/s
$ .\tcc1.exe -O ..\tcc.c -DONE_SOURCE -DTCC_TARGET_PE -o tcc2.exe -bench
26168 idents, 65111 lines, 2198309 bytes, 0.001 s, 65111000 lines/s, 2198.3 MB/s

Note the "0.001 s" part, something must be wrong there.
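
As a quick check on those figures: both rates are just the line and byte
counts divided by the elapsed time.  The sketch below recomputes them; the
helper names are made up for illustration.  A reported time of exactly
0.001 s would be consistent with a measured time at or near zero being
raised to a small minimum before printing, but that is an assumption about
the statistics code, not something verified here.

```c
#include <assert.h>

/* Hypothetical helpers that recompute the rates shown by -bench. */
static long lines_per_sec(long lines, double seconds)
{
    assert(seconds > 0.0);
    return (long)(lines / seconds);
}

static double mb_per_sec(long bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1000000.0;
}
```

With 65111 lines and 2198309 bytes, 0.188 s gives the first line's rates
and 0.001 s gives the second line's, so the printed rates follow from the
time value alone; it is the time that looks wrong.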

There is other stuff that is probably not worth it because the gain
is minimal, such as the split of chkstk.S.

Anyway, I think this experiment is definitely worth keeping around.
I'd like to encourage you to push this to a fork, using the "fork"
link at the top of --> http://repo.or.cz/w/tinycc.git

Ideally of course as a series of single patches for each feature. :)

Thanks,

--- grischka


BTW, it looks like the original source was tab-free, but some tabs have
snuck in, so you may want to (de)tabify the whole lot.  I've also made a
couple of spelling corrections.

First off, here are the results, building my tcc.exe (I'm on Windows, so
I'll also be using Intel syntax) with:

original tcc:                  225792 bytes
my tcc, without optimizations: 218624 bytes (3% reduction)
my tcc, with optimizations:    169472 bytes (25% reduction)

Build times are basically the same (using gcc, it was about 0.01s slower
to build with optimizations; using tcc, the optimized version actually
built the optimized version about 0.01s quicker than the original).

The non-optimized version is smaller, as I've made some changes
independent of the optimizations:

* 4- & 8-byte structs copy as int/long long (all targets);
* passing structs <= 8 bytes will be treated as int/long long;
* returning structs <= 8 bytes is done via (edx)eax (PE only);
* added ebx to the register list (increasing prolog by one, to save it);
* use xor r,r instead of mov r,0;
* use the eax-specific form of instructions;
* use movzx after setxx instead of mov r,0 before;
* use movsx for char & short casts, instead of shl+sar;
* use the byte form of sub esp (via enhanced gadd_sp() function);
* gcall_or_jmp() uses symbols and locals directly (like call [ebp-4]);
* use test r,r instead of cmp r,0;
* use inc/dec r instead of add/sub r,1 (or sub/add r,-1);
* use movzx r,br/bw instead of and r,0xff/0xffff;
* or r,-1 (should it occur) replaces its mov r,whatever;
* multiply by 0 (should it occur) becomes xor r,r (replacing its mov);
* multiply by -1 becomes neg r;
* make use of imul r,const;
* simplify the float (not) equal test (remove cmp/xor, use jpo/jpe);
* fix add in the assembler, to use the byte form when appropriate.
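
Two of the items above (xor r,r and the byte form of sub esp) come down to
picking a shorter encoding of the same operation.  A minimal sketch of that
choice follows; the function names and the buffer interface are hypothetical
and are not taken from the patch or from TCC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Append a 32-bit little-endian immediate; returns the new length. */
static size_t put_le32(uint8_t *buf, size_t n, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        buf[n++] = (uint8_t)(v >> (8 * i));
    return n;
}

/* "mov r,0" is 5 bytes (B8+r imm32); "xor r,r" is 2 bytes (31 /r). */
static size_t emit_zero_reg(uint8_t *buf, int r)
{
    assert(r >= 0 && r <= 7);
    buf[0] = 0x31;                            /* xor r/m32, r32 */
    buf[1] = (uint8_t)(0xC0 | (r << 3) | r);  /* mod=11, reg=r, rm=r */
    return 2;
}

/* "sub esp,imm": use the sign-extended imm8 form (83 /5) when it fits,
   otherwise the imm32 form (81 /5). */
static size_t emit_sub_esp(uint8_t *buf, int32_t imm)
{
    buf[1] = 0xEC;                            /* mod=11, /5, rm=esp */
    if (imm >= -128 && imm <= 127) {
        buf[0] = 0x83;
        buf[2] = (uint8_t)imm;
        return 3;
    }
    buf[0] = 0x81;
    return put_le32(buf, 2, (uint32_t)imm);
}
```

Note that xor r,r modifies the flags while mov r,0 does not, so the
substitution is only safe where the flags are not live.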

To support the optimizations, o() must only be used to start an
instruction.  I've added O<N> macros to combine <N> bytes into a single
int and function og() to combine o() and g().
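
To illustrate the idea, here is a minimal sketch of packing opcode bytes
into one int so that a single o() call marks the start of an instruction.
The o() loop emits the low byte first until no bytes remain; the O2/O3
macro definitions, the og() signature and the output buffer are assumptions
made for this example, not the definitions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t code[64];   /* stand-in for the output section */
static size_t ind;         /* current output offset */

/* Emit one byte. */
static void g(int c)
{
    assert(ind < sizeof code);
    code[ind++] = (uint8_t)c;
}

/* Emit the bytes packed in c, low byte first, until none remain. */
static void o(unsigned int c)
{
    while (c) {
        g((int)(c & 0xff));
        c >>= 8;
    }
}

/* Hypothetical packing macros: the first argument is emitted first. */
#define O2(a, b)    ((unsigned)(a) | ((unsigned)(b) << 8))
#define O3(a, b, c) (O2(a, b) | ((unsigned)(c) << 16))

/* Hypothetical og(): an opcode via o(), then one operand byte via g(). */
static void og(unsigned int op, int c)
{
    o(op);
    g(c);
}
```

Because the loop stops at zero, a trailing zero byte cannot be passed
through o(); an operand byte that may be zero has to go through g(), which
is one reason a combined helper is convenient.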

Optimizations are enabled by using -O, but I neglected to add them to
the help:

    -Of - functions
    -Oj - jumps
    -Om - multiplications and pointer division
    -Or - registers
    -O -O2 -Ox - all optimizations
    -O1 - all but -Oj (i.e. -Ofmr)
    -Os - all but -Om (i.e. -Ofjr; also removes PE function alignment)
    -O0 - no optimizations (default)
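
The table above can be read as a set of flag bits.  A minimal sketch of
that mapping follows; the enum names and the parse function are
hypothetical, and treating letters as freely combinable (as "-Ofmr"
suggests) is an assumption.

```c
#include <assert.h>
#include <string.h>

enum {
    OPT_FUNC = 1, OPT_JUMP = 2, OPT_MUL = 4, OPT_REG = 8,
    OPT_ALL = OPT_FUNC | OPT_JUMP | OPT_MUL | OPT_REG,
    OPT_NO_ALIGN = 16          /* -Os also drops PE function alignment */
};

/* arg is the text following "-O".  Returns a bitmask, or -1 if an
   unknown letter is seen. */
static int parse_opt_flags(const char *arg)
{
    int flags = 0;

    assert(arg != NULL);
    if (*arg == '\0' || !strcmp(arg, "2") || !strcmp(arg, "x"))
        return OPT_ALL;
    if (!strcmp(arg, "0"))
        return 0;
    if (!strcmp(arg, "1"))
        return OPT_ALL & ~OPT_JUMP;
    if (!strcmp(arg, "s"))
        return (OPT_ALL & ~OPT_MUL) | OPT_NO_ALIGN;
    for (; *arg; arg++) {
        switch (*arg) {
        case 'f': flags |= OPT_FUNC; break;
        case 'j': flags |= OPT_JUMP; break;
        case 'm': flags |= OPT_MUL;  break;
        case 'r': flags |= OPT_REG;  break;
        default:  return -1;
        }
    }
    return flags;
}
```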

-Of will minimize the prolog and epilog.  Space for the full prolog is
skipped over as usual; then, when the function is finished, only the part
of the prolog that is needed is written, the function body is moved back
(adjusting relocations to suit) and the needed epilog is written.  As
suggested above, I've also aligned PE functions to 16 bytes - this always
happens, unless -Os is used (maybe it's not needed, but I'm so used to
seeing it in disassembly listings, it just looks wrong without it :)).
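
A minimal sketch of the move-back step, under stated assumptions: the
function layout, the flat array of relocation offsets and all names below
are invented for illustration and do not reflect TCC's actual section or
relocation structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round an offset up to the next multiple of 16. */
static size_t align16(size_t off)
{
    return (off + 15) & ~(size_t)15;
}

/*
 * code[func_start, func_end) holds `reserved` bytes of prolog space
 * followed by the function body.  Write the `needed`-byte prolog, slide
 * the body back over the unused space, and shift every relocation offset
 * that points into the body.  Returns the new end of the function.
 */
static size_t compact_prolog(uint8_t *code, size_t func_start,
                             size_t func_end, size_t reserved,
                             const uint8_t *prolog, size_t needed,
                             size_t *reloc_offsets, size_t nrelocs)
{
    size_t body = func_start + reserved;
    size_t shift = reserved - needed;

    assert(needed <= reserved && body <= func_end);
    memcpy(code + func_start, prolog, needed);
    memmove(code + func_start + needed, code + body, func_end - body);
    for (size_t i = 0; i < nrelocs; i++)
        if (reloc_offsets[i] >= body && reloc_offsets[i] < func_end)
            reloc_offsets[i] -= shift;
    return func_end - shift;
}
```

Relative jumps inside the body stay valid because both ends move by the
same amount; only references that cross the function boundary need
adjusting, which is what the relocation loop stands in for here.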

-Oj will optimize various usages of jump.  Jumps to jmp will be replaced
with the destination of the jmp; resulting skipped jmps will be removed.
Common code before a jmp and its destination (up to eight instructions,
the reason for the o() restriction) will result in removal of the code
before the jmp, changing the jmp destination.  Casting to boolean will
use setxx/movzx or stc/sbb/inc when appropriate.  Conditional jumps
over a jmp will invert the condition and change the destination,
removing the jmp.  Jumps to the epilog will be replaced with the epilog
itself (if it's only one or two bytes with -Os).  Appropriate near jumps
will be converted to short.
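
Three of these jump rewrites can be sketched on a toy instruction list: a
jump to a jmp takes over that jmp's destination, a conditional jump over a
jmp is inverted, and a displacement is checked for the short form.  The
struct, the index-based targets and the function names are assumptions
made for this example; the patch itself works on emitted machine code.

```c
#include <assert.h>

enum { OP_OTHER, OP_NOP, OP_JMP, OP_JCC };

struct insn {
    int op;
    int cc;      /* x86 condition code (0..15), used by OP_JCC */
    int target;  /* index of the destination instruction */
};

/* x86 condition codes pair up on bit 0 (e.g. 4 = e, 5 = ne). */
static int invert_cc(int cc) { return cc ^ 1; }

/* Follow chains of unconditional jumps; the hop limit guards cycles. */
static int final_target(const struct insn *code, int n, int t)
{
    int hops = 0;

    assert(t >= 0);
    while (t < n && code[t].op == OP_JMP && hops++ < n)
        t = code[t].target;
    return t;
}

/* Replace each jump's destination with the end of its jmp chain. */
static void thread_jumps(struct insn *code, int n)
{
    for (int i = 0; i < n; i++)
        if (code[i].op == OP_JMP || code[i].op == OP_JCC)
            code[i].target = final_target(code, n, code[i].target);
}

/* "jcc L1; jmp L2; L1:" becomes "j!cc L2".  Only safe when nothing else
   jumps to the jmp, e.g. after thread_jumps().  Returns 1 if rewritten. */
static int invert_over_jmp(struct insn *code, int n, int i)
{
    if (i + 1 >= n || code[i].op != OP_JCC || code[i + 1].op != OP_JMP
        || code[i].target != i + 2)
        return 0;
    code[i].cc = invert_cc(code[i].cc);
    code[i].target = code[i + 1].target;
    code[i + 1].op = OP_NOP;   /* keeps indices stable in this toy model */
    return 1;
}

/* A near jump can become short if its displacement fits in 8 bits. */
static int fits_short(long disp) { return disp >= -128 && disp <= 127; }
```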

-Om will use lea (possibly followed by add, shl or another lea) to do
appropriate constant multiplication.  Pointer division is done by
reciprocal multiplication (which should probably also be used for normal
division, don't know why I didn't).
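
Both ideas can be modelled in plain C.  The sketch below is an assumption
about how such rewrites typically work, not the patch's code: lea is
modelled as base + index*scale, and the pointer division uses the fact
that a pointer difference is an exact multiple of the element size, so
multiplying by the inverse of the odd part modulo 2^32 yields the
quotient.  Whether the patch uses this exact-division variant of
reciprocal multiplication is not stated in the message.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "lea dst,[base + index*scale]", scale in {1, 2, 4, 8}. */
static uint32_t lea(uint32_t base, uint32_t index, uint32_t scale)
{
    return base + index * scale;
}

/* One lea gives x*3, x*5 or x*9. */
static uint32_t mul5(uint32_t x) { return lea(x, x, 4); }

/* lea followed by another lea: x*45 = (x*9)*5. */
static uint32_t mul45(uint32_t x)
{
    uint32_t t = lea(x, x, 8);
    return lea(t, t, 4);
}

/* lea followed by add r,r: x*10 = (x*5)*2. */
static uint32_t mul10(uint32_t x)
{
    uint32_t t = lea(x, x, 4);
    return t + t;
}

/* Inverse of an odd d modulo 2^32 by Newton iteration. */
static uint32_t inverse_mod_2_32(uint32_t d)
{
    uint32_t x = d;            /* d*d == 1 (mod 8): 3 correct bits */

    assert(d & 1);
    for (int i = 0; i < 4; i++)
        x *= 2 - d * x;        /* each step doubles the correct bits */
    return x;
}

/* diff / size, valid only when diff is an exact multiple of size. */
static int32_t exact_div(int32_t diff, uint32_t size)
{
    int shift = 0;

    assert(size != 0);
    while ((size & 1) == 0) {  /* split size into odd part * 2^shift */
        size >>= 1;
        shift++;
    }
    int64_t reduced = (int64_t)diff / ((int64_t)1 << shift);
    return (int32_t)((uint32_t)reduced * inverse_mod_2_32(size));
}
```

For an element size of 12, for example, this becomes a shift by 2 and a
multiply by 0xAAAAAAAB, with no div instruction.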

-Or improves register usage.  Previous values are remembered (this would
ideally be done as part of tccgen).  Appropriate function arguments are
pushed directly.  A load const/store pair stores the const directly.
Suitable adds are turned into a displacement (greatly improving struct
and long long access).
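
A minimal sketch of what remembering previous values can look like, and of
where it has to be invalidated.  The struct, the function names and the
invalidation rules are assumptions for illustration; they are not the
patch's data structures.

```c
#include <assert.h>
#include <stdint.h>

enum { NREGS = 4 };                  /* e.g. eax, ecx, edx, ebx */
enum { HOLDS_NOTHING, HOLDS_LOCAL };

struct reg_cache {
    int kind[NREGS];
    int32_t offset[NREGS];           /* ebp-relative offset of the local */
};

/* Forget everything.  Needed at labels, calls and stores through
   pointers, where the tracked state may no longer hold. */
static void cache_reset(struct reg_cache *c)
{
    for (int r = 0; r < NREGS; r++)
        c->kind[r] = HOLDS_NOTHING;
}

/* Returns 1 if a load would have to be emitted, 0 if register r is
   already known to hold the local at `offset`. */
static int load_local(struct reg_cache *c, int r, int32_t offset)
{
    assert(r >= 0 && r < NREGS);
    if (c->kind[r] == HOLDS_LOCAL && c->offset[r] == offset)
        return 0;
    /* ... emit "mov r,[ebp+offset]" here ... */
    c->kind[r] = HOLDS_LOCAL;
    c->offset[r] = offset;
    return 1;
}

/* Storing r to a local makes other cached copies of that local stale;
   r itself now mirrors the local. */
static void store_local(struct reg_cache *c, int r, int32_t offset)
{
    assert(r >= 0 && r < NREGS);
    for (int i = 0; i < NREGS; i++)
        if (c->kind[i] == HOLDS_LOCAL && c->offset[i] == offset)
            c->kind[i] = HOLDS_NOTHING;
    /* ... emit "mov [ebp+offset],r" here ... */
    c->kind[r] = HOLDS_LOCAL;
    c->offset[r] = offset;
}
```

The correctness of such a cache depends entirely on resetting it at every
point where control flow merges or memory may change behind its back;
missing one reset silently reuses a stale value.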

A couple of things I didn't do were combining arithmetic operators (even
though register displacement combines adds) and removing unused locals
(remembering register values means a value written to a temporary
probably won't be read back from it).  Nor did I do any of this for
x86-64 (in particular, returning small structs should be done, as that's
expected by Windows).

In addition, I've tweaked the Win32 build: build-tcc.bat will determine
the target based on gcc itself (although it will need modification if
you still want to support command.com).  I separated lib/chkstk.S into
lib/seh.S (assuming only 32-bit) and lib/sjlj.S (assuming only 64-bit);
however, I didn't update the configure process, only build-tcc.bat.

--
Jason.
