[avr-gcc-list] Optimization Hiccup? (Please CC me, I'm not subscribed)
From: Thomas Watson
Subject: [avr-gcc-list] Optimization Hiccup? (Please CC me, I'm not subscribed)
Date: Sat, 18 Oct 2014 12:08:23 -0500

GCC is generating substantially less optimized code than it does if I help it
along a bit. The code is at http://pastie.org/private/awus9tkgdwbzpdwjgrbw and the
assembler output is at http://pastie.org/private/s4liesmrd9f6fi2wahe0vg . The top
block is the version that copies x to a local, cx; the bottom block is the
version that modifies the argument directly. The full compiler invocation is:

  avr-gcc -c -I. -mmcu=atmega328p -std=gnu99 -Os -Wall -DF_CPU=16000000 -ffunction-sections -fdata-sections -Wl,--gc-sections -o tft.o tft.c

I thought that copying x to a local might be wasting a bit of space. However, if
I modify x directly rather than copying it to a local first, the compiler
decides to keep x on the stack instead of in a register, which takes us on a
journey involving unnecessary stack access, silly re-copying, and far too much
code.
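
For readers without access to the pastie links, the two variants being compared presumably look something like the sketch below. This is a reconstruction under assumptions, not the original tft.c: the names draw_str_local and draw_str_arg, the y parameter, the argument order after x, and the FONT_WIDTH value are all hypothetical, and tft_draw_chr is replaced by a host-side stub that only records its arguments so the two variants can be compared on a PC.

```c
#include <assert.h>
#include <stdint.h>

#define FONT_WIDTH 6    /* assumed value; the real one is defined in the project */
#define MAX_CALLS  32   /* capacity of the call log used by the stub below */

/* Host-side stand-in for the real tft_draw_chr(): it only records the x
 * coordinate and character of each call. */
static uint8_t log_x[MAX_CALLS];
static char    log_c[MAX_CALLS];
static int     n_calls;

static void tft_draw_chr(uint8_t x, uint8_t y, char c)
{
    (void)y;
    assert(n_calls < MAX_CALLS);
    log_x[n_calls] = x;
    log_c[n_calls] = c;
    n_calls++;
}

/* Variant 1 (hypothetical name): copy the argument to a local, cx,
 * and advance the local. */
static void draw_str_local(uint8_t x, uint8_t y, const char *s)
{
    uint8_t cx = x;
    while (*s) {
        tft_draw_chr(cx, y, *s++);
        cx += FONT_WIDTH;
    }
}

/* Variant 2 (hypothetical name): advance the argument itself.
 * Semantically identical to variant 1. */
static void draw_str_arg(uint8_t x, uint8_t y, const char *s)
{
    while (*s) {
        tft_draw_chr(x, y, *s++);
        x += FONT_WIDTH;
    }
}
```

Both functions make the same sequence of calls for the same input, which is why one would expect the same generated code for both.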
I figured that if I didn't copy it to a local, I would save code space (as I
do in many other situations), but something is going wrong here. With cx, a
callee-saved register, r17, is reserved for it (line 33) and x is copied to cx
(56). When we call tft_draw_chr, it expects the 'x' parameter in r24, so we copy
it there (68) before the call. Since r24 might be clobbered by tft_draw_chr, we
can't keep x in it across the call, which is why r17 is needed at all. Anyway,
once we return, we add FONT_WIDTH to r17 (73) in preparation for the next
iteration of the loop. In addition, since we can use Y for the string pointer,
we do not need to worry about it being eaten by tft_draw_chr and it is only
pushed and popped in the prologue and epilogue. All well and dandy, right? In
theory, since x is never touched before or after cx is assigned, cx is
essentially an alias for x. I would therefore expect exactly the same code (or
perhaps better code, if the architecture and calling convention allowed it)
to be generated when x is modified directly.
However, that is not what happens. x is passed into the function in r24, but we
want to modify it and have it persist through the loop. Because r24 could be
clobbered by a function call, we obviously must move it elsewhere.
Unfortunately, the compiler decides on r25 (170), a register subject to the
same limitation. As before, we must move our temporary register to r24 (186) in
order to call tft_draw_chr, according to calling convention. However, since r25
could be mangled by the call, we have to save it (187) before the function
call. The compiler chooses the stack, as opposed to a callee-saved register,
which has rather broad implications. First, we must reserve stack space (159)
and copy the stack pointer to Y (162), chosen presumably because Y is also
callee-saved. But since Y held the string pointer before, we must now store
the string pointer elsewhere, and r8/r9 are chosen. Because these are
callee-saved registers, we must perform an additional two pushes and pops to
save them at the beginning and end of the function. We also have to move the
string pointer there (174).
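
The reasoning above depends on which registers avr-gcc treats as preserved across a call. As a small aid, here is a host-side helper that encodes those classes as I understand the avr-gcc calling convention; the function name survives_call is made up for illustration and is not part of any toolchain.

```c
#include <assert.h>
#include <stdbool.h>

/* avr-gcc register classes, as relied on in the discussion:
 *   call-saved (callee-saved): r2..r17 and r28, r29 (the Y pointer)
 *   call-used (caller-saved):  r18..r27 and r30, r31 (the Z pointer)
 * r0 (scratch) and r1 (fixed zero register) are special, so this helper
 * simply reports them as not available for keeping a value across a call. */
static bool survives_call(int reg)
{
    assert(reg >= 0 && reg <= 31);
    return (reg >= 2 && reg <= 17) || reg == 28 || reg == 29;
}
```

With this, r17, r8/r9 and Y (r28/r29) survive a call to tft_draw_chr, while r24, r25 and Z (r30/r31) do not, which matches the register choices described above.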
Okay. So we've returned from tft_draw_chr (192). We must pull x off the stack
into r25 and add FONT_WIDTH to it (193) in preparation for the next iteration.
We could have just as easily not used r25 and continued to use r24, using the
stack to save it as before (but there is a better way). We know r24 won't be
touched until we call tft_draw_chr. Now that that's over, we have to fetch the
next character in the string, but because Y isn't our string pointer, this
doesn't go smoothly. We can't load data if the address isn't in X, Y, or Z.
Since r8/9 is none of those, we have to copy it to Z (198) to retrieve the next
character and do a post-increment on Z to index the next character (199). Since
Z isn't callee-saved, it might be mangled by a function call, so we must store
it back to r8/9 (200). Finally, we can test for the next iteration.
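
To make the detour easier to follow, here is a rough C-level paraphrase of the sequence just described. It is not the compiler's output: the variable names only mirror the registers mentioned in the text, the function name draw_str_as_compiled is invented, FONT_WIDTH is an assumed value, tft_draw_chr is a host-side stub, and the way the first character is fetched before the loop is a guess.

```c
#include <assert.h>
#include <stdint.h>

#define FONT_WIDTH 6    /* assumed value */
#define MAX_CALLS  32   /* capacity of the call log used by the stub below */

/* Host-side stand-in for tft_draw_chr(): records the x of each call. */
static uint8_t log_x[MAX_CALLS];
static int     n_calls;

static void tft_draw_chr(uint8_t x, uint8_t y, char c)
{
    (void)y;
    (void)c;
    assert(n_calls < MAX_CALLS);
    log_x[n_calls++] = x;
}

/* Paraphrase of the second listing; names mirror registers but are
 * ordinary C variables here. */
static void draw_str_as_compiled(uint8_t x, uint8_t y, const char *s)
{
    volatile uint8_t stack_slot;   /* frame byte addressed through Y */
    const char *r8_r9 = s;         /* string pointer, displaced from Y */
    uint8_t r25 = x;               /* x moved to another call-used register */
    char c = *r8_r9++;             /* assumed: first character fetched up front */

    while (c) {
        uint8_t r24 = r25;         /* argument register for the call */
        stack_slot = r25;          /* spill: r25 would not survive the call */
        tft_draw_chr(r24, y, c);
        r25 = stack_slot;          /* reload x from the stack ... */
        r25 += FONT_WIDTH;         /* ... and advance it */

        const char *z = r8_r9;     /* copy the pointer into Z for the load */
        c = *z++;                  /* load with post-increment */
        r8_r9 = z;                 /* copy Z back so it survives the next call */
    }
}
```

Every step marked as a spill, reload or pointer copy is work that the cx version avoids by keeping x in r17 and the string pointer in Y.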
I'm not sure why the second version doesn't end up the same as the first.
Choosing another caller-saved register for the temporary is an extremely
poor choice. If for some reason that were mandatory, we could (at least in this
code) still use r24 to avoid having to copy between it and r25. However,
instead of using a register like r9 to hold the temporary, the compiler uses one
that isn't callee-saved, which means we still end up using r9 (and r8 too!) in
our quest to needlessly use the stack.
This is probably way too verbose but there must be some useful information in
there somewhere. Please take a look. Also, CC me on any replies because I'm not
subscribed to the list yet.
Thank you all,
Thomas