[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [avr-libc-dev] Faster integer division
From: |
Ambroz Bizjak |
Subject: |
Re: [avr-libc-dev] Faster integer division |
Date: |
Sat, 8 Jun 2013 18:25:29 +0200 |
Oh, sorry, a small mistake. Compacting it to a *single* loop while
still being faster than gcc may be harder than it looks. This is
because the part where a bit of the result is written (a single "ori"
in my code) may explode into a full 32-bit operation (e.g. increment
"q" if bit needs to be set, then shift left the whole "q"). But
perhaps it could be done faster by involving memory access (keeping a
pointer to the current byte in "q").
On Sat, Jun 8, 2013 at 6:15 PM, Ambroz Bizjak <address@hidden> wrote:
> I've done some further optimization, now it's even faster and uses 4
> registers less (but needs >=r16 for the result due to the use of
> immediate operands). Even better, there are now only 4 kinds of
> iterations (4 macros). The code is here http://ideone.com/EGWcuI and
> pasted below.
>
> I've calculated that my division takes 384 instructions in the worst
> case. I've also disassembled a compiled program where gcc did the
> division (-O4, the attached program, just changed the main func), see
> here http://ideone.com/jZNxkF . I see that the call to __udivmodsi4
> really hasn't been inlined, but the extra instructions are almost
> negligible. In my calculations the gcc division takes 653 instructions
> in the worst case (may be a bit off), excluding the call entry and
> exit. Put simply, my code takes 58% time compared to gcc's division.
> When I said 2/3 previously, I was referring to relative times, not
> probabilities, but this was in a benchmark so it appeared slower due
> to the testing code.
>
> The sources of speedup in my algorithm seem to be:
> 1. Algorithm on the high level. In each iteration, gcc does four
> 32-bit operations (add, compare, subtract, add) in the worst case. I
> only do three (shift, compare, subtract).
> 2. Optimization of groups of 8 iterations. This allows removing some
> instructions where some operands are known to be zero.
> 3. Unrolling individual iterations.
>
> About code size, yes, the algorithm I attached here is indeed much
> larger. However, I believe we can compact it into four loops and still
> be faster than gcc (points 1 and 2 above would still apply). I think
> the result would be about twice the size of gcc algorithm. Even
> compacting it to one loop should still be faster than gcc due to point
> 1.
>
> I don't really know when a faster but larger algorithm should be used
> by gcc, this is part of the reason why I'm posting here. The -Os flag
> is probably one thing to look for, but we need to consider than people
> may not want a giant division code even at other optimization levels.
> Perhaps more specific flags could be used to control the division
> algorithm? On the other hand, if the large algorithm goes into an
> inline function in avr-libc, how do we make sure people can find it?
>
>
> #include <stdint.h>
>
> #define DIVIDE_ITER_0_7(i) \
> " lsl %D[n]\n" \
> " rol __tmp_reg__\n" \
> " cp __tmp_reg__,%A[d]\n" \
> " cpc __zero_reg__,%B[d]\n" \
> " cpc __zero_reg__,%C[d]\n" \
> " cpc __zero_reg__,%D[d]\n" \
> " brcs zero_bit_" #i "__%=\n" \
> " sub __tmp_reg__,%A[d]\n" \
> " ori %D[q],1<<(7-" #i ")\n" \
> "zero_bit_" #i "__%=:\n"
>
> #define DIVIDE_ITER_8_15(i) \
> " lsl %C[n]\n" \
> " rol __tmp_reg__\n" \
> " rol %D[n]\n" \
> " cp __tmp_reg__,%A[d]\n" \
> " cpc %D[n],%B[d]\n" \
> " cpc __zero_reg__,%C[d]\n" \
> " cpc __zero_reg__,%D[d]\n" \
> " brcs zero_bit_" #i "__%=\n" \
> " sub __tmp_reg__,%A[d]\n" \
> " sbc %D[n],%B[d]\n" \
> " ori %C[q],1<<(15-" #i ")\n" \
> "zero_bit_" #i "__%=:\n"
>
> #define DIVIDE_ITER_16_23(i) \
> " lsl %B[n]\n" \
> " rol __tmp_reg__\n" \
> " rol %D[n]\n" \
> " rol %C[n]\n" \
> " cp __tmp_reg__,%A[d]\n" \
> " cpc %D[n],%B[d]\n" \
> " cpc %C[n],%C[d]\n" \
> " cpc __zero_reg__,%D[d]\n" \
> " brcs zero_bit_" #i "__%=\n" \
> " sub __tmp_reg__,%A[d]\n" \
> " sbc %D[n],%B[d]\n" \
> " sbc %C[n],%C[d]\n" \
> " ori %B[q],1<<(23-" #i ")\n" \
> "zero_bit_" #i "__%=:\n"
>
> #define DIVIDE_ITER_24_30(i) \
> " lsl %A[n]\n" \
> " rol __tmp_reg__\n" \
> " rol %D[n]\n" \
> " rol %C[n]\n" \
> " rol %B[n]\n" \
> " cp __tmp_reg__,%A[d]\n" \
> " cpc %D[n],%B[d]\n" \
> " cpc %C[n],%C[d]\n" \
> " cpc %B[n],%D[d]\n" \
> " brcs zero_bit_" #i "__%=\n" \
> " sub __tmp_reg__,%A[d]\n" \
> " sbc %D[n],%B[d]\n" \
> " sbc %C[n],%C[d]\n" \
> " sbc %B[n],%D[d]\n" \
> " ori %A[q],1<<(31-" #i ")\n" \
> "zero_bit_" #i "__%=:\n"
>
> static inline uint32_t divide (uint32_t n, uint32_t d)
> {
> uint32_t q = 0;
>
> asm(
> "clr __tmp_reg__\n"
> DIVIDE_ITER_0_7(0)
> DIVIDE_ITER_0_7(1)
> DIVIDE_ITER_0_7(2)
> DIVIDE_ITER_0_7(3)
> DIVIDE_ITER_0_7(4)
> DIVIDE_ITER_0_7(5)
> DIVIDE_ITER_0_7(6)
> DIVIDE_ITER_0_7(7)
> DIVIDE_ITER_8_15(8)
> DIVIDE_ITER_8_15(9)
> DIVIDE_ITER_8_15(10)
> DIVIDE_ITER_8_15(11)
> DIVIDE_ITER_8_15(12)
> DIVIDE_ITER_8_15(13)
> DIVIDE_ITER_8_15(14)
> DIVIDE_ITER_8_15(15)
> DIVIDE_ITER_16_23(16)
> DIVIDE_ITER_16_23(17)
> DIVIDE_ITER_16_23(18)
> DIVIDE_ITER_16_23(19)
> DIVIDE_ITER_16_23(20)
> DIVIDE_ITER_16_23(21)
> DIVIDE_ITER_16_23(22)
> DIVIDE_ITER_16_23(23)
> DIVIDE_ITER_24_30(24)
> DIVIDE_ITER_24_30(25)
> DIVIDE_ITER_24_30(26)
> DIVIDE_ITER_24_30(27)
> DIVIDE_ITER_24_30(28)
> DIVIDE_ITER_24_30(29)
> DIVIDE_ITER_24_30(30)
> " lsl %A[n]\n"
> " rol __tmp_reg__\n"
> " rol %D[n]\n"
> " rol %C[n]\n"
> " rol %B[n]\n"
> " cp __tmp_reg__,%A[d]\n"
> " cpc %D[n],%B[d]\n"
> " cpc %C[n],%C[d]\n"
> " cpc %B[n],%D[d]\n"
> " sbci %A[q],-1\n"
>
> : [q] "=&a" (q),
> [n] "=&r" (n)
> : "[q]" (q),
> "[n]" (n),
> [d] "r" (d)
> );
>
> return q;
> }
>
> // instructions:
> // 4 (init q) + 1 + 8 * 9 + 8 * 11 + 8 * 13 + 7 * 15 + 10 = 384
>
> volatile uint32_t test1;
> volatile uint32_t test2;
> volatile uint32_t test3;
>
> int main ()
> {
> test3 = divide(test1, test2);
> //test3 = test1 / test2;
> }
>
> On Sat, Jun 8, 2013 at 4:31 PM, Weddington, Eric
> <address@hidden> wrote:
>>
>>
>>> -----Original Message-----
>>> From: address@hidden
>>> [mailto:address@hidden On
>>> Behalf Of Ambroz Bizjak
>>> Sent: Friday, June 07, 2013 8:34 PM
>>> To: address@hidden
>>> Subject: [avr-libc-dev] Faster integer division
>>>
>>> Hi!
>>>
>>> I've found 32bit integer division in gcc too slow, and I managed to
>>> implement division in asm that's faster than what gcc 4.8 produces at
>>> -O4, about 2/3 the time (but I'm not sure if any of this is due to
>>> inlining). The algorithm is restoring division but unrolled and
>>> heavily optimized. Could this get in gcc or avr-libc, wherever the
>>> right place is?
>>>
>>
>> Hi Ambroz,
>>
>> What's difficult is to balance different needs of different users. For the
>> most part, and there are always exceptions to this, most of the AVR GCC
>> users are interested in code size, rather than speed. They would rather see
>> the smallest way to do division, even if it is slower. But I also recognize
>> that there are definitely times where speed is the most desired.
>>
>> So given that, where do you think it best for your algorithm to generated?
>> Should it be generated at a specific optimization level in gcc? Should it be
>> a specialized inline-assembly function call in avr-libc that is specifically
>> called by the user?
>>
>> You said that it's faster than what gcc produces about 2/3 of the time, but
>> you're not sure if that is due to inlining. Can you do some further testing
>> to see if it's due to inlining? Or whether it's because your algorithm is
>> better? This is important to know, to see if it's worth the time and effort
>> to get it in gcc or avr-libc.
>>
>> Thanks,
>> Eric