Re: conv2 performance
From: Jaroslav Hajek
Subject: Re: conv2 performance
Date: Wed, 3 Mar 2010 10:49:50 +0100
On Tue, Mar 2, 2010 at 11:26 AM, Jaroslav Hajek <address@hidden> wrote:
> On Tue, Mar 2, 2010 at 10:56 AM, Jaroslav Hajek <address@hidden> wrote:
>> On Mon, Mar 1, 2010 at 9:46 AM, Jaroslav Hajek <address@hidden> wrote:
>>> On Sun, Feb 28, 2010 at 4:53 PM, John W. Eaton <address@hidden> wrote:
>>>> I received a complaint that Octave's conv2 function is about 5 times
>>>> slower than Matlab's. By making a few simple changes, I was able to
>>>> improve the performance somewhat, but not by a factor of 5. My changes
>>>> are here:
>>>>
>>>> http://hg.savannah.gnu.org/hgweb/octave/rev/dc8637fd7a76
>>>>
>>>> On my system, I see the following improvement after these changes:
>>>>
>>>> > n = 1024; m = 64; x = rand(n,n); k = rand(m,m); tic; y = conv2(x,k); toc
>>>>
>>>> old: Elapsed time is 26.2576 seconds.
>>>> new: Elapsed time is 13.6104 seconds.
>>>>
>>>> and
>>>>
>>>> > n = 1024; m = 512; x = rand(n,1); k = rand(m,1); m = rand (m,n);
>>>> > tic; y = conv2(x,k,m); toc
>>>>
>>>> old: Elapsed time is 8.53273 seconds.
>>>> new: Elapsed time is 3.27103 seconds.
>>>>
>>>> Maybe there is still room for improvement here. I would happily use a
>>>> free software library with a GPL-compatible license to implement this
>>>> function, but I don't know whether one is available.
>>>>
>>>> jwe
>>>>
>>>>
>>>
>>>
>>> I have some ideas for speeding up the stock conv2. Using an
>>> existing library would of course be better, but I don't know of a
>>> suitable one. Unless anyone can come up with an existing solution,
>>> I'll try to convert the ideas to code eventually.
>>>
>>>
>>
>> I think John Swensen has a point in that FFT-based convolution is
>> asymptotically better. 4th-order loops can outperform it only for
>> small kernels (the meaning of "small" depending on implementation).
>> Hence I think any loop-based code should primarily target small
>> kernels.
>>
>> I committed an initial implementation of convolutions in liboctave:
>> http://hg.savannah.gnu.org/hgweb/octave/rev/978f5c94b11f
>>
>> summary:
>> This instantiates specialized code for all kernels up to 8x8 in
>> size. Larger kernels are split into 8x8 blocks.
>> The 4th-order loop is arranged so that the innermost 2 loops have fixed
>> size, hence can be completely unrolled.
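To make the specialization idea concrete, here is a minimal pure-Python sketch of the 4th-order "full" convolution loop (names and structure are illustrative only; the real liboctave code is C++ with the kernel dimensions fixed at compile time so the two innermost loops unroll):

```python
# Hypothetical sketch, not Octave's actual code.

def conv2_full(a, b):
    """Full 2-D convolution of matrix a with kernel b (lists of rows)."""
    ma, na = len(a), len(a[0])
    mb, nb = len(b), len(b[0])
    # The full result has size (ma+mb-1) x (na+nb-1).
    c = [[0.0] * (na + nb - 1) for _ in range(ma + mb - 1)]
    # Outer two loops run over the data, inner two over the kernel.
    # When mb and nb are small compile-time constants, the inner pair
    # can be completely unrolled -- the trick the patch exploits.
    for i in range(ma):
        for j in range(na):
            for k in range(mb):
                for l in range(nb):
                    c[i + k][j + l] += a[i][j] * b[k][l]
    return c
```

The work is exactly MA*NA*MB*NB multiply-adds, which is why a fixed, unrolled kernel size matters most when the kernel is small.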
>>
>> I used the attached benchmark to measure the time for convolution of a
>> 1000x1000 matrix with all kernel sizes up to 10x10.
>> Times are averaged over 5 runs. Results on my computer (Core 2 Duo @
>> 2.83 GHz, g++ -O3 -march=native) are attached.
>>
>> octave:1> load all_data
>>
>> with a recent tip:
>>
>> octave:2> tocs_oct0
>> tocs_oct0 =
>>
>>   0.012844   0.013554   0.016711   0.022400   0.023357   0.026729   0.030247   0.033534   0.036832   0.040352
>>   0.021440   0.029549   0.039860   0.042910   0.046509   0.043920   0.050186   0.055437   0.060768   0.064602
>>   0.014037   0.022069   0.032884   0.041644   0.046854   0.052915   0.061239   0.067863   0.073653   0.079161
>>   0.015658   0.030140   0.034210   0.045213   0.052253   0.060553   0.069365   0.078562   0.086291   0.093605
>>   0.019656   0.028900   0.045464   0.047990   0.058807   0.066490   0.079297   0.088274   0.097667   0.108786
>>   0.022072   0.033146   0.047572   0.060943   0.070684   0.081349   0.092902   0.100911   0.115715   0.127048
>>   0.021898   0.030693   0.043972   0.058956   0.077249   0.084425   0.100613   0.111503   0.126935   0.141250
>>   0.021071   0.037299   0.050950   0.070723   0.081438   0.101516   0.114788   0.125197   0.141521   0.158815
>>   0.024811   0.036528   0.052532   0.073733   0.095367   0.101299   0.117457   0.132726   0.147585   0.165222
>>   0.024415   0.045528   0.062038   0.077777   0.094293   0.110504   0.127776   0.146341   0.163502   0.178777
>>
>> with the new patch:
>>
>> octave:3> tocs_oct1
>> tocs_oct1 =
>>
>>   0.0074930   0.0093061   0.0101123   0.0059468   0.0067618   0.0073412   0.0095744   0.0111704   0.0155106   0.0155500
>>   0.0056431   0.0061924   0.0084282   0.0121774   0.0137026   0.0166219   0.0200053   0.0235966   0.0279384   0.0284500
>>   0.0060744   0.0084982   0.0126268   0.0165213   0.0221927   0.0273326   0.0332757   0.0390677   0.0434884   0.0455091
>>   0.0063169   0.0110016   0.0172338   0.0247400   0.0311880   0.0387136   0.0465136   0.0544024   0.0592641   0.0642496
>>   0.0066828   0.0141782   0.0221559   0.0309312   0.0406902   0.0501870   0.0602919   0.0492872   0.0548416   0.0618096
>>   0.0084610   0.0162350   0.0308314   0.0393550   0.0504898   0.0641784   0.0500531   0.0585372   0.0652503   0.0736086
>>   0.0105060   0.0235956   0.0373542   0.0465218   0.0594338   0.0500659   0.0609870   0.0713593   0.0790979   0.0899705
>>   0.0148870   0.0233453   0.0382854   0.0579996   0.0480443   0.0571550   0.0695515   0.0805356   0.0899501   0.1026436
>>   0.0159232   0.0315198   0.0468793   0.0580004   0.0528670   0.0641594   0.0777459   0.0901399   0.1043304   0.1173685
>>   0.0191256   0.0280614   0.0450657   0.0671604   0.0608538   0.0769042   0.0929764   0.1040684   0.1214832   0.1341881
>>
>> using Matlab 2007a:
>>
>> octave:4> tocs_matlab
>> tocs_matlab =
>>
>>   0.0062376   0.0139384   0.0180750   0.0221970   0.0220130   0.0257478   0.0305874   0.0344506   0.0383642   0.0428446
>>   0.0109510   0.0266958   0.0351854   0.0458640   0.0620206   0.0698450   0.0828816   0.0961360   0.1055158   0.1111318
>>   0.0134432   0.0272864   0.0418616   0.0544420   0.0615062   0.0739450   0.0873426   0.0982644   0.1111086   0.1225830
>>   0.0184896   0.0378428   0.0543014   0.0658802   0.0816770   0.0981762   0.1139014   0.1303218   0.1464620   0.1628398
>>   0.0219104   0.0428282   0.0660560   0.0862450   0.1027792   0.1222084   0.1425500   0.1629794   0.1833898   0.2025316
>>   0.0265670   0.0535164   0.0782622   0.0981846   0.1218348   0.1462216   0.1700228   0.1945714   0.2187592   0.2437806
>>   0.0340150   0.0575714   0.0851972   0.1174214   0.1405062   0.1686618   0.1965610   0.2246892   0.2526790   0.2803052
>>   0.0348728   0.0733926   0.1020340   0.1292908   0.1611636   0.1934834   0.2251806   0.2569800   0.2883594   0.3213730
>>   0.0423292   0.0779326   0.1099018   0.1503224   0.1818022   0.2175340   0.2541414   0.2897038   0.3259232   0.3611042
>>   0.0422816   0.0899742   0.1258612   0.1622720   0.2018848   0.2427072   0.2823120   0.3228700   0.3632112   0.4034948
>>
>> You can see that Matlab is actually slower than Octave in most cases
>> (even with the previous implementation).
>>
>> The speed-ups (in %) of the new code compared to the old one (more is
>> better, 0 is no speed-up)
>>
>> octave:5> 100 * (tocs_oct0 ./ tocs_oct1 - 1)
>> ans =
>>
>>    71.410    45.645    65.257   276.671   245.425   264.099   215.913   200.206   137.462   159.501
>>   279.935   377.184   372.942   252.378   239.414   164.228   150.863   134.938   117.508   127.072
>>   131.090   159.694   160.428   152.063   111.125    93.595    84.036    73.706    69.362    73.946
>>   147.882   173.962    98.506    82.752    67.542    56.413    49.128    44.409    45.604    45.689
>>   194.121   103.835   105.200    55.150    44.523    32.485    31.522    79.101    78.090    76.002
>>   160.868   104.162    54.297    54.854    39.997    26.754    85.606    72.388    77.340    72.599
>>   108.435    30.080    17.717    26.727    29.975    68.627    64.974    56.256    60.479    56.995
>>    41.537    59.772    33.078    21.937    69.506    77.616    65.040    55.455    57.333    54.725
>>    55.819    15.888    12.059    27.125    80.391    57.886    51.079    47.244    41.459    40.772
>>    27.655    62.245    37.662    15.809    54.951    43.691    37.429    40.620    34.589    33.229
>>
>> Apparently (and as expected), the small kernels benefit the most.
>> I think 8x8 is unnecessarily high (it generates 64 functions); I'll
>> probably reduce that to 7x5 or something like that.
>>
>> For comparison, speed-ups compared to Matlab:
>>
>> ans =
>>
>>   -16.755    49.776    78.742   273.261   225.548   250.730   219.470   208.410   147.342   175.528
>>    94.059   331.103   317.473   276.633   352.618   320.198   314.298   307.414   277.673   290.621
>>   121.310   221.083   231.530   229.527   177.146   170.538   162.481   151.523   155.490   169.359
>>   192.703   243.976   215.087   166.290   161.886   153.596   144.878   139.552   147.134   153.449
>>   227.861   202.071   198.142   178.828   152.589   143.506   136.433   230.673   234.399   227.670
>>   213.994   229.637   153.839   149.484   141.306   127.836   239.685   232.389   235.262   231.185
>>   223.769   143.992   128.079   152.401   136.408   236.879   222.300   214.870   219.451   211.552
>>   134.249   214.378   166.509   122.917   235.448   238.524   223.761   219.089   220.577   213.096
>>   165.833   147.250   134.435   159.175   243.886   239.052   226.887   221.394   212.395   207.667
>>   121.073   220.633   179.284   141.619   231.754   215.597   203.638   210.248   198.981   200.693
>>
>> Hmm. Maybe Matlab has improved since 2007a?
>>
>> There are still a few gotchas:
>>
>> First, only the "full" convolution is computed directly; the other
>> cases are extracted from it. I can try to add direct code for the
>> "valid" case in the future.
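For illustration, extracting the "same" and "valid" shapes from the "full" result is plain indexing. A pure-Python sketch under that assumption (hypothetical names, not the liboctave code; the centering convention for even-sized kernels is an assumption):

```python
# Illustrative sketch: compute "full", then slice out "same" or "valid".

def conv2(a, b, shape="full"):
    ma, na, mb, nb = len(a), len(a[0]), len(b), len(b[0])
    full = [[0.0] * (na + nb - 1) for _ in range(ma + mb - 1)]
    for i in range(ma):
        for j in range(na):
            for k in range(mb):
                for l in range(nb):
                    full[i + k][j + l] += a[i][j] * b[k][l]
    if shape == "full":
        return full
    if shape == "same":        # central ma x na part of the full result
        r0, c0 = mb // 2, nb // 2
        return [row[c0:c0 + na] for row in full[r0:r0 + ma]]
    if shape == "valid":       # positions where b lies fully inside a
        return [row[nb - 1:na] for row in full[mb - 1:ma]]
    raise ValueError(shape)
```

Direct code for "valid" would skip computing the border of the full result entirely, which is where the future savings mentioned above would come from.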
>>
>> Also, quoting the sources:
>>
>> // FIXME: Only the axpy-form accumulation is used. This is natural for outer
>> // convolution as it requires no boundary treatment.
>> // This typically requires one more memory store per operation, but as the
>> // memory access pattern is equivalent (switching ro and rw parts), I wouldn't
>> // expect a significant difference. cf. for instance sum(), where row-wise sum
>> // (axpy-based) is no slower than column-wise (dot-based).
>> // It would be nice, however, if the inner convolution was computed directly by
>> // dot-based accumulation.
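The distinction in the comment above can be shown in a hedged 1-D sketch (Python, illustrative only): the axpy form scatters each input sample into the output, so every write of a full convolution stays in bounds with no boundary tests; the dot form gathers a window of the input into each output sample, which is natural for the "valid" (inner) case.

```python
def conv_full_axpy(a, b):
    # axpy-form: for each a[i], do c[i:i+len(b)] += a[i] * b (a scatter).
    # Every write lands inside c, so no boundary treatment is needed.
    c = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for k, bk in enumerate(b):
            c[i + k] += ai * bk
    return c

def conv_valid_dot(a, b):
    # dot-form: each output element is a dot product of the reversed
    # kernel with a window of a (a gather), natural for "valid".
    n = len(a) - len(b) + 1
    return [sum(a[i + k] * b[-1 - k] for k in range(len(b)))
            for i in range(n)]
```

Both compute the same quantities; the dot form reads many, writes once, while the axpy form reads once, writes many, which is the extra memory store the comment mentions.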
>>
>> Looking at the assembler code generated for oct-convn.cc, I would
>> expect the inner loops to be vectorized, but it doesn't appear so -
>> they're only unrolled. This is a bit of a surprise to me because with
>> newer gcc, -ftree-vectorize is on with -O3. I googled a little and
>> found that aliasing might be the issue, but adding __restrict__
>> doesn't seem to help. One more thing to figure out.
>>
>> regards
>>
>
> In any case, convn results are decisive:
>
> old:
> octave:1> a = rand (100, 100, 100); b = rand (5, 5, 5); tic; c = convn (a, b); toc
> Elapsed time is 10.5321 seconds.
>
> new:
> octave:2> a = rand (100, 100, 100); b = rand (5, 5, 5); tic; c = convn (a, b); toc
> Elapsed time is 0.206056 seconds.
>
> I removed the old convn implementation, as the new code appears to be
> much faster.
>
> enjoy
>
More updates: I tried the second of my ideas and rewrote the conv2
core routines as a sequence of BLAS calls to xAXPY.
I also reordered the outer loops to play more nicely with this approach.
Somewhat surprisingly, I generally get much better numbers,
using the same benchmark as before:
octave:2> tocs
tocs =
   0.0089948   0.0063681   0.0065102   0.0103934   0.0080262   0.0085852   0.0090431   0.0096444   0.0103502   0.0117714
   0.0072873   0.0096438   0.0105184   0.0083142   0.0097248   0.0108656   0.0110642   0.0125764   0.0135653   0.0140663
   0.0066899   0.0075280   0.0108786   0.0133753   0.0110940   0.0128937   0.0142640   0.0152412   0.0170234   0.0176910
   0.0077328   0.0103620   0.0120998   0.0120170   0.0129755   0.0151224   0.0170028   0.0185525   0.0204348   0.0216798
   0.0073034   0.0090684   0.0131918   0.0162982   0.0148558   0.0173875   0.0195776   0.0213011   0.0236975   0.0254478
   0.0084582   0.0119222   0.0143224   0.0148620   0.0169500   0.0197053   0.0228433   0.0241328   0.0275682   0.0297756
   0.0110824   0.0098990   0.0124908   0.0193302   0.0188160   0.0220484   0.0248796   0.0273018   0.0307710   0.0329692
   0.0092370   0.0135680   0.0170529   0.0178676   0.0206368   0.0245049   0.0276646   0.0303240   0.0343668   0.0379746
   0.0118237   0.0113898   0.0148616   0.0224040   0.0231708   0.0295711   0.0335254   0.0342092   0.0406493   0.0436242
   0.0102231   0.0151086   0.0189038   0.0221970   0.0274470   0.0323122   0.0336478   0.0393838   0.0466128   0.0452511
Only the smallest kernels are somewhat worse, and not by much. Also,
those tiny timings fluctuate wildly, so even that is not certain (and
I'd need to recompile to average over more runs). The bigger kernels
seem to be up to 3x faster.
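For illustration, the axpy-based reorganization can be sketched in pure Python (hypothetical and simplified; the real code hands the per-column updates to BLAS xAXPY): each kernel element b(k,l) contributes a scaled copy of the whole input matrix into a shifted window of the result.

```python
# Illustrative sketch of an axpy-organized full 2-D convolution.

def conv2_full_axpy(a, b):
    ma, na, mb, nb = len(a), len(a[0]), len(b), len(b[0])
    c = [[0.0] * (na + nb - 1) for _ in range(ma + mb - 1)]
    # Outer loops over the (small) kernel; for each element b[k][l],
    # add b[k][l] * a into the shifted window c[k:k+ma][l:l+na].
    # Each row update below is one axpy (y += alpha * x), which is
    # what a BLAS DAXPY call would perform on contiguous data.
    for k in range(mb):
        for l in range(nb):
            alpha = b[k][l]
            for i in range(ma):
                row, ai = c[i + k], a[i]
                for j in range(na):
                    row[j + l] += alpha * ai[j]   # the axpy kernel
    return c
```

With this loop order, the long (length-MA or NA) dimension is innermost, so each axpy runs over a long contiguous vector, which is where BLAS pays off.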
The change is online:
http://hg.savannah.gnu.org/hgweb/octave/rev/5af0b4bb384d
This is actually the simplest axpy-based approach; there may still be
room for optimization. With larger kernels, it would surely pay off
to split the matrix and do the convolution by blocks. However, as has
already been said, with larger kernels the FFT-based approach is
generally preferable if speed is of any concern, because a loop-based
approach is always O(MA*NA*MB*NB), whereas the FFT-based one is better:
O(MA*NA*log(MA*NA) + MB*NB*log(MB*NB)).
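A back-of-the-envelope comparison of the two operation counts (Python; the constants are illustrative assumptions: one multiply-add per iteration for the direct method, roughly three FFTs of the zero-padded size plus a pointwise product for the FFT method):

```python
import math

MA = NA = 1000          # data size, as in the benchmark above
MB = NB = 100           # a deliberately "large" kernel

direct = MA * NA * MB * NB                   # O(MA*NA*MB*NB) multiply-adds
M, N = MA + MB - 1, NA + NB - 1              # zero-padded size of the full result
fft = 3 * M * N * math.log2(M * N) + M * N   # ~3 FFTs + pointwise product

print(direct / fft)   # the direct method does roughly 100x more work here
```

For an 8x8 kernel on the same data, the direct count drops to 6.4e7, which is why the loop-based code targets small kernels.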
regards
--
RNDr. Jaroslav Hajek, PhD
computing expert & GNU Octave developer
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz