octave-maintainers
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: FYI: optimizing certain matrix arithmetic


From: Michael Creel
Subject: Re: FYI: optimizing certain matrix arithmetic
Date: Tue, 29 Sep 2009 10:38:05 +0200

Hi all,

On an Apple Macbook Pro running Ubuntu Jaunty amd64, using the benchmark

%%%%%%%%%%%%%%%%%%%%%%
n = 500;
R = triu (rand (n));
u = rand (n, 1);

tic; for i = 1:1000; R \ u; endfor; toc
tic; for i = 1:1000; u' / R; endfor; toc
tic; for i = 1:1000; R' \ u; endfor; toc

R = tril (rand (n));
u = rand (n, 1);

tic; for i = 1:1000; R \ u; endfor; toc
tic; for i = 1:1000; u' / R; endfor; toc
tic; for i = 1:1000; R' \ u; endfor; toc

u = u + I*rand (n, 1);
tic; for i = 1:1000; R \ u; endfor; toc
tic; for i = 1:1000; R' \ u; endfor; toc


n = 800;
a = rand (n);
b = rand (n) + i*rand (n);
tic; a * b; toc
tic; b * a; toc
tic; a' * b; toc
tic; b * a'; toc
tic; a \ b; toc
tic; b / a; toc
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Octave3.0.1 that comes with Ubuntu Jaunty amd64, I get


octave:4> bench
Elapsed time is 0.20216 seconds.
Elapsed time is 1.93894 seconds.
Elapsed time is 2.33824 seconds.
Elapsed time is 0.188448 seconds.
Elapsed time is 1.95657 seconds.
Elapsed time is 2.43552 seconds.
Elapsed time is 4.08299 seconds.
Elapsed time is 7.84752 seconds.
Elapsed time is 0.213021 seconds.
Elapsed time is 0.21117 seconds.
Elapsed time is 0.218387 seconds.
Elapsed time is 0.217174 seconds.
Elapsed time is 0.452714 seconds.
Elapsed time is 0.391383 seconds.
octave:5>



Matlab 2008b gives
>> bench
Elapsed time is 0.289161 seconds.
Elapsed time is 0.566446 seconds.
Elapsed time is 0.562623 seconds.
Elapsed time is 0.253456 seconds.
Elapsed time is 0.574304 seconds.
Elapsed time is 0.570281 seconds.
Elapsed time is 0.253070 seconds.
Elapsed time is 0.572601 seconds.
Elapsed time is 0.102086 seconds.
Elapsed time is 0.102677 seconds.
Elapsed time is 0.103080 seconds.
Elapsed time is 0.103759 seconds.
Elapsed time is 0.165608 seconds.
Elapsed time is 0.181704 seconds.
>>


Octave 3.2.3+ from today, self compiled, gives
octave:1> bench
Elapsed time is 0.208794 seconds.
Elapsed time is 0.189178 seconds.
Elapsed time is 0.186724 seconds.
Elapsed time is 0.188649 seconds.
Elapsed time is 0.192915 seconds.
Elapsed time is 0.19166 seconds.
Elapsed time is 0.186277 seconds.
Elapsed time is 0.19102 seconds.
Elapsed time is 0.212707 seconds.
Elapsed time is 0.211013 seconds.
Elapsed time is 0.210491 seconds.
Elapsed time is 0.210447 seconds.
Elapsed time is 0.431791 seconds.
Elapsed time is 0.367412 seconds.
octave:2>

Congratulations!
Michael






On Fri, Sep 25, 2009 at 8:55 AM, Ben Abbott <address@hidden> wrote:
>
> On Sep 23, 2009, at 12:50 PM, Jaroslav Hajek wrote:
>
>> hi all,
>>
>> the following three patches:
>> http://hg.savannah.gnu.org/hgweb/octave/rev/afcf852256d2
>> http://hg.savannah.gnu.org/hgweb/octave/rev/0d3b248f4ab6
>> http://hg.savannah.gnu.org/hgweb/octave/rev/7e5b4de5fbfe
>>
>> add new optimizations for matrix multiplication and division. Quote for
>> NEWS:
>>
>> ** More efficient matrix division handling. Octave is now able to
>> handle the expressions
>>
>>      M' \ v
>>      M.' \ v
>>      v / M
>>
>>   (M is a matrix and v is a vector) more efficiently in certain cases.
>>   In particular, if M is triangular, all three expressions will be
>> handled by a single call to
>>   xTRTRS, with appropriate flags. Previously, all three expressions
>> required a physical transpose
>>   of M.
>>
>> ** More efficient handling of certain mixed real-complex matrix
>> operations.
>>   For instance, if RM is a real matrix and CM a complex matrix,
>>
>>     RM * CM
>>
>>   can now be evaluated either as
>>
>>     complex (RM * real (CM), RM * imag (CM))
>>
>>   or as
>>
>>     complex (RM) * CM,
>>
>>   depending on the dimensions. The 1st form requires more
>> temporaries and copying,
>>   but halves the count of FLOPs, which normally brings better performance
>> if
>>   RM has enough rows. Previously, the 2nd form was always used.
>>
>>   Matrix division is similarly affected.
>>
>> The triangular optimization is well demonstrated by this benchmark:
>>
>> n = 500;
>> R = triu (rand (n));
>> u = rand (n, 1);
>>
>> tic; for i = 1:1000; R \ u; endfor; toc
>> tic; for i = 1:1000; u' / R; endfor; toc
>> tic; for i = 1:1000; R' \ u; endfor; toc
>>
>> R = tril (rand (n));
>> u = rand (n, 1);
>>
>> tic; for i = 1:1000; R \ u; endfor; toc
>> tic; for i = 1:1000; u' / R; endfor; toc
>> tic; for i = 1:1000; R' \ u; endfor; toc
>>
>> u = u + I*rand (n, 1);
>> tic; for i = 1:1000; R \ u; endfor; toc
>> tic; for i = 1:1000; R' \ u; endfor; toc
>>
>>
>> with a recent tip, I got (Core 2 Duo @ 2.83 GHz, g++ -O3
>> -march=native, self-built GotoBLAS + LAPACK):
>>
>> Elapsed time is 0.225974 seconds.
>> Elapsed time is 0.78818 seconds.
>> Elapsed time is 1.81756 seconds.
>> Elapsed time is 0.219842 seconds.
>> Elapsed time is 0.79384 seconds.
>> Elapsed time is 1.83497 seconds.
>> Elapsed time is 4.32738 seconds.
>> Elapsed time is 7.73774 seconds.
>>
>> with the new patches, I get
>>
>> Elapsed time is 0.221489 seconds.
>> Elapsed time is 0.204968 seconds.
>> Elapsed time is 0.204267 seconds.
>> Elapsed time is 0.219243 seconds.
>> Elapsed time is 0.20444 seconds.
>> Elapsed time is 0.202193 seconds.
>> Elapsed time is 0.227214 seconds.
>> Elapsed time is 0.209824 seconds.
>>
>> This is particularly cool when working intensively with Cholesky, LU
>> or QR factorizations - you can do both A \ v and A' \ v almost as
>> efficiently as with raw BLAS.
>>
>> Also, there's some speed-up with real x complex multiplication and
>> division, which is not a very common operation, but sometimes needed.
>> The point is that for bigger matrices, complex (RM * real (CM), RM *
>> imag (CM)) is typically up to 2x faster than complex (RM) * CM.
>> Surprised? Just count the FLOPs. On the other hand, the former
>> expression is less friendly for Octave's complex storage.
>>
>> The benchmark is:
>>
>> n = 800;
>> a = rand (n);
>> b = rand (n) + i*rand (n);
>> tic; a * b; toc
>> tic; b * a; toc
>> tic; a' * b; toc
>> tic; b * a'; toc
>> tic; a \ b; toc
>> tic; b / a; toc
>>
>> with a recent tip:
>>
>> Elapsed time is 0.431021 seconds.
>> Elapsed time is 0.410468 seconds.
>> Elapsed time is 0.411169 seconds.
>> Elapsed time is 0.412252 seconds.
>> Elapsed time is 0.641075 seconds.
>> Elapsed time is 0.66255 seconds.
>>
>> with the new patches:
>>
>> Elapsed time is 0.221522 seconds.
>> Elapsed time is 0.2183 seconds.
>> Elapsed time is 0.22033 seconds.
>> Elapsed time is 0.22058 seconds.
>> Elapsed time is 0.297738 seconds.
>> Elapsed time is 0.325542 seconds.
>>
>> The conclusion is that Octave 3.4 will again be able to map linear
>> algebra code more efficiently onto BLAS and LAPACK than the previous
>> version.
>
> Running Matlab on Mac OSX (MKL, or ???)
>
> Elapsed time is 0.277841 seconds.
> Elapsed time is 2.040261 seconds.
> Elapsed time is 2.051063 seconds.
> Elapsed time is 0.276847 seconds.
> Elapsed time is 2.070356 seconds.
> Elapsed time is 2.023176 seconds.
> Elapsed time is 0.381263 seconds.
> Elapsed time is 2.177680 seconds.
>
> Octave compiled with Apples Lapack/Blas (vecLib).
>
> Elapsed time is 0.211057 seconds.
> Elapsed time is 0.230565 seconds.
> Elapsed time is 0.202384 seconds.
> Elapsed time is 0.219704 seconds.
> Elapsed time is 0.219448 seconds.
> Elapsed time is 0.198678 seconds.
> Elapsed time is 0.308611 seconds.
> Elapsed time is 0.282597 seconds.
>
> Ben
>
>



reply via email to

[Prev in Thread] Current Thread [Next in Thread]