octave-maintainers

Re: Speed improvements and slowups 3.0.x to latest 3.1.x


From: Jaroslav Hajek
Subject: Re: Speed improvements and slowups 3.0.x to latest 3.1.x
Date: Sat, 7 Feb 2009 08:22:39 +0100

On Fri, Feb 6, 2009 at 11:21 PM, David Bateman <address@hidden> wrote:
> To see if there were any speed regressions I just tried running the sciview
> benchmark on a 3.0.x version of Octave and comparing it against a build from
> the repository from an hour ago. I got the results attached. The good news
> is that
>
> * 3.1 is faster for transposes

> * 3.1 is even faster than the above would explain for operations involving
> transposes like a' * a, though it appears that a \ b' doesn't get an
> advantage from this

Yes. In 3.1, a'*b is calculated via xGEMM(TRANSA='T',...), with no
explicit transpose. This form usually has better cache behavior, so it
can be a big win, especially if a is larger than b. a'*a is even
better: that gets dispatched to xSYRK/xHERK, exploiting the
symmetry/Hermitian structure of the operation. The result of a'*a
should be perfectly Hermitian.
a \ b' is a bit of a different beast. We mostly dispatch \ and / to
LAPACK driver solvers, which generally do not support a transposed
rhs. So, at present, the only operation that would truly benefit from
optimizing a \ b' (via a new compound operator) is the case where a is
triangular. I also don't think you'd get a speed-up from cache effects
here, so I deemed it not worth doing unless there is demand.
Also, note that in that test b was in fact a vector, so there was
nothing to optimize: transposing a vector is an O(1) operation.

> * The Escoufier test is much faster, I presume due to the recent indexing
> fixes from Jaroslav..
>

Maybe. Having seen this test's code, I would note, however, that
trace(A'*A) (here even with an out-of-place transpose) is a fairly
inefficient way to get a (squared) Frobenius norm. Using norm would be
much faster, for instance.

> The bad news is that the sorting appears to be slower..

Maybe; it depends... I see 0.34868s for 3.1.51+ vs. 0.34806s for 3.0.x.

> Now there have been
> several changes to the sorting that were made, that major one from me that
> made the sort a method of the array classes. However, I seem to remember at
> the time I did this that I didn't notice significant speed changes.. I know
> Jaroslav replaced some malloc/free code in the sort code with new/delete and
> a couple of other things... Could this be the cause? Any ideas how to get
> the speed of 3.0.x back for this test?
>

Probably the only change that could have a performance effect here was
my replacement of std::memcpy and std::memmove by std::copy and
std::copy_backward (in fact, I see that 3.0.x still has the sorting
bug recently fixed in 3.1.51+). This was necessary for sorting non-POD
types like std::string and octave_value; otherwise the copy
constructors were bypassed and reference counting got corrupted.

Now, theoretically memcpy is faster than std::copy because it can make
a stronger assumption (non-overlapping ranges), but I always thought
the difference was minuscule. You could easily choose the fastest
algorithm if you could safely compare any two pointers within your
address space, which you can't in portable C/C++, though maybe you can
at the system level. I see that g++ 4.3 defers std::copy of POD types
to an internal function (__builtin_memmove), so the question is how
good that is. Were you compiling with an older version or a different
compiler? Basically, we could mimic the POD dispatching done in g++'s
standard library and put memcpy back into the game, but I don't like
that idea much; I think this is the sort of thing compilers should
handle.

Update: I just compiled 3.0.3 and the result is 0.34501s, so it really
is not that bad. I think earlier versions of g++ used C's memmove
rather than __builtin_memmove, so maybe in 4.3 they actually
implemented the optimization I talked about above. If you used g++ 4.2
or earlier, maybe simply upgrading will help.

Regarding sort, there are a number of further optimization possibilities.

1. The algorithm, borrowed from Python, is really targeted at sorting
arrays containing already-sorted segments, i.e. re-sorting. I did a
few experiments here, and it seems that std::sort from gcc 4.3, which
implements introsort, wins for random arrays but gets beaten on arrays
that contain a lot of sorted segments. Maybe we could create a hybrid,
falling back to std::sort if no long runs are detected.

2. The comparison is currently done by passing in a function pointer.
For simple POD types, this is a significant overhead. In fact, I
experimented with allowing it to be inlined by giving octave_sort a
second template parameter, and for random arrays it was a win of up to
20% (IIRC). For partially sorted arrays, the win was smaller
(apparently because the number of comparisons was smaller relative to
the bookkeeping).
The result was, however, more complicated, because you no longer have
one external octave_sort instance per type, and the instantiations got
a bit messy; it took me a while to sort out all the undefined
references at link time. And I didn't like the idea of making the
algorithm completely inline and further slowing down compilation.
Now, the mess could be avoided if we give up on optimizing descending
sorts and just optimize the ascending ones, by special-casing the case
when octave_sort::compare is null. Currently this is tested in each
comparison (see IFLT in octave_sort.cc), which is probably
sub-optimal, even though branch prediction in modern CPUs copes
amazingly well with stuff like that.
I think it is possible to move the dispatch up a little, using
templates or just code duplication, which could again win us a few
percent.

I guess I could try to implement one or both of these ideas. Do you
know of a serious Octave application where sorting is a bottleneck?

-- 
RNDr. Jaroslav Hajek
computing expert
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz

