[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Octave-patch-tracker] [patch #9415] Streamline hermitian transpose rout
From: |
Dan Sebald |
Subject: |
[Octave-patch-tracker] [patch #9415] Streamline hermitian transpose routine of Array.cc for efficiency |
Date: |
Sun, 23 Jul 2017 17:06:10 -0400 (EDT) |
User-agent: |
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0 |
URL:
<http://savannah.gnu.org/patch/?9415>
Summary: Streamline hermitian transpose routine of Array.cc
for efficiency
Project: GNU Octave
Submitted by: sebald
Submitted on: Sun 23 Jul 2017 09:06:08 PM UTC
Category: None
Priority: 5 - Normal
Status: None
Privacy: Public
Assigned to: None
Originator Email:
Open/Closed: Open
Discussion Lock: Any
_______________________________________________________
Details:
Attached is a patch that improves upon the Hermitian transpose in Array.cc.
The main oversight is the fact that no blocking is done on the last few
columns, and I think that is most likely to be the case where the cache misses
are going to occur. Cases where there are large numbers of rows and only a
few columns (very conceivable user-construct) take a large hit as a result.
Call this "partial blocking".
So I've actually simplified the code by moving those left-over partial blocks
into the main save/copy loops and discarding the extra loops. Call this "full
blocking".
I also found that moving fcn() to the first loop (i.e., saving into temp
buffer) rather than the second loop (i.e., copying out of temp buffer) is 10%
better, or so. Don't know why...probably has something to do with the
function call with data that is more spread out in memory.
Then, as I looked at the resulting code, I wondered why the special case of NR
> 8 and NC > 8. I constructed a test for that and found that the blocking is
beneficial for certainly NR or NC equal 7. There is marginal loss for more
complicated looping when NR=1 or NC=1, so I see no point in not always using
the the block looping and simply tossed that second case code.
I replaced the remaining 8s with SPANC and SPANR where appropriate.
The algorithms tested are:
+verbatim*
Test scripts:
1)
start = cputime; x = rand(10000) + i*rand(10000); cputime - start
start = cputime; x'; cputime - start
2-A)
start = cputime; x = rand(10000000,10) + i*rand(10000000,10); cputime - start
start = cputime; x'; cputime - start
2-B)
start = cputime; x = rand(10,10000000) + i*rand(10,10000000); cputime - start
start = cputime; x'; cputime - start
3-A)
start = cputime; x = rand(12000000,7) + i*rand(12000000,7); cputime - start
start = cputime; x'; cputime - start
3-B)
start = cputime; x = rand(7,12000000) + i*rand(7,12000000); cputime - start
start = cputime; x'; cputime - start
4-A)
start = cputime; x = rand(100000000,1) + i*rand(100000000,1); cputime - start
start = cputime; x'; cputime - start
4-B)
start = cputime; x = rand(1,100000000) + i*rand(1,100000000); cputime - start
start = cputime; x'; cputime - start
A summary of the results are as follows:
SUMMARY
NOBLOCK CURRENT PARTIAL FULL
PARTIAL BLOCKING BLOCKING
BLOCK FCN FIRST FCN FIRST
NO NR,NC NO NR,NC
------- ------- --------- ---------
1) 4.05 2.62 2.37 2.35
2-A) 10.3 3.73 3.68 2.47
2-B) 2.08 1.89 1.88 1.88
3-A) 5.77 5.77 5.78 1.60
3-B) 1.59 1.60 1.59 1.58
4-A) 1.79 1.81 1.78 1.78
4-B) 1.77 1.83 1.78 1.83
Here's a more detailed description and numbers:
WITHOUT BLOCKING CACHE OPTIMIZATION
-----------------------------------
octave:1> start = cputime; x = rand(10000) + i*rand(10000); cputime - start
ans = 8.5840
octave:2> start = cputime; x'; cputime - start
ans = 4.0560
octave:3> start = cputime; x'; cputime - start
ans = 4.0640
octave:4> start = cputime; x = rand(10000000,10) + i*rand(10000000,10);
cputime - start
ans = 9.2800
octave:5> start = cputime; x'; cputime - start
ans = 10.264
octave:6> start = cputime; x'; cputime - start
ans = 10.344
octave:7> start = cputime; x = rand(10,10000000) + i*rand(10,10000000);
cputime - start
ans = 8.6160
octave:8> start = cputime; x'; cputime - start
ans = 2.0760
octave:9> start = cputime; x'; cputime - start
ans = 2.1000
octave:10> start = cputime; x = rand(12000000,7) + i*rand(12000000,7); cputime
- start
ans = 7.2560
octave:11> start = cputime; x'; cputime - start
ans = 5.7680
octave:12> start = cputime; x'; cputime - start
ans = 5.7840
octave:13> start = cputime; x = rand(7,12000000) + i*rand(7,12000000); cputime
- start
ans = 6.9480
octave:14> start = cputime; x'; cputime - start
ans = 1.5880
octave:15> start = cputime; x'; cputime - start
ans = 1.5920
octave:16> start = cputime; x = rand(100000000,1) + i*rand(100000000,1);
cputime - start
ans = 8.4960
octave:17> start = cputime; x'; cputime - start
ans = 1.7880
octave:18> start = cputime; x'; cputime - start
ans = 1.8000
octave:19> start = cputime; x = rand(1,100000000) + i*rand(1,100000000);
cputime - start
ans = 8.5920
octave:20> start = cputime; x'; cputime - start
ans = 1.7760
octave:21> start = cputime; x'; cputime - start
ans = 1.7760
WITH CURRENT PARTIAL BLOCKING
-----------------------------
octave:1> start = cputime; x = rand(10000) + i*rand(10000); cputime - start
ans = 7.4280
octave:2> start = cputime; x'; cputime - start
ans = 2.5960
octave:3> start = cputime; x'; cputime - start
ans = 2.6640
octave:4> start = cputime; x = rand(10000000,10) + i*rand(10000000,10);
cputime - start
ans = 8.3680
octave:5> start = cputime; x'; cputime - start
ans = 3.7320
octave:6> start = cputime; x'; cputime - start
ans = 3.7400
octave:7> start = cputime; x = rand(10,10000000) + i*rand(10,10000000);
cputime - start
ans = 8.0280
octave:8> start = cputime; x'; cputime - start
ans = 1.8800
octave:9> start = cputime; x'; cputime - start
ans = 1.8960
octave:10> start = cputime; x = rand(12000000,7) + i*rand(12000000,7); cputime
- start
ans = 6.4800
octave:11> start = cputime; x'; cputime - start
ans = 5.7680
octave:12> start = cputime; x'; cputime - start
ans = 5.7720
octave:13> start = cputime; x = rand(7,12000000) + i*rand(7,12000000); cputime
- start
ans = 6.3080
octave:14> start = cputime; x'; cputime - start
ans = 1.6000
octave:15> start = cputime; x'; cputime - start
ans = 1.6080
octave:16> start = cputime; x = rand(100000000,1) + i*rand(100000000,1);
cputime - start
ans = 7.8640
octave:17> start = cputime; x'; cputime - start
ans = 1.7960
octave:18> start = cputime; x'; cputime - start
ans = 1.8320
octave:19> start = cputime; x = rand(1,100000000) + i*rand(1,100000000);
cputime - start
ans = 7.8720
octave:20> start = cputime; x'; cputime - start
ans = 1.8280
octave:21> start = cputime; x'; cputime - start
ans = 1.8320
WITH CURRENT PARTIAL BLOCKING AND FCN MOVED AHEAD NO NR,NC RESTRICTION
----------------------------------------------------------------------
octave:1> start = cputime; x = rand(10000) + i*rand(10000); cputime - start
ans = 7.5600
octave:2> start = cputime; x'; cputime - start
ans = 2.3520
octave:3> start = cputime; x'; cputime - start
ans = 2.3920
octave:4> start = cputime; x = rand(10000000,10) + i*rand(10000000,10);
cputime - start
ans = 8.3320
octave:5> start = cputime; x'; cputime - start
ans = 3.6720
octave:6> start = cputime; x'; cputime - start
ans = 3.6880
octave:7> start = cputime; x = rand(10,10000000) + i*rand(10,10000000);
cputime - start
ans = 8.1120
octave:8> start = cputime; x'; cputime - start
ans = 1.8720
octave:9> start = cputime; x'; cputime - start
ans = 1.8800
octave:10> start = cputime; x = rand(12000000,7) + i*rand(12000000,7); cputime
- start
ans = 6.6080
octave:11> start = cputime; x'; cputime - start
ans = 5.7720
octave:12> start = cputime; x'; cputime - start
ans = 5.8080
octave:13> start = cputime; x = rand(7,12000000) + i*rand(7,12000000); cputime
- start
ans = 6.4880
octave:14> start = cputime; x'; cputime - start
ans = 1.5920
octave:15> start = cputime; x'; cputime - start
ans = 1.5880
octave:16> start = cputime; x = rand(100000000,1) + i*rand(100000000,1);
cputime - start
ans = 7.9680
octave:17> start = cputime; x'; cputime - start
ans = 1.7880
octave:18> start = cputime; x'; cputime - start
ans = 1.7880
octave:19> start = cputime; x = rand(1,100000000) + i*rand(1,100000000);
cputime - start
ans = 8.0720
octave:20> start = cputime; x'; cputime - start
ans = 1.7840
octave:21> start = cputime; x'; cputime - start
ans = 1.7840
WITH FULL BLOCKING AND FCN MOVED AHEAD AND NR,NC>=8
---------------------------------------------------
octave:1> start = cputime; x = rand(10000) + i*rand(10000); cputime - start
ans = 8.1040
octave:2> start = cputime; x'; cputime - start
ans = 2.5840
octave:3> start = cputime; x'; cputime - start
ans = 2.5880
octave:4> start = cputime; x = rand(10000000,10) + i*rand(10000000,10);
cputime - start
ans = 9.1200
octave:5> start = cputime; x'; cputime - start
ans = 2.4560
octave:6> start = cputime; x'; cputime - start
ans = 2.4560
octave:7> start = cputime; x = rand(10,10000000) + i*rand(10,10000000);
cputime - start
ans = 8.5720
octave:8> start = cputime; x'; cputime - start
ans = 1.9960
octave:9> start = cputime; x'; cputime - start
ans = 2.0040
octave:10> start = cputime; x = rand(12000000,7) + i*rand(12000000,7); cputime
- start
ans = 7.1400
octave:11> start = cputime; x'; cputime - start
ans = 5.7600
octave:12> start = cputime; x'; cputime - start
ans = 5.7600
octave:13> start = cputime; x = rand(7,12000000) + i*rand(7,12000000); cputime
- start
ans = 6.7880
octave:14> start = cputime; x'; cputime - start
ans = 1.6120
octave:15> start = cputime; x'; cputime - start
ans = 1.6120
octave:16> start = cputime; x = rand(100000000,1) + i*rand(100000000,1);
cputime - start
ans = 8.3000
octave:17> start = cputime; x'; cputime - start
ans = 1.7840
octave:18> start = cputime; x'; cputime - start
ans = 1.8000
octave:19> start = cputime; x = rand(1,100000000) + i*rand(1,100000000);
cputime - start
ans = 8.3960
octave:20> start = cputime; x'; cputime - start
ans = 1.7880
octave:21> start = cputime; x'; cputime - start
ans = 1.7920
WITH FULL BLOCKING AND FCN MOVED AHEAD NO NR,NC RESTRICTION
-----------------------------------------------------------
octave:1> start = cputime; x = rand(10000) + i*rand(10000); cputime - start
ans = 7.5440
octave:2> start = cputime; x'; cputime - start
ans = 2.3560
octave:3> start = cputime; x'; cputime - start
ans = 2.3520
octave:4> start = cputime; x = rand(10000000,10) + i*rand(10000000,10);
cputime - start
ans = 8.4320
octave:5> start = cputime; x'; cputime - start
ans = 2.4520
octave:6> start = cputime; x'; cputime - start
ans = 2.4840
octave:7> start = cputime; x = rand(10,10000000) + i*rand(10,10000000);
cputime - start
ans = 8.1520
octave:8> start = cputime; x'; cputime - start
ans = 1.8760
octave:9> start = cputime; x'; cputime - start
ans = 1.8920
octave:10> start = cputime; x = rand(12000000,7) + i*rand(12000000,7); cputime
- start
ans = 6.6000
octave:11> start = cputime; x'; cputime - start
ans = 1.5960
octave:12> start = cputime; x'; cputime - start
ans = 1.6040
octave:13> start = cputime; x = rand(7,12000000) + i*rand(7,12000000); cputime
- start
ans = 6.3280
octave:14> start = cputime; x'; cputime - start
ans = 1.5800
octave:15> start = cputime; x'; cputime - start
ans = 1.5960
octave:16> start = cputime; x = rand(100000000,1) + i*rand(100000000,1);
cputime - start
ans = 7.8520
octave:17> start = cputime; x'; cputime - start
ans = 1.7800
octave:18> start = cputime; x'; cputime - start
ans = 1.7880
octave:19> start = cputime; x = rand(1,100000000) + i*rand(1,100000000);
cputime - start
ans = 7.9280
octave:20> start = cputime; x'; cputime - start
ans = 1.8280
octave:21> start = cputime; x'; cputime - start
ans = 1.8400
_______________________________________________________
Reply to this item at:
<http://savannah.gnu.org/patch/?9415>
_______________________________________________
Message sent via/by Savannah
http://savannah.gnu.org/
- [Octave-patch-tracker] [patch #9415] Streamline hermitian transpose routine of Array.cc for efficiency,
Dan Sebald <=