
Re: [Bug-gnubg] Benchmarks, experiments, speedups


From: Øystein Johansen
Subject: Re: [Bug-gnubg] Benchmarks, experiments, speedups
Date: Fri, 15 Apr 2005 19:30:19 +0200
User-agent: Mozilla Thunderbird 0.8 (Windows/20040913)


macherius wrote:
| I've produced 2 intrinsic versions of the inner loop from listing 1,
| one using GCC syntax (listing 4 (C) and 5 (ASM output)) and one using
| icc syntax (listing 6 (C) and 7 (ASM output)). Due to time and bug
| constraints, I was not able to run the gcc version, so it may well be
| incorrect.

I've made a vectorized version of the last loop in Evaluate() that
works, using GCC intrinsics. That loop only does 128 * 5
multiplications, so I didn't expect much of a speed improvement, and I
didn't get much either.

I've attached a patch. It works as expected, but it assumes that the
neural net has 128 hidden nodes, so make sure you don't use any of the
pruning neural nets.

I compiled it with GCC 3.4.2. With GCC 4.1 I get some errors with the
typedef of v4sf.

Can you look at the patch and see if it can be improved further? If
not, I will start working on the main loop with its 250 * 128
multiplications. I guess that will be a real killer.
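For the record, the main loop should vectorize the same way: broadcast one input value to all four lanes, then multiply-accumulate four hidden nodes per instruction. A minimal sketch, using the Intel-style intrinsics from <xmmintrin.h> (which GCC also ships) and an invented helper name — this is not the actual gnubg loop, just the technique:

```c
#include <xmmintrin.h>  /* SSE intrinsics, Intel-style _mm_* names */

#define HIDDEN 128  /* stands in for pnn->cHidden of the big net */

/* One step of the hypothetical hidden-layer accumulation:
 * ar[j] += input * weight[j] for all j, four floats per iteration.
 * Assumes HIDDEN is a multiple of 4; uses unaligned loads, so no
 * alignment guarantees are required of the caller. */
static void accumulate_hidden(float *ar, const float *weight, float input)
{
    __m128 in = _mm_set1_ps(input);            /* broadcast input to 4 lanes */
    int j;
    for (j = 0; j < HIDDEN; j += 4) {
        __m128 w = _mm_loadu_ps(weight + j);   /* 4 weights */
        __m128 a = _mm_loadu_ps(ar + j);       /* 4 running sums */
        _mm_storeu_ps(ar + j, _mm_add_ps(a, _mm_mul_ps(in, w)));
    }
}
```

Calling this once per active input avoids any horizontal add inside the hot loop; the sums stay vertical in the four accumulators.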

| So what is worth knowing if we decided to code intrinsics for gcc?

| 1) All commands needed for "float" as compared to "double" arithmetic
| are in SSE already. The only addition useful for gnubg code in SSE2
| would be data type conversion (i.e. int to float). The other commands
| deal with double precision arithmetic, which gnubg does not use. So
| we should produce SSE code rather than SSE2 code, which will run on a
| much wider base of CPUs (e.g. including AMD) too.

Sure! I've only used SSE intrinsics, no SSE2!

| 2) In the early stages of development, gcc used Intel syntax for its
| intrinsics, but later gcc switched to its own naming scheme. The
| intrinsics are still 1:1 apart from naming and the fact that Intel
| offers a few more, which are macros (i.e. combinations of several SSE
| instructions). So if intrinsics are coded for gnubg, it seems
| appropriate to use our own syntax so we can #define the compile-time
| appearance for both Intel and gcc.

Since I only have GCC, I will use the GCC naming scheme. Let's worry
about other compilers later. I guess it won't be that many changes.
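A sketch of the #define layer macherius suggests, with invented VEC_* names (nothing here is gnubg code). The GCC branch uses the vector_size typedef with the __builtin_ia32_* names from the patch — the vector_size form is also what newer GCCs want instead of the mode(V4SF) typedef; other compilers would get the Intel _mm_* names from <xmmintrin.h>:

```c
/* Hypothetical compiler-neutral layer; the VEC_* names are invented. */
#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
typedef float v4sf __attribute__ ((vector_size (16)));
#define VEC_MUL(a, b) __builtin_ia32_mulps((a), (b))
#define VEC_ADD(a, b) __builtin_ia32_addps((a), (b))
#else
#include <xmmintrin.h>
typedef __m128 v4sf;
#define VEC_MUL(a, b) _mm_mul_ps((a), (b))
#define VEC_ADD(a, b) _mm_add_ps((a), (b))
#endif
```

Code written against VEC_MUL / VEC_ADD then compiles unchanged under either naming scheme.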

[snip: points 3 and 4, about alignment]

Alignment? I'm not even sure I know what it is....
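For what it's worth: an address is 16-byte aligned when it is a multiple of 16. SSE's aligned load (movaps) requires that and faults otherwise; the loadups in the patch accepts any address at some speed cost. Static arrays can simply be declared with __attribute__ ((aligned (16))); for heap memory, one portable trick is to over-allocate and round the pointer up. A small sketch with invented helper names, not gnubg code:

```c
#include <stdint.h>
#include <stdlib.h>

/* True when p is 16-byte aligned, i.e. its address is a multiple of 16. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Get a 16-byte-aligned float array from malloc: over-allocate by
 * 15 bytes and round the pointer up to the next multiple of 16.
 * *raw receives the original pointer, which is what free() needs. */
static float *alloc_aligned16(size_t nfloats, void **raw)
{
    *raw = malloc(nfloats * sizeof(float) + 15);
    if (*raw == NULL)
        return NULL;
    return (float *)(((uintptr_t)*raw + 15) & ~(uintptr_t)15);
}
```

With guaranteed alignment, the unaligned loadups in the inner loop could be swapped for the faster aligned variant.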

-Øystein

Index: neuralnet.c
===================================================================
RCS file: /cvsroot/gnubg/gnubg/lib/neuralnet.c,v
retrieving revision 1.23
diff -u -r1.23 neuralnet.c
--- neuralnet.c 25 Feb 2005 11:34:24 -0000      1.23
+++ neuralnet.c 15 Apr 2005 17:10:25 -0000
@@ -444,6 +444,21 @@
   return 0;
 }
 
+typedef int v4sf __attribute__ ((mode(V4SF)));
+
+typedef union _vec4f {
+  v4sf v;
+  float f[4];
+} vec4f;
+
+#if DEBUG
+void
+printvec( v4sf vec ){
+  float *pFloat = ( float *) &vec;
+  printf("%f, %f, %f, %f\n", pFloat[0], pFloat[1], pFloat[2], pFloat[3]);
+}
+#endif
+
 static int Evaluate( neuralnet *pnn, float arInput[], float ar[],
                         float arOutput[], float *saveAr ) {
 
@@ -452,6 +467,8 @@
 #else
     int i, j;
     float *prWeight;
+    
+    assert(pnn->cHidden == 128);
 
     /* Calculate activity at hidden nodes */
     for( i = 0; i < pnn->cHidden; i++ )
@@ -484,14 +501,22 @@
 
     /* Calculate activity at output nodes */
     prWeight = pnn->arOutputWeight;
-
+    
     for( i = 0; i < pnn->cOutput; i++ ) {
-       float r = pnn->arOutputThreshold[ i ];
-       
-       for( j = 0; j < pnn->cHidden; j++ )
-           r += ar[ j ] * *prWeight++;
-
-       arOutput[ i ] = sigmoid( -pnn->rBetaOutput * r );
+       float r = pnn->arOutputThreshold[ i ];
+       float *pr = ar;
+       vec4f sum;
+       v4sf vec0, vec1, vec3;
+       sum.v = __builtin_ia32_xorps(sum.v, sum.v);
+       for( j = 32; j ; j--, prWeight += 4, pr += 4 ){
+         vec0 = __builtin_ia32_loadups(pr);       /* Four floats into vec0 */
+         vec1 = __builtin_ia32_loadups(prWeight); /* Four weights into vec1 */
+         vec3 = __builtin_ia32_mulps(vec0, vec1); /* Multiply */
+         sum.v = __builtin_ia32_addps(sum.v, vec3); /* Add */
+       }
+       
+       r += sum.f[0] + sum.f[1] + sum.f[2] + sum.f[3]; 
+       arOutput[ i ] = sigmoid( -pnn->rBetaOutput * r );
     }
 
     return 0;

