
RE: [Bug-gnubg] Vectorizing 3rd step


From: macherius
Subject: RE: [Bug-gnubg] Vectorizing 3rd step
Date: Wed, 20 Apr 2005 15:03:39 +0200

Øystein,


> | Remove:
> |   float scale[4];
> |   v4sf scalevector;
> |   scale[0] = scale[1] = scale[2] = scale[3] = ari;
> |   scalevector = __builtin_ia32_loadaps(scale);
> |
> | Use:
> |   v4sf scalevector = (v4sf) { ari, ari, ari, ari };
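
Just to put the two variants side by side for a quick compile-and-compare
(a minimal sketch only, meant for gcc -S -O2 -msse; the function names are
mine, and I assume the usual v4sf typedef):

typedef float v4sf __attribute__ ((vector_size (16)));

/* Variant 1: round trip through a stack array -- gcc stores the four
   copies of ari and loads them back with movaps (the array must be
   16-byte aligned for the aligned load). */
static v4sf splat_via_array( float ari )
{
    float scale[4] __attribute__ ((aligned (16)));
    scale[0] = scale[1] = scale[2] = scale[3] = ari;
    return __builtin_ia32_loadaps( scale );
}

/* Variant 2: vector literal -- no named memory object at all, gcc is
   free to build the value in a register (e.g. with a single shufps). */
static v4sf splat_via_literal( float ari )
{
    return (v4sf) { ari, ari, ari, ari };
}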

I guess the whole union approach was a bad idea of mine. It generates
memory-access-heavy code, and unless the optimizer eliminates those accesses
it stays bad assembly. Obviously gcc is not smart enough to remove them.
Look at what was listing 6 in my original posting on intrinsics:

----------------------------------------------------------------------
---------------- Listing 6: ------------------------------------------
----------------- ICC 8.1.24 code (intrinsics) -----------------------
------------------ (Inner loop only [j]) -----------------------------
----------------------------------------------------------------------

for( j = 0; j < pnn->cHidden; j++ ) {
    /* r += ar[ j ] * *prWeight++; */
        __m128 vec0, vec1, vec2;
        const int k = 0;

        vec0 = _mm_load_ps ((float *) ar);
        vec1 = _mm_load_ps ((float *) prWeight);

        vec2 = _mm_setzero_ps();
        // loop is fully unrolled for the N=128 case in this example
        // #pragma vector aligned (not needed if we are unrolled)
        // for (k=0; k<pnn->cHidden/4;k+=16) { /* 128/4 = 32, OK! */
                // vec0 = ;
                // vec1 = ;
                vec0 = _mm_mul_ps(_mm_load_ps(ar + k*4),
                                  _mm_load_ps(prWeight + k*4));
                vec2 = _mm_add_ps(vec2, vec0);
                vec0 = _mm_load_ps(ar + (k+1)*4);
                vec1 = _mm_load_ps(prWeight + (k+1)*4);
                vec0 = _mm_mul_ps(vec0, vec1);
                vec2 = _mm_add_ps(vec2, vec0);
                vec0 = _mm_load_ps(ar + (k+2)*4);
                vec1 = _mm_load_ps(prWeight + (k+2)*4);
                vec0 = _mm_mul_ps(vec0, vec1);
                vec2 = _mm_add_ps(vec2, vec0);
                vec0 = _mm_load_ps(ar + (k+3)*4);
                vec1 = _mm_load_ps(prWeight + (k+3)*4);
                vec0 = _mm_mul_ps(vec0, vec1);
                vec2 = _mm_add_ps(vec2, vec0);
                vec0 = _mm_load_ps(ar + (k+4)*4);
                vec1 = _mm_load_ps(prWeight + (k+4)*4);
                vec0 = _mm_mul_ps(vec0, vec1);
                vec2 = _mm_add_ps(vec2, vec0);
                vec0 = _mm_load_ps(ar + (k+5)*4);
                vec1 = _mm_load_ps(prWeight + (k+5)*4);
                vec0 = _mm_mul_ps(vec0, vec1);
                vec2 = _mm_add_ps(vec2, vec0);
                vec0 = _mm_load_ps(ar + (k+6)*4);
                vec1 = _mm_load_ps(prWeight + (k+6)*4);
                vec0 = _mm_mul_ps(vec0, vec1);
                vec2 = _mm_add_ps(vec2, vec0);
                vec0 = _mm_load_ps(ar + (k+7)*4);
                vec1 = _mm_load_ps(prWeight + (k+7)*4);
                vec0 = _mm_mul_ps(vec0, vec1);
                vec2 = _mm_add_ps(vec2, vec0);
        // }
        /* r = a b c d
        swapLo = b a d c
        sumLo = a+b b+a c+d d+c
        swapHi = c+d c+d a+b a+b
        sum = 4 copies of a+b+d+c
        input is vec2, aux are vec1 & vec0
        */
        /* __m128 swapLo */ vec0 = _mm_shuffle_ps(vec2, vec2, _MM_SHUFFLE(2,3,0,1));
        /* __m128 sumLo  */ vec1 = _mm_add_ps(vec2, vec0);
        /* __m128 swapHi */ vec0 = _mm_shuffle_ps(vec1, vec1, _MM_SHUFFLE(1,1,3,3));
        /* __m128 sum    */ vec2 = _mm_add_ps(vec1, vec0);
        _mm_store_ss(&r, vec2);
}

The trick is in the last few lines. We are not actually interested in any
particular partial sum; the best approach is to keep 4 partial sums and
combine them into the overall sum in a final step. That way no memory
locations need to be accessed at all (besides the input data and the final
result). Note that the sum is calculated 4 times (once in each of the four
32-bit lanes of the SSE register), which is no slower than calculating it in
the lowest lane only. Only the last step (_mm_store_ss(&r, vec2)) performs a
memory write.

It should not be too hard to redo this with gcc intrinsics.
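
For instance, a first rough sketch of the inner dot product with gcc's SSE
builtins (untested, compile with -msse; the function name, the loop
structure and the "n is a multiple of 4" assumption are mine, the shuffle
constants are the ones from listing 6):

typedef float v4sf __attribute__ ((vector_size (16)));

static float DotProductSSE( float *ar, float *prWeight, int n )
{
    /* four partial sums, one per 32-bit lane */
    v4sf acc = (v4sf) { 0.0f, 0.0f, 0.0f, 0.0f };
    int k;

    for( k = 0; k < n; k += 4 )
        acc = __builtin_ia32_addps( acc,
                  __builtin_ia32_mulps( __builtin_ia32_loadaps( ar + k ),
                                        __builtin_ia32_loadaps( prWeight + k ) ) );

    {
        /* horizontal sum, exactly as in listing 6:
           acc    = a    b    c    d
           swapLo = b    a    d    c      (0xB1 == _MM_SHUFFLE(2,3,0,1))
           sumLo  = a+b  a+b  c+d  c+d
           swapHi = c+d  c+d  a+b  a+b    (0x5F == _MM_SHUFFLE(1,1,3,3))
           sum    = a+b+c+d in all four lanes */
        v4sf swapLo = __builtin_ia32_shufps( acc, acc, 0xB1 );
        v4sf sumLo  = __builtin_ia32_addps( acc, swapLo );
        v4sf swapHi = __builtin_ia32_shufps( sumLo, sumLo, 0x5F );
        v4sf sum    = __builtin_ia32_addps( sumLo, swapHi );
        float r[4] __attribute__ ((aligned (16)));

        /* the single memory write at the very end; a storeaps into a
           scratch array here, a movss-style store would do as well */
        __builtin_ia32_storeaps( r, sum );
        return r[0];
    }
}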

Ingo




