qemu-devel

Re: [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction


From: Stefan Brankovic
Subject: Re: [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction
Date: Thu, 29 Aug 2019 15:34:24 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.8.0


On 27.8.19. 20:52, Richard Henderson wrote:
On 8/27/19 2:37 AM, Stefan Brankovic wrote:
+    for (i = 0; i < 4; i++) {
+        switch (i) {
+        case 0:
+            /*
+             * Get high doubleword of vA to perform 6-5-5 pack of pixels
+             * 1 and 2.
+             */
+            get_avr64(avr, VA, true);
+            tcg_gen_movi_i64(result, 0x0ULL);
+            break;
+        case 1:
+            /*
+             * Get low doubleword of vA to perform 6-5-5 pack of pixels
+             * 3 and 4.
+             */
+            get_avr64(avr, VA, false);
+            break;
+        case 2:
+            /*
+             * Get high doubleword of vB to perform 6-5-5 pack of pixels
+             * 5 and 6.
+             */
+            get_avr64(avr, VB, true);
+            tcg_gen_movi_i64(result, 0x0ULL);
+            break;
+        case 3:
+            /*
+             * Get low doubleword of vB to perform 6-5-5 pack of pixels
+             * 7 and 8.
+             */
+            get_avr64(avr, VB, false);
+            break;
+        }
+        /* Perform the packing for 2 pixels (one pixel per iteration). */
+        tcg_gen_movi_i64(tmp, 0x0ULL);
+        for (j = 0; j < 2; j++) {
+            tcg_gen_shri_i64(shifted, avr, (j * 16 + 3));
+            tcg_gen_andi_i64(shifted, shifted, mask1 << (j * 16));
+            tcg_gen_or_i64(tmp, tmp, shifted);
+
+            tcg_gen_shri_i64(shifted, avr, (j * 16 + 6));
+            tcg_gen_andi_i64(shifted, shifted, mask2 << (j * 16));
+            tcg_gen_or_i64(tmp, tmp, shifted);
+
+            tcg_gen_shri_i64(shifted, avr, (j * 16 + 9));
+            tcg_gen_andi_i64(shifted, shifted, mask3 << (j * 16));
+            tcg_gen_or_i64(tmp, tmp, shifted);
+        }
+        if ((i == 0) || (i == 2)) {
+            tcg_gen_shli_i64(tmp, tmp, 32);
+        }
+        tcg_gen_or_i64(result, result, tmp);
+        if (i == 1) {
+            /* Place packed pixels 1:4 to high doubleword of vD. */
+            tcg_gen_mov_i64(result1, result);
+        }
+        if (i == 3) {
+            /* Place packed pixels 5:8 to low doubleword of vD. */
+            tcg_gen_mov_i64(result2, result);
+        }
+    }
+    set_avr64(VT, result1, true);
+    set_avr64(VT, result2, false);
I really have a hard time believing that it is worthwhile to inline all of this
code.  By my count this is 82 non-move opcodes.  That is a *lot* of inline
expansion.

However, I can well imagine that the existing out-of-line helper is less than
optimal.

-void helper_vpkpx(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
-{
-    int i, j;
-    ppc_avr_t result;
-#if defined(HOST_WORDS_BIGENDIAN)
-    const ppc_avr_t *x[2] = { a, b };
-#else
-    const ppc_avr_t *x[2] = { b, a };
-#endif
-
-    VECTOR_FOR_INORDER_I(i, u64) {
-        VECTOR_FOR_INORDER_I(j, u32) {
-            uint32_t e = x[i]->u32[j];
Double indirect loads?

-
-            result.u16[4 * i + j] = (((e >> 9) & 0xfc00) |
-                                     ((e >> 6) & 0x3e0) |
-                                     ((e >> 3) & 0x1f));
Store to temporary ...

-        }
-    }
-    *r = result;
... and then copy?

Try replacing the existing helper with something like the following.


r~



static inline uint64_t pkpx_1(uint64_t a, int shr, int shl)
{
     uint64_t r;

     r  = ((a >> (shr + 9)) & 0x3f) << shl;
     r |= ((a >> (shr + 6)) & 0x1f) << shl;
     r |= ((a >> (shr + 3)) & 0x1f) << shl;

     return r;
}

static inline uint64_t pkpx_2(uint64_t ah, uint64_t al)
{
     return pkpx_1(ah, 32, 48)
          | pkpx_1(ah,  0, 32)
          | pkpx_1(al, 32, 16)
          | pkpx_1(al,  0,  0);
}

void helper_vpkpx(uint64_t *r, uint64_t *a, uint64_t *b)
{
     uint64_t rh = pkpx_2(a->VsrD(0), a->VsrD(1));
     uint64_t rl = pkpx_2(b->VsrD(0), b->VsrD(1));
     r->VsrD(0) = rh;
     r->VsrD(1) = rl;
}

I implemented vpkpx as you suggested above, with small modifications so that it builds and gives the correct result. It looks like this:

static inline uint64_t pkpx_1(uint64_t a, int shr, int shl)
{
    uint64_t r;

    r  = ((a >> (shr + 9)) & 0xfc00) << shl;
    r |= ((a >> (shr + 6)) & 0x3e0) << shl;
    r |= ((a >> (shr + 3)) & 0x1f) << shl;

    return r;
}

static inline uint64_t pkpx_2(uint64_t ah, uint64_t al)
{
    return pkpx_1(ah, 32, 48)
         | pkpx_1(ah,  0, 32)
         | pkpx_1(al, 32, 16)
         | pkpx_1(al,  0,  0);
}

void helper_vpkpx(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
{
    uint64_t rh = pkpx_2(a->u64[1], a->u64[0]);
    uint64_t rl = pkpx_2(b->u64[1], b->u64[0]);
    r->u64[1] = rh;
    r->u64[0] = rl;
}
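
For reference, the masks above implement the same per-pixel operation as the expression in the existing helper: each 32-bit source pixel is reduced to a 16-bit 1-5-5-5 pixel. A minimal scalar illustration (not part of the patch, just to show where the 0xfc00/0x3e0/0x1f masks come from):

static inline uint16_t pack_one_pixel(uint32_t e)
{
    /*
     * Bit 24 and bits 23:19, 15:11, 7:3 of the source pixel end up
     * in bits 15, 14:10, 9:5 and 4:0 of the result halfword.
     */
    return ((e >> 9) & 0xfc00) |   /* control bit + first 5-bit channel */
           ((e >> 6) & 0x03e0) |   /* second 5-bit channel */
           ((e >> 3) & 0x001f);    /* third 5-bit channel */
}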

I also noticed that this works only on little-endian hosts, so we would need to modify it to support big-endian hosts as well (this shouldn't affect the performance results).
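
One way to make it host-endian clean would be to go back to the VsrD() accessor you used in your suggestion (defined in target/ppc/cpu.h, it selects the right u64[] element on either host). Untested sketch:

void helper_vpkpx(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
{
    /* VsrD(0) is the architecturally high doubleword on either host. */
    uint64_t rh = pkpx_2(a->VsrD(0), a->VsrD(1));
    uint64_t rl = pkpx_2(b->VsrD(0), b->VsrD(1));

    r->VsrD(0) = rh;
    r->VsrD(1) = rl;
}

That is essentially your version, with the corrected masks and the ppc_avr_t prototype.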

Then I ran my performance tests and got the following results (the test calls vpkpx 100000 times; a sketch of that kind of micro-benchmark follows the numbers below):

1) Current helper implementation: ~157 ms

2) Helper implementation you suggested: ~94 ms

3) TCG implementation: ~75 ms
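
The micro-benchmark itself is just a tight loop around the instruction. A hypothetical sketch of that kind of test (not the exact harness used for the numbers above; built with -maltivec and timed under the emulated target) could look like this:

#include <altivec.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    /* volatile keeps the compiler from hoisting or deleting the pack */
    volatile vector unsigned int va = { 0x01020304, 0x05060708,
                                        0x090a0b0c, 0x0d0e0f10 };
    volatile vector unsigned int vb = { 0x11121314, 0x15161718,
                                        0x191a1b1c, 0x1d1e1f20 };
    volatile vector pixel vd;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 100000; i++) {
        vector unsigned int a = va;    /* fresh volatile reads per iteration */
        vector unsigned int b = vb;
        vd = vec_packpx(a, b);         /* vec_packpx() compiles to vpkpx */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("%ld ns\n", (long)((t1.tv_sec - t0.tv_sec) * 1000000000L
                              + (t1.tv_nsec - t0.tv_nsec)));
    return 0;
}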

The attached file contains the assembly code for both the current implementation and the one you suggested, so please take a look at that as well.

Kind Regards,

Stefan

Attachment: vpkpx_assembly.txt
Description: Text document

