qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 20/37] target/i386: reimplement 0x0f 0x60-0x6f, add AVX


From: Richard Henderson
Subject: Re: [PATCH 20/37] target/i386: reimplement 0x0f 0x60-0x6f, add AVX
Date: Tue, 13 Sep 2022 12:35:33 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.11.0

On 9/13/22 11:56, Paolo Bonzini wrote:
On Mon, Sep 12, 2022 at 1:41 PM Richard Henderson
<richard.henderson@linaro.org> wrote:

On 9/12/22 00:04, Paolo Bonzini wrote:
+/*
+ * 00 = p*  Pq, Qq (if mmx not NULL; no VEX)
+ * 66 = vp* Vx, Hx, Wx
+ *
+ * These are really the same encoding, because 1) V is the same as P when VEX.V
+ * is not present 2) P and Q are the same as H and W apart from MM/XMM
+ */
+static inline void gen_binary_int_sse(DisasContext *s, CPUX86State *env, 
X86DecodedInsn *decode,
+                                      SSEFunc_0_eppp mmx, SSEFunc_0_eppp xmm, 
SSEFunc_0_eppp ymm)

No need to inline.

Yes and no, the compiler should indeed be able to figure it out, but
both the assert() and the calls are meant to be optimized out by
inlining. So this kind of function would be even an always_inline
candidate.

Yes, I get that, I just prefer by default to allow the compiler to figure it out. Obviously there are parts of the code base where we use always_inline and more, but this part is never going to be performance critical.

Over-use of inline generally leads to Werror from clang, for the unused 
function case.

I'm not sure about that, because there are quite a few cases handled
by more complex gen_* functions, which are helper-based (so not simple
calls to gvec functions where you have maxsz/oprsz) and are not
handled by the common gen_*_sse. For example gen_CVTPI2Px,
gen_MASKMOV, gen_PSRLDQ_i, gen_SSE4a_I, gen_VCVTSI2Sx, ...  All of
these would have to add extra code to set the pointers and to clear
the high ymm bits.

Fair.

For gen_load, however, i can delay the generation using something like

static inline TCGv_ptr get_ptr0(DisasContext *s)
{
     if (s->ptr0) {
         return s->ptr0;
     }
     s->ptr0 = tcg_temp_new_ptr();
     tcg_gen_add(s->ptr0, cpu_env, ...);
     return s->ptr0;
}

Sure.

For gen_writeback, keeping gen_writeback eliminates duplicated code
and keeps the phases of disas_insn_new separated, so I prefer it
slightly. For now I'd rather leave it as is; with the above get_ptr0()
function that creates s->ptr0 lazily, perhaps gen_writeback() could do
it only if s->ptr0 is set (suggesting that a helper was used), while
gvec helpers would use the oprsz<maxsz feature. There's something to
be said for keeping the initial implementation simple of course,
especially since it's already slightly better than the code produced
by the existing decoder.

Also fair. Let's ignore the max argument for now, and address it in a subsequent phase, where we also convert more operations to gvec.

This could also be

      tcg_gen_gvec_dup_i64(MO_64, offset, 8, sse_vec_max_len, s->T1);

Yeah, it can be something like

     case MO_32:
         tcg_gen_trunc_tl_i32(s->tmp3_i32, s->T1);
         tcg_gen_gvec_dup_i32(MO_32, decode->op[0].offset, 4, vec_len,
s->tmp3_i32);
         break;


Actually, this doesn't work, because minimum vector size is 8.
This will hit the assert in check_size_align().

I've just realized that we can't just extend i32 to i64, as I was suggesting, because that will fall foul of big-endian host (L(0) is at the top half of Q(0)). So best to keep your zero + store.


r~



reply via email to

[Prev in Thread] Current Thread [Next in Thread]