On 11/7/24 12:58, Daniel Henrique Barboza wrote:
On 11/4/24 9:48 AM, Richard Henderson wrote:
On 10/30/24 15:25, Paolo Savini wrote:
On 10/30/24 11:40, Richard Henderson wrote:
__builtin_memcpy DOES NOT equal VMOVDQA
I am aware of this. I took __builtin_memcpy as a generic enough way
to emulate loads and stores, one that should let several hosts
generate the widest load/store instructions they can. On x86 I see
it generates vmovdqu/movdqu, which are not always guaranteed to be
atomic, though x86 does guarantee atomicity when the memory address
is aligned to 16 bytes.
No, AMD guarantees MOVDQU is atomic if aligned, Intel does not.
See the comment in util/cpuinfo-i386.c, and the two
CPUINFO_ATOMIC_VMOVDQ[AU] bits.
See also host/include/*/host/atomic128-ldst.h, HAVE_ATOMIC128_RO,
and atomic16_read_ro.
Not that I think you should use that here; it's complicated, and I
think you're better off relying on the code in accel/tcg/ when more
than byte atomicity is required.
Not sure if that's what you meant, but I didn't find any clear example of
multi-byte atomicity using qatomic_read() and friends that would be closer
to what memcpy() is doing here. I found one example in bdrv_graph_co_rdlock()
that seems to use a memory barrier via smp_mb() and qatomic_read() inside a
loop, but I don't understand that code well enough to say.
Memory barriers provide ordering between loads and stores, but they
cannot be used to address atomicity of individual loads and stores.
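To illustrate the distinction, here is a minimal sketch in C using the GCC/Clang __atomic builtins. Both helper names are hypothetical and this is not QEMU code. The first helper makes each 64-bit element a single atomic access. The second only adds fences around a memcpy, which orders the copy but does nothing to stop an element from being torn. It assumes a host where 8-byte atomic accesses are available natively.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * Each 64-bit element is loaded and stored with one atomic access, so a
 * concurrent observer never sees a torn element. The copy as a whole is
 * still not atomic: an observer may see some elements old and some new.
 */
static void copy_elems_atomic64(uint64_t *dst, const uint64_t *src, size_t n)
{
    /* Element atomicity here relies on naturally aligned elements. */
    assert(((uintptr_t)dst & 7) == 0 && ((uintptr_t)src & 7) == 0);

    for (size_t i = 0; i < n; i++) {
        uint64_t v = __atomic_load_n(&src[i], __ATOMIC_RELAXED);
        __atomic_store_n(&dst[i], v, __ATOMIC_RELAXED);
    }
}

/*
 * Hypothetical contrast case.
 * The fences order this copy against earlier and later accesses, but the
 * compiler is free to implement memcpy with accesses of any width, so an
 * individual element may still be observed half-written.
 */
static void copy_with_fences(void *dst, const void *src, size_t len)
{
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
    __builtin_memcpy(dst, src, len);
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
}
```

Single-threaded, both helpers produce the same bytes; the difference only shows up when another thread reads the destination while the copy is in flight.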
I'm also wondering if a common pthread_lock() wrapping up these memcpy() calls
would suffice in this case. Even if we can't guarantee that __builtin_memcpy()
will use arch-specific vector insns on the host, it would already be a faster
path than falling back to fn(...).
Locks would certainly not be faster than calling the accel/tcg function.
In a quick detour, I'm not sure if we really considered how ARM SVE implements
these helpers. E.g. gen_sve_str():
https://gitlab.com/qemu-project/qemu/-/blob/master/target/arm/tcg/translate-sve.c#L4182
Note that ARM SVE defines these instructions to have byte atomicity.
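As a rough sketch of what that distinction can mean for a store helper: when only byte atomicity is required, a plain memcpy is enough, since no host can tear a single byte; for wider elements the helper defers to a slower path. Every name below is hypothetical, and slow_store() is only a placeholder so that the sketch runs. It does not provide element atomicity itself; a real slow path would be the code in accel/tcg/.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature for a slow-path store. */
typedef void slow_store_fn(void *host, const void *reg, size_t len);

/* Counts slow-path calls so the dispatch is observable. */
static unsigned slow_path_calls;

/* Placeholder slow path: copies the bytes and records that it was used. */
static void slow_store(void *host, const void *reg, size_t len)
{
    const uint8_t *s = reg;
    uint8_t *d = host;

    slow_path_calls++;
    for (size_t i = 0; i < len; i++) {
        d[i] = s[i];
    }
}

/*
 * Hypothetical dispatch: take the memcpy fast path only when the guest
 * architecture asks for nothing stronger than byte atomicity.
 */
static void store_vector(void *host, const void *reg, size_t len,
                         size_t elem_size, slow_store_fn *slow)
{
    assert(elem_size != 0 && len % elem_size == 0);

    if (elem_size == 1) {
        /* Byte atomicity only: any access width the host picks is fine. */
        __builtin_memcpy(host, reg, len);
    } else {
        /* Wider elements must not tear: leave it to the slow path. */
        slow(host, reg, len);
    }
}
```

With elem_size == 1 the call never reaches slow_store(); with elem_size == 8 it always does.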
r~