From: Alex Bennée
Subject: Re: [qemu-s390x] [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
Date: Tue, 15 Jan 2019 16:01:32 +0000
User-agent: mu4e 1.1.0; emacs 26.1.91

Thomas Huth <address@hidden> writes:
> On 2019-01-15 15:46, Alex Bennée wrote:
>>
>> Peter Maydell <address@hidden> writes:
>>
>>> On Mon, 14 Jan 2019 at 22:48, Alex Bennée <address@hidden> wrote:
>>>>
>>>>
>>>> Richard Henderson <address@hidden> writes:
>>>>> But perhaps
>>>>>
>>>>> unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>>>>> *r = n % d;
>>>>> return n / d;
>>>>>
>>>>> will allow the compiler to do what the assembly does for some 64-bit
>>>>> hosts.
>>>>
>>>> I wonder how much cost is incurred by the jumping to the (libgcc?) div
>>>> helper? Anyone got an s390x about so we can benchmark the two
>>>> approaches?
>>>
>>> The project has an s390x system available; however it's usually
>>> running merge build tests so not so useful for benchmarking.
>>> (I can set up accounts on it but that requires me to faff about
>>> figuring out how to create new accounts :-))
>>
>> I'm happy to leave this up to those who care about s390x host
>> performance (Thomas, Cornelia?). I'm just keen to avoid the divide
>> helper getting too #ifdefy.
>
> Ok, I just did a quick'n'dirty "benchmark" on the s390x that I've got
> available:

Ahh, I should have mentioned we already have the technology for this ;-)

If you build the fpu/next tree on an s390x you can then run:

  ./tests/fp/fp-bench f64_div

with and without the CONFIG_128 path. To get an idea of the real-world
impact you can compile a foreign binary and run it on an s390x system
with:

  $QEMU ./tests/fp/fp-bench f64_div -t host

That will give you the peak performance, assuming your program is
doing nothing but f64_div operations. If the two QEMU builds are basically
in the same ballpark then the difference isn't big enough to matter. That said:

> #include <stdio.h>
> #include <time.h>
> #include <stdint.h>
>
> uint64_t udiv_qrnnd1(uint64_t *r, uint64_t n1, uint64_t n0, uint64_t d)
> {
>     unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>     asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>     *r = n >> 64;
>     return n;
> }
>
> uint64_t udiv_qrnnd2(uint64_t *r, uint64_t n1, uint64_t n0, uint64_t d)
> {
>     unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>     *r = n % d;
>     return n / d;
> }
>
<snip>
>
> int main()
> {
>     uint64_t r = 0, n1 = 0, n0 = 0, d = 0;
>     uint64_t rs = 0, rn = 0;
>     clock_t start, end;
>     long i;
>
>     start = clock();
>     for (i=0; i<200000000L; i++) {
>         n1 += 3;
>         n0 += 987654321;
>         d += 0x123456789;
>         rs += udiv_qrnnd1(&r, n1, n0, d);
>         rn += r;
>     }
>     end = clock();
>     printf("test 1: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs,
>            rn);
>
>     r = n1 = n0 = d = rs = rn = 0;
>
>     start = clock();
>     for (i=0; i<200000000L; i++) {
>         n1 += 3;
>         n0 += 987654321;
>         d += 0x123456789;
>         rs += udiv_qrnnd2(&r, n1, n0, d);
>         rn += r;
>     }
>     end = clock();
>     printf("test 2: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs,
>            rn);
>
>     r = n1 = n0 = d = rs = rn = 0;
>
>     start = clock();
>     for (i=0; i<200000000L; i++) {
>         n1 += 3;
>         n0 += 987654321;
>         d += 0x123456789;
>         rs += udiv_qrnnd3(&r, n1, n0, d);
>         rn += r;
>     }
>     end = clock();
>     printf("test 3: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs,
>            rn);
>
>     return 0;
> }
>
> ... and results with GCC v8.2.1 are (using -O2):
>
> test 1: time=609 , rs=2264924160200000000 , rn = 6136218997527160832
> test 2: time=10127 , rs=2264924160200000000 , rn = 6136218997527160832
> test 3: time=2350 , rs=2264924183048928865 , rn = 4842822048162311089
>
> Thus the int128 version is the slowest!

I'd expect a little slowdown due to the indirection into libgcc, but
that seems pretty high.

>
> ... but at least it gives the same results as the DLGR instruction. The 64-bit
> version gives different results - do we have a bug here?
>
> Results with Clang v7.0.1 (using -O2, too) are these:
>
> test 2: time=5035 , rs=2264924160200000000 , rn = 6136218997527160832
> test 3: time=1970 , rs=2264924183048928865 , rn = 4842822048162311089

You can run:

  ./tests/fp/fp-test f64_div -l 2 -r all

for a proper comprehensive test.

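For a quick standalone sanity check outside the QEMU tree, something like
the sketch below could cross-check the __int128 variant against an
independent bit-at-a-time reference, which helps decide which of two
differing results to trust. The names udiv_qrnnd_ref and count_mismatches
are made up for illustration and are not QEMU functions. The input sequence
simply mirrors the benchmark above, and the code assumes a compiler that
provides unsigned __int128 (GCC or Clang on a 64-bit host).

```c
#include <assert.h>
#include <stdint.h>

/* Candidate under test: the __int128 form quoted in this thread. */
uint64_t udiv_qrnnd_int128(uint64_t *r, uint64_t n1, uint64_t n0, uint64_t d)
{
    unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
    *r = n % d;
    return n / d;
}

/*
 * Hypothetical reference: restoring shift-subtract division that
 * produces one quotient bit per iteration. Slow, but easy to audit.
 * Requires n1 < d so that the quotient fits in 64 bits.
 */
uint64_t udiv_qrnnd_ref(uint64_t *r, uint64_t n1, uint64_t n0, uint64_t d)
{
    uint64_t rem = n1, q = 0;

    assert(n1 < d);
    for (int i = 63; i >= 0; i--) {
        uint64_t carry = rem >> 63;         /* bit about to be shifted out */
        rem = (rem << 1) | ((n0 >> i) & 1); /* bring down next dividend bit */
        q <<= 1;
        if (carry || rem >= d) {
            rem -= d;                       /* wraps to the right value if carry */
            q |= 1;
        }
    }
    *r = rem;
    return q;
}

/* Walk the same input sequence as the benchmark and count disagreements. */
long count_mismatches(long iterations)
{
    uint64_t n1 = 0, n0 = 0, d = 0;
    long bad = 0;

    for (long i = 0; i < iterations; i++) {
        uint64_t r_a, r_b;

        n1 += 3;
        n0 += 987654321;
        d += 0x123456789;
        uint64_t q_a = udiv_qrnnd_int128(&r_a, n1, n0, d);
        uint64_t q_b = udiv_qrnnd_ref(&r_b, n1, n0, d);
        bad += (q_a != q_b || r_a != r_b);
    }
    return bad;
}
```

Calling count_mismatches() from a small main() and checking for a zero
return is enough for a smoke test. Any other implementation can be swapped
in for udiv_qrnnd_int128 to compare it against the same reference.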
--
Alex Bennée