Re: [lmi] Treacherous gcc-10 defect


From: Vadim Zeitlin
Subject: Re: [lmi] Treacherous gcc-10 defect
Date: Sat, 19 Dec 2020 20:37:53 +0100

 Sorry for the delay in responding. I had hoped to be able to shed some
extra light on this issue but, after trying and failing to find anything
really useful, I've decided to reply at last, even though I still don't
have anything constructive to propose.

On Thu, 10 Dec 2020 23:14:10 +0000 Greg Chicares <gchicares@sbcglobal.net> 
wrote:

GC> On 12/10/20 1:26 PM, Greg Chicares wrote:
GC> [...a certain '-fno-omit-frame-pointer' testcase...]
GC> > succeeds with gcc-10, but fails with gcc-8. Accordingly, I'll
GC> > restrict the '-fomit-frame-pointer' workaround to
GC> > x86_64-w64-mingw32-gcc-8.x only, so that it doesn't propagate
GC> > to gcc-10 builds when we upgrade the compiler (very soon).
GC> 
GC> TL;DR: x86_64-w64-mingw32 gcc-10 seems to need '-fomit-frame-pointer'.

 This looks extraordinarily bad. I couldn't find any existing bug for this
in the gcc bugzilla. Do you think it would be worth spending time on
producing a minimal reproducible test case and reporting it?

GC> Earlier today I ran this command in a chroot with MinGW-w64 gcc-10:
GC>   make raze; ./nychthemeral_test.sh
GC>   ['raze' is a brutally emphatic 'clean' target]
GC> and observed the following output (only with gcc-10; not with
GC> MinGW-w64 gcc-8).
GC> 
GC> LMI_TRIPLET = "x86_64-w64-mingw32"
GC>   Production system built--ready to start GUI test in another session.
GC> ???? test failed:   '0.666666666666667' == '0'
GC> ???? test failed:   '0.666666666666667' == '0'
GC> ???? test failed:   '0.666666666666667' == '0'
GC> ???? test failed:   '0.666666666666667' == '0'
GC> ???? test failed:   '0.666666666666667' == '0'
GC> ???? test failed:   '0.666666666666667' == '0'
GC> ???? 6 test errors detected; 472 tests succeeded
GC> ???? returning with error code 201
GC> ???? 
GC> ???? errors detected; see stdout for details

 I can confirm that I can reproduce the problem here too, using the x86_64
cross-compiler. It doesn't happen when using the i686 MinGW cross-compiler
or the native x86_64 compiler.

GC> The failure arose here:
GC>     test_interconvertibility(y, "0.666666666666667" , __FILE__, __LINE__);
GC> and, at a wild guess, it looks as though the compiler treated the
GC> expression "(2.0 / 3.0)" as integral.

 Looking at the generated code, this is not the case. What actually happens
here is this:

1. The compiler doesn't compute 2.0/3.0 itself at compile time; it
   generates code to do it at run time, using SSE2 xmm registers. I'm not
   sure what exactly prevents it from doing constant folding, but
   something does.

2. When calling test_interconvertibility((2.0 / 3.0)) at line 386 of
   numeric_io_test.cpp, it doesn't recompute this result but reuses the
   same xmm6 register it used for a previous call to the same function at
   line 280, i.e. it simply does "$xmm0 = $xmm6" before calling the
   function, which takes its first double parameter in xmm0 in the usual
   calling convention.

3. The calling code is exactly the same with -fomit-frame-pointer and with
   -fno-omit-frame-pointer. The code in test_interconvertibility() itself
   is quite different (not only because it uses rbp as another general
   purpose register, which it does only a couple of times anyhow, but
   also because all stack addresses are different, so it's a bit difficult
   to reconcile the two versions), but it does save and restore xmm6 on
   entering and exiting the function in both cases.

4. Yet something does change the value of xmm6 between the two calls in
   the build using -fno-omit-frame-pointer, because it's clearly wrong
   when the function is called again. It's not actually 0, but
   4.94066e-324, which is less random than it might appear: it is the
   IEEE-754 64-bit double whose bit pattern, read as an integer, is
   exactly 1. I haven't yet been able to hunt down where exactly it is
   being changed, but I think I should be able to if I spend more time on
   this. The problem is that I'm not sure it would be really useful;
   producing a minimal example reproducing the problem would probably be
   more so, but that would still take quite some time. Do you think it
   would be worth spending it on this?
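
 To make the remark about 4.94066e-324 in item 4 concrete, here is a
minimal sketch, assuming the usual 64-bit IEEE-754 double. The helper names
(double_from_bits, bits_from_double, check_bit_pattern_one) are made up for
this illustration and are not part of lmi. It shows that the integer 1,
reinterpreted as a double, is the smallest positive subnormal, whereas the
double 1.0 has a quite different bit pattern:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: reinterpret a 64-bit integer as a double.
double double_from_bits(std::uint64_t bits)
{
    static_assert(sizeof(double) == sizeof bits, "expects a 64-bit double");
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// Hypothetical helper: the inverse of double_from_bits().
std::uint64_t bits_from_double(double d)
{
    static_assert(sizeof(double) == sizeof(std::uint64_t), "expects a 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}

// Illustrative self-check of the claims made in the text above.
void check_bit_pattern_one()
{
    double const d = double_from_bits(1);
    // Bit pattern 1 is the smallest positive subnormal (about 4.94066e-324).
    assert(d == std::numeric_limits<double>::denorm_min());
    assert(0.0 < d);
    // By contrast, the value 1.0 is stored as 0x3FF0000000000000.
    assert(bits_from_double(1.0) == 0x3FF0000000000000ULL);
}
```

 So a register holding the integer 1 in its low 64 bits would print as
4.94066e-324 when interpreted as a double, which is consistent with the
value observed, although it doesn't by itself say what wrote it there.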

GC> An alternative and wilder guess
GC> is that '-fno-omit-frame-pointer' causes the error, but generating
GC> incorrect code without any diagnostic seemed so implausible

 Yet this is almost certainly what happens here.
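
 As a possible starting point for a reduced test case, here is a rough
skeleton of the calling pattern described above: one value computed at run
time and passed to the same non-inlined function twice. Everything in it is
an assumption made for illustration: formats_as() is a hypothetical
stand-in for test_interconvertibility(), the volatile variable is just one
way of preventing constant folding, and this code is not known to trigger
the defect. It passes when compiled correctly.

```cpp
#include <cassert> // assert(), for callers checking the result
#include <cstdio>
#include <cstring>

#if defined __GNUC__
#   define NOINLINE __attribute__((noinline))
#else
#   define NOINLINE
#endif

// Hypothetical stand-in for test_interconvertibility(): format the
// argument with 15 decimals and compare it with the expected text.
NOINLINE bool formats_as(double d, char const* expected)
{
    char buf[64];
    int const n = std::snprintf(buf, sizeof buf, "%.15f", d);
    if(n < 0 || static_cast<int>(sizeof buf) <= n)
        return false; // formatting failed or output was truncated
    return 0 == std::strcmp(buf, expected);
}

// Compute 2.0/3.0 at run time once, then pass it to the same function
// twice, so that the compiler may keep it in a callee-saved register
// across the first call. Returns the number of failed comparisons.
NOINLINE int count_failures()
{
    volatile double two = 2.0; // assumption: defeats constant folding
    double const y = two / 3.0;
    int failures = 0;
    if(!formats_as(y, "0.666666666666667")) ++failures;
    // Whatever clobbers the register would have to happen here.
    if(!formats_as(y, "0.666666666666667")) ++failures;
    return failures;
}
```

 A caller would check assert(count_failures() == 0), building once with
-fomit-frame-pointer and once with -fno-omit-frame-pointer and comparing
the results.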

 Regards,
VZ
