libunwind-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Libunwind-devel] Crash in libunwind 0.99 on x86_64


From: Konstantin Belousov
Subject: Re: [Libunwind-devel] Crash in libunwind 0.99 on x86_64
Date: Wed, 21 Apr 2010 15:44:11 +0300
User-agent: Mutt/1.4.2.3i

On Tue, Apr 20, 2010 at 07:02:02PM -0500, Don Maghrak wrote:
> On 4/20/2010 4:25 PM, Lassi Tuura wrote:
> >Hi,
> >
> >>>Thanks Arun. I suspect the git version would fix the crash I saw, if
> >>>it occurred at the exact same address in the first page again, however
> >>>I wanted to understand the likelihood of getting a crash at some other
> >>>place in the future.
> >>
> >>The likelihood depends on the compiler version used, presence of 
> >>handwritten asm, third party libraries without proper unwind information 
> >>etc.
> >>
> >>There are better solutions possible  (eg: implementing interfaces to 
> >>query valid stack addresses in the threading library you're using), but 
> >>they require modifications to other pieces of low level code (eg: libc).
> >
> >We found that when interrupted by signals - such as sampling performance 
> >profiler - it's essential to turn the validation always on. Otherwise 
> >sooner or later the application will crash in libunwind accessing bad 
> >memory address.
> >
> Hi Lassi,
> We find the same with regards to signals and callstack profiling
> in the OpenSpeedShop tool. We typically patch src/x86_64/Gstep.c as the
> systems we currently support typically crashed in access_mem. I understand
> that the libunwind maintainers are concerned with performance issues
> when validation is always on and maybe a configuration option to
> force validation is needed to get such a patch applied (i.e. 
> src/x86_64/Ginit_local.c
> setting c->validate = 1).
> 
> We turn validation on at the "Try DWARF-base unwinding..." in GStep.c:
> *** libunwind-20100123/src/x86_64/Gstep.c       2010-02-08 
> 11:34:10.000000000 -0500
> --- libunwind-0.99-X/src/x86_64/Gstep.c 2009-05-12 15:28:27.000000000 -0500
> ***************
> *** 39,44 ****
> --- 39,47 ----
>          c, (unsigned long long) c->dwarf.ip);
> 
>     /* Try DWARF-based unwinding... */
> +   /* need to validate here too.  Intel compiler generated code
> +    * crashes with segv and sigbus on large mvapich jobs. */
> +   c->validate = 1;
>     ret = dwarf_step (&c->dwarf);
> 
>     if (ret < 0 && ret != -UNW_ENOINFO)
> *** libunwind-20100123/src/x86_64/Gis_signal_frame.c    2010-02-08 
> 11:34:10.000000000 -0500
> --- libunwind-0.99-X/src/x86_64/Gis_signal_frame.c      2009-05-12 
> 15:27:21.000000000 -0500
> ***************
> *** 38,43 ****
> --- 38,44 ----
>     void *arg;
>     int ret;
> 
> +   c->validate = 1;
>     as = c->dwarf.as;
>     a = unw_get_accessors (as);
>     arg = c->dwarf.as_arg;
> 
> This works for us on the nastiest cases we have seen (very large
> simulation code at LLNL) and we do not see a noticeable performance hit
> in the callstack profiler we use. That particular app would eventually
> access bad memory attempting to unwind through Intel's fast memcpy routines.
> We would also notice memory access crashes at high cpu counts when
> profiling large mpi jobs.  The above patch fixed the crashes (we still see
> a very small number of truncated callstacks that are likely related to
> other issues your patches appear to address).
> We have also successfully profiled a large mpi benchmark (12000 cores)
> on a cray-xt5 using libunwind with the above patch.
> 
> Thanks for your work on this!
> 
> regards,
> Don
> 
> >By far the biggest reason for this is inaccurate unwind information for 
> >function epilogues - the exit paths from the function don't have any 
> >unwind info, causing endless havoc if you happen to sample the stack 
> >there. There have been a number of recent updates to GCC on this, but I am 
> >not sure if they all made it even to 4.5.0 which was released just a few 
> >days ago. Anything before 4.5.0 is certainly prone to have significant 
> >issues of this sort.
> >
> >GDB will also fail to produce a useful stack trace in comparable 
> >circumstances. The fix needs to come from the compiler.
> >
> >Similar caveats of course apply to debug info produced by other means. One 
> >version of GLIBC I looked at has incorrect (manually entered) unwind info 
> >for at least one function.
> >
> >Regards,
> >Lassi

I think that access validation currently done by msync(2) is not enough
for rare case of unaligned access at the end of the page with next page
unmapped.

Also, mincore(2) looks like as a better way to test the presence of a
mapping at some address. I use the patch below.

diff --git a/configure.in b/configure.in
index fcc8813..100df65 100644
--- a/configure.in
+++ b/configure.in
@@ -76,7 +76,7 @@ dnl Checks for library functions.
 AC_FUNC_MEMCMP
 AC_TYPE_SIGNAL
 AC_CHECK_FUNCS(dl_iterate_phdr dl_phdr_removals_counter dlmodinfo getunwind \
-               ttrace)
+               ttrace mincore)
 is_gcc_m64() {
  if test `echo $CFLAGS | grep "\-m64" -c` -eq 1 ; then echo ppc64;
  else
diff --git a/src/x86/Ginit.c b/src/x86/Ginit.c
index 949bf28..0af616a 100644
--- a/src/x86/Ginit.c
+++ b/src/x86/Ginit.c
@@ -24,6 +24,10 @@ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN 
AN ACTION
 OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
 WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.  */
 
+#ifdef HAVE_CONFIG_H
+#include <config.h>
+#endif
+
 #include <stdlib.h>
 #include <string.h>
 
@@ -83,6 +87,9 @@ static int
 validate_mem (unw_word_t addr)
 {
   int i, victim;
+#ifdef HAVE_MINCORE
+  char mvec[2]; /* Unaligned access may cross page boundary */
+#endif
 
   addr = PAGE_START(addr);
 
@@ -95,7 +102,11 @@ validate_mem (unw_word_t addr)
        return 0;
     }
 
-  if (msync ((void *) addr, 1, MS_ASYNC) == -1)
+#ifdef HAVE_MINCORE
+  if (mincore ((void *) addr, sizeof (unw_word_t), mvec) == -1)
+#else
+  if (msync ((void *) addr, sizeof (unw_word_t), MS_ASYNC) == -1)
+#endif
     return -1;
 
   victim = lga_victim;
diff --git a/src/x86_64/Ginit.c b/src/x86_64/Ginit.c
index f222bf5..14614f1 100644
--- a/src/x86_64/Ginit.c
+++ b/src/x86_64/Ginit.c
@@ -26,6 +26,10 @@ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN 
AN ACTION
 OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
 WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.  */
 
+#ifdef HAVE_CONFIG_H
+#include <config.h>
+#endif
+
 #include <stdlib.h>
 #include <string.h>
 #include <sys/mman.h>
@@ -86,6 +90,9 @@ static int
 validate_mem (unw_word_t addr)
 {
   int i, victim;
+#ifdef HAVE_MINCORE
+  char mvec[2]; /* Unaligned access may cross page boundary */
+#endif
 
   addr = PAGE_START(addr);
 
@@ -98,7 +105,11 @@ validate_mem (unw_word_t addr)
        return 0;
     }
 
-  if (msync ((void *) addr, 1, MS_ASYNC) == -1)
+#ifdef HAVE_MINCORE
+  if (mincore ((void *) addr, sizeof (unw_word_t), mvec) == -1)
+#else
+  if (msync ((void *) addr, sizeof (unw_word_t), MS_ASYNC) == -1)
+#endif
     return -1;
 
   victim = lga_victim;

Attachment: pgpNbsmpOsM8x.pgp
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]