Subject: [Libunwind-devel] [PATCH 0/1] Fast back-trace for x86_64 for only collecting the call stack
From: Lassi Tuura
Date: Sat, 24 Apr 2010 11:37:12 +0200
Hi,
This patch adds a new function that performs a pure stack walk without
unwinding, functionally similar to backtrace() but accelerated by an
address attribute cache the caller maintains across calls.
The feature is for now implemented only for x86_64 Linux. The patch
also slightly improves the DWARF-less, RBP-based frame-chain traversal.
The patch adds a new test to ensure the feature works properly,
i.e. that it returns the same addresses as backtrace(). The test has a
fudge factor: it seems backtrace() may return off-by-one addresses,
which the client probably shouldn't use for symbol lookup anyway;
please see the ongoing discussion on is_interrupted vs. is_signal
vs. use_prev_instr.
Some statistics on the unwinding improvements follow. I used igprof
on three single-threaded applications with memory allocation tracing
sampling on every malloc() call, and statistical performance profiling
sampling at a 6.0 ms interval. Without instrumentation the applications
run 77-734 seconds, fully utilise one CPU, and perform 63M-606M memory
allocations: 800-900k per second on average. Each application loads
250 MB worth of code from 594 shared libraries.
The test used a 2-million-entry address cache and hit up to 282k unique
call sites. The trace spent 100-130 TSC cycles per cached address on
RHEL 5.4, GCC 4.5.0, 2x4-core Intel Xeon E5410 @ 2.33 GHz, 16 GB RAM.
In the results, "walks" is the number of stack walks; "frames" the
total number of stack frames; "steps" the number of unw_step() calls;
"orig" and "prof" the user + system times in seconds for the original
and instrumented application runs. System time is negligible in all
cases except normal memory tracing, which had a ~15% system-to-user
time ratio.
Performance profiling results, normal vs. fast tracing.
  app        orig   prof           walks     frames    steps
  minbias    76.8   78.9  +2.7%   12'993    533'374  =frames
  ttbar     333.8  338.6  +1.4%   56'167  2'076'515  =frames
  qcd       733.6  740.1  +0.9%  123'045  4'294'584  =frames
  minbias    76.8   78.2  +1.8%   12'866    530'533   18'203
  ttbar     333.8  334.2  +0.1%   55'441  2'048'892   39'702
  qcd       733.6  732.2  -0.2%  121'723  4'250'743   50'711
Memory allocation tracing results, normal vs. fast tracing.
  app        orig   prof            walks   frames    steps
  minbias    76.8   1145  +1390%   63.3M    2305M   =frames
  ttbar     333.8   5137  +1439%  299.6M   10193M   =frames
  qcd       733.6  10507  +1332%  605.6M   20210M   =frames
  minbias    76.8  299.1   +289%   63.3M    2305M   277'744
  ttbar     333.8   1185   +255%  299.6M   10193M   281'636
  qcd       733.6   2400   +227%  605.6M   20210M   281'576
The probed-hash address cache for the fast trace must be large enough,
or performance falls off a cliff, becoming *much* slower than the
unw_step() loop was. Combined probe distribution for the three memory
profiles:
781397 >trace_lookup: updating slot after 0 steps
50970 >trace_lookup: updating slot after 1 steps
6911 >trace_lookup: updating slot after 2 steps
1235 >trace_lookup: updating slot after 3 steps
330 >trace_lookup: updating slot after 4 steps
68 >trace_lookup: updating slot after 5 steps
31 >trace_lookup: updating slot after 6 steps
8 >trace_lookup: updating slot after 7 steps
3 >trace_lookup: updating slot after 8 steps
1 >trace_lookup: updating slot after 9 steps
1 >trace_lookup: updating slot after 10 steps
1 >trace_lookup: updating slot after 11 steps
Regards,
Lassi
---
Lassi Tuura (1):
Fast back-trace for x86_64 for only collecting the call stack.
include/dwarf.h | 1
include/libunwind-x86_64.h | 33 +++
include/tdep-arm/libunwind_i.h | 1
include/tdep-hppa/libunwind_i.h | 1
include/tdep-ia64/libunwind_i.h | 1
include/tdep-mips/libunwind_i.h | 1
include/tdep-ppc32/libunwind_i.h | 3
include/tdep-ppc64/libunwind_i.h | 3
include/tdep-x86/libunwind_i.h | 1
include/tdep-x86_64/libunwind_i.h | 6 -
src/Makefile.am | 8 -
src/arm/init.h | 1
src/dwarf/Gparser.c | 4
src/hppa/init.h | 1
src/mips/init.h | 1
src/ppc32/init.h | 1
src/ppc64/init.h | 1
src/x86/init.h | 1
src/x86_64/Ginit_local.c | 4
src/x86_64/Gos-linux.c | 33 +--
src/x86_64/Gstash_frame.c | 92 ++++++++
src/x86_64/Gstep.c | 44 +++-
src/x86_64/Gtrace.c | 401 +++++++++++++++++++++++++++++++++++++
src/x86_64/Lstash_frame.c | 5
src/x86_64/Ltrace.c | 5
src/x86_64/init.h | 1
tests/Gtest-trace.c | 265 ++++++++++++++++++++++++
tests/Ltest-trace.c | 5
tests/Makefile.am | 3
tests/check-namespace.sh.in | 6 +
30 files changed, 892 insertions(+), 41 deletions(-)
create mode 100644 src/x86_64/Gstash_frame.c
create mode 100644 src/x86_64/Gtrace.c
create mode 100644 src/x86_64/Lstash_frame.c
create mode 100644 src/x86_64/Ltrace.c
create mode 100644 tests/Gtest-trace.c
create mode 100644 tests/Ltest-trace.c