[Libunwind-devel] libunwind x86-64 optimisations?
From: Lassi Tuura
Subject: [Libunwind-devel] libunwind x86-64 optimisations?
Date: Mon, 6 Jul 2009 13:24:14 +0200
Hi,
I am one of the maintainers of a certain performance and memory use
profiling package. We've worked hard at providing a low-overhead
profiler for big software, and think we have done a fairly decent job
on IA32 Linux. (It's called igprof.)
In order to support x86-64, I've looked at using libunwind 0.99.
It seems to work mostly for us, which is a relief, but I have a few
concerns and patches and would really welcome your feedback.
#1) libunwind seems to be reliable but is not 100% async-signal safe. In
particular, if called from a signal handler (SIGPROF) at an inopportune
time, it may deadlock. Specifically, if we get a profiling signal
exactly when the dynamic linker is inside pthread_mutex_* or is already
holding a lock, and libunwind calls into dl_iterate_phdr() (NB: from
the same thread that already holds the lock or is trying to take it),
bad things will happen, usually a deadlock.
I'm currently entertaining the theory that either walking the ELF
headers in memory directly (without dl_iterate_phdr() and its locks),
at the risk of a crash, is still better than deadlocking inside the
dynamic linker, or I should try to discard profile signals while inside
the dynamic linker.
Thoughts?
#2) libunwind appears to make heavy use of sigprocmask(), e.g. around
every mutex lock/unlock operation. This causes a colossal slow-down:
two extra syscalls per round trip. Removing those syscalls makes a
massive performance difference, but I assume they were there for a reason?
(Included in patch 2.)
#3) libunwind resorts to scanning /proc/self/maps if it can't find an
executable ELF image (find_binary_for_address). On a platform we use
a lot (RHEL4-based) this always happens for the main executable,
causing a scan of /proc/self/maps for every stack level in the
main program, which is, ahem, slow :-) I made a hack-ish (not thread-safe)
fix in two parts: determine the value just once, and use
readlink("/proc/self/exe") instead of scanning /proc/self/maps.
I've never seen a null pointer for anything other than the
main program. Can that really happen in other circumstances? Do you
do it the present way because of some JIT situation? (Included in
patch 2.)
#4) We appear to be blessed with a lot of libraries which have
insufficient dwarf unwind info, and by the looks of it, the RHEL4
GLIBC in particular. It looks like dwarf_find_save_locs() caches
recent uses, but falls back on a full search if not found in cache.
It turned out in our case the cache was hardly used, because it
didn't cache negative results. I added some code to rs_new() to
remember "bogus" (negative) results, and code in dwarf_find_save_locs()
to cache negative replies from fetch_proc_info() and
create_state_record_for(), and it helped performance a lot. However,
I am really unsure whether I did it correctly and would appreciate
another pair of eyes on this. (Included in patch 2, at the bottom.)
With all these changes libunwind is better, but we still pay a very
heavy penalty for unwinding; it's about 150 times slower than our
IA-32 unwinding code, and about 5-10 times slower in real-world apps
(tracking every memory allocation at ~1M allocations per second). I
do realise x86-64 is much harder to unwind, but I am looking for ways
to optimise it further; some ideas are below.
I realise our situation is pushing it a little, what with multiple
threads, ~200 MB of code and a VSIZE of 1-3 GB.
Patches:
1) C++ compatibility clean-up
http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/COMP/CMSDIST/libunwind-cleanup.patch?revision=1.1&view=markup
2) Various performance optimisations
http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/COMP/CMSDIST/libunwind-optimise.patch?revision=1.5&view=markup
Ideas:
1) Use bigger caches (DWARF_LOG_UNW_CACHE_SIZE).
2) Try to determine which frames are "varying", i.e. use VLAs or
alloca(). If there are none in the call stack, just cache the
difference between incoming and outgoing CFA for every call site, and
unwind just that way. Otherwise revert to the slow unwind, at least
until you get past the varying frames. Specifically, walk from the
top, probing a cache for the CFA delta plus a varying marker. If you
make it all the way through with the cache, return the call stack. If
not, switch back to the normal slow unwind, update the cache, and go
all the way to the top. Alas, I currently have no idea how to identify
alloca/VLA-using frames. Any ideas?
Lassi