
[Libunwind-devel] libunwind x86-64 optimisations?


From: Lassi Tuura
Subject: [Libunwind-devel] libunwind x86-64 optimisations?
Date: Mon, 6 Jul 2009 13:24:14 +0200

Hi,

I am one of the maintainers of a performance and memory profiling package called igprof. We've worked hard to provide a low-overhead profiler for big software, and think we have done a fairly decent job on IA-32 Linux.
In order to support x86-64, I've looked at using libunwind 0.99.

It mostly works for us, which is a relief, but I have a few concerns and patches and would really welcome your feedback.
#1) libunwind seems to be reliable but not 100% async-signal safe. In particular, if called from a signal handler (SIGPROF) at an inopportune time it may dead-lock. Specifically, if we get a profiling signal exactly while the dynamic linker is inside pthread_mutex_* or is already holding a lock, and libunwind then calls into dl_iterate_phdr() (NB: from the same thread that already holds or is acquiring the lock), bad things happen, usually a dead-lock.

I'm currently entertaining the theory that either walking the ELF headers in memory ourselves (without dl_iterate_phdr() and its locks) and risking a crash is better than dead-locking inside the dynamic linker, or I should try to discard profile signals while inside the dynamic linker (a rough sketch follows below).

Thoughts?
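To make the second option concrete, here is roughly the kind of guard I have in mind. It is an untested sketch, not anything in the patches; all names are mine, and it only covers explicit dlopen()/dlclose(), not lazy PLT resolution, which also takes the linker's lock:

    /* Interpose dlopen() and set a flag while the dynamic linker may be
       holding its lock; the SIGPROF handler drops samples that arrive
       while the flag is set.  In real code the flag should be per-thread
       (__thread) rather than a single global. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <signal.h>

    static volatile sig_atomic_t in_dynamic_linker = 0;

    void *dlopen (const char *file, int mode)
    {
      static void *(*real_dlopen) (const char *, int) = 0;
      if (! real_dlopen)
        real_dlopen = (void *(*) (const char *, int)) dlsym (RTLD_NEXT, "dlopen");

      in_dynamic_linker = 1;
      void *handle = real_dlopen (file, mode);
      in_dynamic_linker = 0;
      return handle;
    }

    static void profile_signal_handler (int sig, siginfo_t *info, void *ctx)
    {
      if (in_dynamic_linker)
        return;   /* drop the sample rather than risk a dead-lock */
      /* ... unw_getcontext / unw_init_local / unw_step loop ... */
    }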

#2) libunwind appears to make heavy use of sigprocmask(), e.g. around every mutex lock/unlock operation. This causes a colossal slow-down: two extra syscalls per lock round trip (sketched below). Removing those syscalls makes a massive performance difference, but I assume they were there for a reason? (Included in patch 2.)
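For reference, this is the pattern I mean, paraphrased from memory rather than quoted from the libunwind source; the second variant is what the patch effectively reduces it to:

    #include <pthread.h>
    #include <signal.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Current behaviour, roughly: block all signals around the mutex,
       costing two sigprocmask() syscalls per lock/unlock round trip. */
    static void locked_op_with_mask (const sigset_t *full_sigmask)
    {
      sigset_t saved_mask;
      sigprocmask (SIG_SETMASK, full_sigmask, &saved_mask);  /* syscall #1 */
      pthread_mutex_lock (&lock);
      /* ... critical section ... */
      pthread_mutex_unlock (&lock);
      sigprocmask (SIG_SETMASK, &saved_mask, NULL);          /* syscall #2 */
    }

    /* What the patch effectively does: just take the mutex. */
    static void locked_op_without_mask (void)
    {
      pthread_mutex_lock (&lock);
      /* ... critical section ... */
      pthread_mutex_unlock (&lock);
    }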
#3) libunwind resorts to scanning /proc/self/maps if it can't find an executable ELF image (find_binary_for_address). On a platform we use a lot (RHEL4-based) this always happens for the main executable, causing a scan of /proc/self/maps for every stack level in the main program, which is, ahem, slow :-) I made a hack-ish (not thread safe) fix for this: for one, determine the value just once, and for another, use readlink("/proc/self/exe") instead of scanning /proc/self/maps (sketched below). I've never seen a null pointer for anything other than the main program. Can that really happen in other circumstances? Do you do it the present way because of some JIT situation? (Included in patch 2.)
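The fix is essentially the following; this is a sketch of what the patch does, with made-up names, and the static buffer and flag are exactly why it is not thread safe:

    #include <unistd.h>
    #include <limits.h>

    /* Resolve the main executable's path once via readlink() instead of
       re-scanning /proc/self/maps for every frame in the main program. */
    static const char *main_executable_path (void)
    {
      static char path[PATH_MAX];
      static int initialized = 0;

      if (! initialized)
        {
          ssize_t n = readlink ("/proc/self/exe", path, sizeof (path) - 1);
          path[n > 0 ? n : 0] = '\0';
          initialized = 1;
        }
      return path[0] ? path : NULL;
    }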
#4) We appear to be blessed with a lot of libraries which have insufficient DWARF unwind info, the RHEL4 GLIBC in particular by the looks of it. It looks like dwarf_find_save_locs() caches recent uses, but falls back on a full search if the entry is not found in the cache. It turned out that in our case the cache was hardly used, because it didn't cache negative results. I added some code to rs_new() to remember "bogus" (negative) results, and code in dwarf_find_save_locs() to cache negative replies from fetch_proc_info() and create_state_record_for(), and it helped performance a lot. However, I am really unsure whether I did it correctly and would appreciate another pair of eyes over this (a rough sketch of the idea is below). (Included in patch 2, at the bottom.)
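Conceptually the change amounts to something like this; I am paraphrasing with made-up names, the real patch works on libunwind's own cache structures:

    /* Remember instruction pointers for which fetch_proc_info() or
       create_state_record_for() has already failed, so the next unwind
       through the same IP fails fast instead of redoing the full search. */
    #define NEG_CACHE_SIZE 256            /* size picked for illustration */

    static unsigned long neg_cache[NEG_CACHE_SIZE];

    static int neg_cache_lookup (unsigned long ip)
    {
      return neg_cache[ip % NEG_CACHE_SIZE] == ip;
    }

    static void neg_cache_insert (unsigned long ip)
    {
      neg_cache[ip % NEG_CACHE_SIZE] = ip;
    }

    /* In the dwarf_find_save_locs() path, conceptually:
         if (neg_cache_lookup (ip))
           return the "no info" error immediately;
         if fetch_proc_info() or create_state_record_for() fails:
           neg_cache_insert (ip), then return the error as before.   */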
With all these changes libunwind is better, but we still pay a very heavy penalty for unwinding: it's about 150 times slower than our IA-32 unwinding code, and about 5-10 times slower in real-world apps (tracking every memory allocation at ~1M allocations per second). I do realise x86-64 is much harder to unwind, but I am looking for ways to optimise it further; some ideas are below.

I realise our situation is pushing it a little, what with multiple threads, ~200 MB of code and a VSIZE of 1-3 GB.
Patches:

1) C++ compatibility clean-up

http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/COMP/CMSDIST/libunwind-cleanup.patch?revision=1.1&view=markup
2) Various performance optimisations

http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/COMP/CMSDIST/libunwind-optimise.patch?revision=1.5&view=markup
Ideas:

1) Use bigger caches (DWARF_LOG_UNW_CACHE_SIZE).

2) Try to determine which frames are "varying", i.e. use VLAs or alloca(). If there are none in the call stack, just cache the incoming vs. outgoing CFA difference for every call site and unwind using only that. Otherwise revert to the slow unwind, at least until you get past the varying frames. Specifically: walk from the top, probing a cache for the CFA delta plus a "varying" marker. If you make it all the way to the top with the cache, return the call stack. If not, switch back to the normal slow unwind, update the cache, and go all the way to the top. Alas, I currently have no idea how to identify alloca/VLA-using frames. Any ideas? A rough sketch of the fast path follows.
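To make idea 2 a bit more concrete, here is a sketch of the fast path under those assumptions; every name is made up, and the RA-at-CFA-8 read assumes the standard x86-64 ABI frame layout:

    #include <stdint.h>
    #include <stddef.h>

    struct cfa_delta_entry
    {
      uint64_t ret_addr;    /* call site (return address) */
      int64_t  cfa_delta;   /* outgoing CFA minus incoming CFA */
      int      varying;     /* frame uses alloca()/VLAs: no fixed delta */
    };

    /* Hypothetical hash lookup keyed by return address. */
    extern struct cfa_delta_entry *cfa_cache_lookup (uint64_t ret_addr);

    /* Returns the number of frames collected, or -1 meaning "fall back
       to the full DWARF unwind and refill the cache along the way". */
    static int fast_unwind (uint64_t cfa, uint64_t ret_addr,
                            uint64_t *stack, int max_depth)
    {
      int depth = 0;
      while (depth < max_depth && ret_addr != 0)
        {
          struct cfa_delta_entry *e = cfa_cache_lookup (ret_addr);
          if (e == NULL || e->varying)
            return -1;                           /* punt to the slow unwind */
          stack[depth++] = ret_addr;
          cfa += e->cfa_delta;
          ret_addr = *(uint64_t *) (cfa - 8);    /* x86-64: RA sits at CFA-8 */
        }
      return depth;
    }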
Lassi



