libunwind-devel

[Libunwind-devel] libunwind x86-64 optimisations?


From: Lassi Tuura
Subject: [Libunwind-devel] libunwind x86-64 optimisations?
Date: Mon, 6 Jul 2009 13:24:14 +0200

Hi,

I am one of the maintainers of a performance and memory-use profiling package called igprof. We've worked hard at providing a low-overhead profiler for big software, and think we have done a fairly decent job on IA-32 Linux.

In order to support x86-64, I've looked at using libunwind 0.99.

It seems to work mostly for us, which is a relief, but I have a few concerns and patches and would really welcome your feedback.

#1) libunwind seems to be reliable, but not 100% async-signal safe. In particular, if called from a signal handler (SIGPROF) at an inopportune time it may deadlock. Specifically, if we get a profiling signal exactly when the dynamic linker is inside pthread_mutex_* or is already holding a lock, and libunwind calls into dl_iterate_phdr() (NB: from the same thread that already holds the lock or is trying to take it), bad things happen, usually a deadlock.

I'm currently entertaining the theory that either walking the ELF headers in memory directly (without dl_iterate_phdr() and its locks) is less likely to crash than the current code is to deadlock inside the dynamic linker, or I should try to discard profiling signals while inside the dynamic linker.
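For what it's worth, here is a sketch of the second option, dropping samples whose PC lands inside the dynamic linker. It assumes the linker's text range can be captured once at startup (before profiling signals are enabled); the range variables and their initialisation are hypothetical, not anything libunwind provides:

  /* Sketch only: discard a profiling sample if the interrupted PC is inside
     the dynamic linker, on the assumption it may be holding its lock.
     ld_so_start/ld_so_end would be filled in once at startup, e.g. from a
     single dl_iterate_phdr() pass before the SIGPROF handler is installed. */
  #define _GNU_SOURCE
  #include <signal.h>
  #include <stdint.h>
  #include <ucontext.h>

  static uintptr_t ld_so_start, ld_so_end;  /* text range of ld.so, set at init */

  static void sigprof_handler (int sig, siginfo_t *info, void *uc_void)
  {
    ucontext_t *uc = (ucontext_t *) uc_void;
    uintptr_t pc = (uintptr_t) uc->uc_mcontext.gregs[REG_RIP];  /* x86-64 */

    if (pc >= ld_so_start && pc < ld_so_end)
      return;                               /* drop the sample, don't unwind */

    /* ... otherwise proceed with unw_init_local() / unw_step() ... */
    (void) sig; (void) info;
  }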

Thoughts?

#2) libunwind appears to make heavy use of sigprocmask(), e.g. around every mutex lock/unlock operation. This causes a colossal slow-down with two syscalls per trip. Removing those syscalls makes a massive performance difference, but I assume they were there for a reason? (Included in patch 2.)
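To make it concrete, the pattern I mean looks roughly like this (illustrative only, not libunwind's actual code; the function name is made up):

  /* Each trip through the lock costs two sigprocmask() system calls on top
     of the mutex itself. */
  #include <pthread.h>
  #include <signal.h>

  static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

  static void locked_cache_update (void)
  {
    sigset_t all, saved;
    sigfillset (&all);
    sigprocmask (SIG_SETMASK, &all, &saved);   /* syscall #1 */
    pthread_mutex_lock (&cache_lock);
    /* ... touch the shared cache ... */
    pthread_mutex_unlock (&cache_lock);
    sigprocmask (SIG_SETMASK, &saved, NULL);   /* syscall #2 */
  }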

#3) libunwind resorts to scanning /proc/self/maps if it can't find an executable ELF image (find_binary_for_address). On a platform we use a lot (RHEL4-based) this always happens for the main executable, causing a scan of /proc/self/maps for every stack level in the main program, which is, ahem, slow :-) I made a hack-ish (not thread safe) fix for this: one change to determine the value just once, and another to use readlink(/proc/self/exe) instead of scanning /proc/self/maps. I've never seen a null pointer for anything other than the main program. Can that really happen in other circumstances? Do you do it the present way because of some JIT situation? (Included in patch 2.)
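The shortcut looks roughly like this (a sketch of the idea rather than the actual patch; as noted, the one-shot caching is not thread safe):

  /* Resolve the main executable's path once via /proc/self/exe instead of
     re-scanning /proc/self/maps for every stack level. */
  #include <unistd.h>

  static const char *
  main_exe_path (void)
  {
    static char path[4096];
    static int have_path = 0;               /* not thread safe */
    if (!have_path)
      {
        ssize_t n = readlink ("/proc/self/exe", path, sizeof (path) - 1);
        if (n > 0)
          {
            path[n] = '\0';
            have_path = 1;
          }
      }
    return have_path ? path : NULL;
  }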

#4) We appear to be blessed with a lot of libraries which have insufficient DWARF unwind info, the RHEL4 GLIBC in particular by the looks of it. It looks like dwarf_find_save_locs() caches recent uses, but falls back on a full search if an entry is not found in the cache. It turned out that in our case the cache was hardly used, because it didn't cache negative results. I added some code to rs_new() to remember "bogus" (negative) results, and code in dwarf_find_save_locs() to cache negative replies from fetch_proc_info() and create_state_record_for(), and it helped performance a lot. However, I am really unsure whether I did it correctly and would appreciate another pair of eyes over this. (Included in patch 2, at the bottom.)
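The shape of the change is roughly this (an illustrative sketch, not the actual patch; the negative cache here is a trivial direct-mapped table rather than libunwind's rs_cache):

  /* Remember instruction pointers for which fetch_proc_info() /
     create_state_record_for() already failed, so the expensive full
     search runs only once per "bogus" IP. */
  #include <stdint.h>

  #define NEG_CACHE_SIZE 1024                 /* illustrative size */
  static uint64_t neg_cache[NEG_CACHE_SIZE];  /* 0 = empty slot */

  static int
  known_bogus (uint64_t ip)
  {
    return neg_cache[ip % NEG_CACHE_SIZE] == ip;
  }

  static void
  remember_bogus (uint64_t ip)
  {
    neg_cache[ip % NEG_CACHE_SIZE] = ip;
  }

  /* In the dwarf_find_save_locs() slow path, roughly:
   *
   *   if (known_bogus (ip))
   *     return -UNW_ENOINFO;
   *   if (fetch_proc_info (...) < 0
   *       || create_state_record_for (...) < 0)
   *     {
   *       remember_bogus (ip);
   *       return -UNW_ENOINFO;
   *     }
   */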

With all these changes libunwind is better, but we still see a very heavy penalty for unwinding: it's about 150 times slower than our IA-32 unwinding code, and about 5-10 times slower in real-world apps (tracking every memory allocation at ~1M allocations per second). I do realise x86-64 is much harder to unwind, but I am looking for ways to optimise it further; some ideas below.

I realise our situation is pushing it a bit, what with multiple threads, ~200 MB of code and a VSIZE of 1-3 GB.

Patches:

1) C++ compatibility clean-up

http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/COMP/CMSDIST/libunwind-cleanup.patch?revision=1.1&view=markup

2) Various performance optimisations

http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/COMP/CMSDIST/libunwind-optimise.patch?revision=1.5&view=markup

Ideas:

1) Use bigger caches (DWARF_LOG_UNW_CACHE_SIZE).

2) Try to determine which frames are "varying", i.e. use VLAs or alloca(). If there are none in the call stack, just cache the incoming-CFA vs. outgoing-CFA difference for every call site, and unwind using only those deltas. Otherwise revert to the slow unwind at least until you get past the varying frames. Specifically, walk from the top, probing a cache for the CFA delta plus a "varying" marker. If you make it all the way to the top with the cache, return the call stack. If not, switch back to the normal slow unwind, update the cache, and go all the way to the top. Alas, I currently have no idea how to identify alloca/VLA-using frames. Any ideas?
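A very rough sketch of the fast path I have in mind follows; all the helpers are hypothetical stubs so the sketch stands alone, and a real version would hook into the DWARF unwinder and its caches instead:

  #include <stddef.h>
  #include <stdint.h>

  struct cfa_entry
  {
    uint64_t ret_addr;    /* key: return address identifying the call site */
    int64_t  cfa_delta;   /* outgoing CFA minus incoming CFA */
    int      varying;     /* frame may use alloca()/VLAs: delta unreliable */
    int      valid;
  };

  /* Hypothetical helpers, stubbed out here; a real implementation would
     consult and drive the normal DWARF unwind. */
  static struct cfa_entry *
  cfa_delta_lookup (uint64_t ret_addr)
  {
    (void) ret_addr;
    return NULL;                            /* stub: always a cache miss */
  }

  static int
  slow_unwind_step (uint64_t *pc, uint64_t *cfa)
  {
    (void) pc; (void) cfa;
    return 0;                               /* stub: no more frames */
  }

  static int
  fast_unwind (uint64_t pc, uint64_t cfa, uint64_t *stack, int max_depth)
  {
    int depth = 0;
    while (depth < max_depth && pc != 0)
      {
        struct cfa_entry *e;
        stack[depth++] = pc;
        e = cfa_delta_lookup (pc);
        if (!e || !e->valid || e->varying)
          {
            /* Cache miss or varying frame: fall back to the slow unwind
               for this step (which also refreshes the cache). */
            if (slow_unwind_step (&pc, &cfa) <= 0)
              break;
            continue;
          }
        cfa += e->cfa_delta;                /* fast path: cached delta */
        pc = *(const uint64_t *) (cfa - 8); /* return address sits below CFA */
      }
    return depth;
  }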

Lassi



