libunwind-devel

[Libunwind-devel] libunwind x86-64 optimisations?


From: Lassi Tuura
Subject: [Libunwind-devel] libunwind x86-64 optimisations?
Date: Mon, 6 Jul 2009 13:24:14 +0200

Hi,

I am one of the maintainers of a performance and memory-use profiling package called igprof. We've worked hard at providing a low-overhead profiler for big software, and think we have done a fairly decent job on IA-32 Linux.

In order to support x86-64, I've looked at using libunwind 0.99.

It seems to work mostly for us, which is a relief, but I have a few concerns and patches and would really welcome your feedback.

#1) libunwind seems to be reliable, but not 100% async-signal safe. In particular, if called from a signal handler (SIGPROF) at an inopportune time it may deadlock. Specifically, if we get a profiling signal exactly when the dynamic linker is inside pthread_mutex_* or is already holding a lock, and libunwind calls into dl_iterate_phdr() (NB: from the same thread that already holds the lock or is trying to take it), bad things happen, usually a deadlock.

I'm currently entertaining the theory that either walking the ELF headers in memory directly (without dl_iterate_phdr() and its locks) is less likely to crash than the current code is to deadlock inside the dynamic linker, or I should try to discard profiling signals while inside the dynamic linker.
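For what it's worth, here is a sketch of the second option, dropping samples whose PC lands inside the dynamic linker. It assumes the linker's text range can be captured once at startup (before profiling signals are enabled); the range variables and their initialisation are hypothetical, not anything libunwind provides:

  /* Sketch only: discard a profiling sample if the interrupted PC is inside
     the dynamic linker, on the assumption it may be holding its lock.
     ld_so_start/ld_so_end would be filled in once at startup, e.g. from a
     single dl_iterate_phdr() pass before the SIGPROF handler is installed. */
  #define _GNU_SOURCE
  #include <signal.h>
  #include <stdint.h>
  #include <ucontext.h>

  static uintptr_t ld_so_start, ld_so_end;  /* text range of ld.so, set at init */

  static void sigprof_handler (int sig, siginfo_t *info, void *uc_void)
  {
    ucontext_t *uc = (ucontext_t *) uc_void;
    uintptr_t pc = (uintptr_t) uc->uc_mcontext.gregs[REG_RIP];  /* x86-64 */

    if (pc >= ld_so_start && pc < ld_so_end)
      return;                               /* drop the sample, don't unwind */

    /* ... otherwise proceed with unw_init_local() / unw_step() ... */
    (void) sig; (void) info;
  }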

Thoughts?

#2) libunwind appears to make heavy use of sigprocmask(), e.g. around every mutex lock/unlock operation. This causes a colossal slow-down with two syscalls per trip. Removing those syscalls makes a massive performance difference, but I assume they were there for a reason? (Included in patch 2.)
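To make it concrete, the pattern I mean looks roughly like this (illustrative only, not libunwind's actual code; the function name is made up):

  /* Each trip through the lock costs two sigprocmask() system calls on top
     of the mutex itself. */
  #include <pthread.h>
  #include <signal.h>

  static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

  static void locked_cache_update (void)
  {
    sigset_t all, saved;
    sigfillset (&all);
    sigprocmask (SIG_SETMASK, &all, &saved);   /* syscall #1 */
    pthread_mutex_lock (&cache_lock);
    /* ... touch the shared cache ... */
    pthread_mutex_unlock (&cache_lock);
    sigprocmask (SIG_SETMASK, &saved, NULL);   /* syscall #2 */
  }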

#3) libunwind resorts to scanning /proc/self/maps if it can't find an executable ELF image (find_binary_for_address). On a platform we use a lot (RHEL4-based) this always happens for the main executable, causing a scan of /proc/self/maps for every stack level in the main program, which is, ahem, slow :-) I made a hack-ish (not thread safe) fix for this: one change to determine the value just once, and another to use readlink(/proc/self/exe) instead of scanning /proc/self/maps. I've never seen a null pointer for anything other than the main program. Can that really happen in other circumstances? Do you do it the present way because of some JIT situation? (Included in patch 2.)
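The shortcut looks roughly like this (a sketch of the idea rather than the actual patch; as noted, the one-shot caching is not thread safe):

  /* Resolve the main executable's path once via /proc/self/exe instead of
     re-scanning /proc/self/maps for every stack level. */
  #include <unistd.h>

  static const char *
  main_exe_path (void)
  {
    static char path[4096];
    static int have_path = 0;               /* not thread safe */
    if (!have_path)
      {
        ssize_t n = readlink ("/proc/self/exe", path, sizeof (path) - 1);
        if (n > 0)
          {
            path[n] = '\0';
            have_path = 1;
          }
      }
    return have_path ? path : NULL;
  }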

#4) We appear to be blessed with a lot of libraries which have insufficient DWARF unwind info, the RHEL4 GLIBC in particular by the looks of it. It looks like dwarf_find_save_locs() caches recent uses, but falls back on a full search if an entry is not found in the cache. It turned out that in our case the cache was hardly used, because it didn't cache negative results. I added some code to rs_new() to remember "bogus" (negative) results, and code in dwarf_find_save_locs() to cache negative replies from fetch_proc_info() and create_state_record_for(), and it helped performance a lot. However, I am really unsure whether I did it correctly and would appreciate another pair of eyes over this. (Included in patch 2, at the bottom.)
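The shape of the change is roughly this (an illustrative sketch, not the actual patch; the negative cache here is a trivial direct-mapped table rather than libunwind's rs_cache):

  /* Remember instruction pointers for which fetch_proc_info() /
     create_state_record_for() already failed, so the expensive full
     search runs only once per "bogus" IP. */
  #include <stdint.h>

  #define NEG_CACHE_SIZE 1024                 /* illustrative size */
  static uint64_t neg_cache[NEG_CACHE_SIZE];  /* 0 = empty slot */

  static int
  known_bogus (uint64_t ip)
  {
    return neg_cache[ip % NEG_CACHE_SIZE] == ip;
  }

  static void
  remember_bogus (uint64_t ip)
  {
    neg_cache[ip % NEG_CACHE_SIZE] = ip;
  }

  /* In the dwarf_find_save_locs() slow path, roughly:
   *
   *   if (known_bogus (ip))
   *     return -UNW_ENOINFO;
   *   if (fetch_proc_info (...) < 0
   *       || create_state_record_for (...) < 0)
   *     {
   *       remember_bogus (ip);
   *       return -UNW_ENOINFO;
   *     }
   */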

With all these changes libunwind is better, but we still see a very heavy penalty for unwinding: it's about 150 times slower than our IA-32 unwinding code, and about 5-10 times slower in real-world apps (tracking every memory allocation at ~1M allocations per second). I do realise x86-64 is much harder to unwind, but I am looking for ways to optimise it further; some ideas below.

I realise our situation is pushing it a bit, what with multiple threads, ~200 MB of code and a VSIZE of 1-3 GB.

Patches:

1) C++ compatibility clean-up

http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/COMP/CMSDIST/libunwind-cleanup.patch?revision=1.1&view=markup

2) Various performance optimisations

http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/COMP/CMSDIST/libunwind-optimise.patch?revision=1.5&view=markup

Ideas:

1) Use bigger caches (DWARF_LOG_UNW_CACHE_SIZE).

2) Try to determine which frames are "varying", i.e. use VLAs or alloca(). If there are none in the call stack, just cache the incoming-CFA vs. outgoing-CFA difference for every call site, and unwind using only those deltas. Otherwise revert to the slow unwind at least until you get past the varying frames. Specifically, walk from the top, probing a cache for the CFA delta plus a "varying" marker. If you make it all the way to the top with the cache, return the call stack. If not, switch back to the normal slow unwind, update the cache, and go all the way to the top. Alas, I currently have no idea how to identify alloca/VLA-using frames. Any ideas?
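A very rough sketch of the fast path I have in mind follows; all the helpers are hypothetical stubs so the sketch stands alone, and a real version would hook into the DWARF unwinder and its caches instead:

  #include <stddef.h>
  #include <stdint.h>

  struct cfa_entry
  {
    uint64_t ret_addr;    /* key: return address identifying the call site */
    int64_t  cfa_delta;   /* outgoing CFA minus incoming CFA */
    int      varying;     /* frame may use alloca()/VLAs: delta unreliable */
    int      valid;
  };

  /* Hypothetical helpers, stubbed out here; a real implementation would
     consult and drive the normal DWARF unwind. */
  static struct cfa_entry *
  cfa_delta_lookup (uint64_t ret_addr)
  {
    (void) ret_addr;
    return NULL;                            /* stub: always a cache miss */
  }

  static int
  slow_unwind_step (uint64_t *pc, uint64_t *cfa)
  {
    (void) pc; (void) cfa;
    return 0;                               /* stub: no more frames */
  }

  static int
  fast_unwind (uint64_t pc, uint64_t cfa, uint64_t *stack, int max_depth)
  {
    int depth = 0;
    while (depth < max_depth && pc != 0)
      {
        struct cfa_entry *e;
        stack[depth++] = pc;
        e = cfa_delta_lookup (pc);
        if (!e || !e->valid || e->varying)
          {
            /* Cache miss or varying frame: fall back to the slow unwind
               for this step (which also refreshes the cache). */
            if (slow_unwind_step (&pc, &cfa) <= 0)
              break;
            continue;
          }
        cfa += e->cfa_delta;                /* fast path: cached delta */
        pc = *(const uint64_t *) (cfa - 8); /* return address sits below CFA */
      }
    return depth;
  }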

Lassi



