Re: [Libunwind-devel] Updated fast trace patch with initial performance


From: Lassi Tuura
Subject: Re: [Libunwind-devel] Updated fast trace patch with initial performance results
Date: Thu, 24 Mar 2011 12:56:14 +0100

Hey there,

I have two additional patches, amending the previous ones, for your review:
(A) Resets the hash table use count to zero on expansion (bug fix).
(B) Performance optimisations to speed up the x86_64 fast trace.
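
To illustrate the kind of bug (A) addresses, here is a minimal sketch, not the
actual patch. It assumes an open-addressing cache whose entries are simply
dropped when the table is doubled; the names cache_t, cache_expand and
cache_put are hypothetical and not taken from libunwind. If the use count
were not reset on expansion, it would keep counting entries that no longer
exist.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cache layout; names are illustrative, not libunwind's. */
typedef struct {
  uintptr_t key;    /* 0 marks an empty slot, so keys must be non-zero */
  int value;
} slot_t;

typedef struct {
  slot_t *slots;
  size_t log_size;  /* table holds 1 << log_size slots */
  size_t used;      /* number of occupied slots */
} cache_t;

static int cache_init(cache_t *c, size_t log_size) {
  c->slots = calloc((size_t)1 << log_size, sizeof(slot_t));
  if (!c->slots) return -1;
  c->log_size = log_size;
  c->used = 0;
  return 0;
}

/* Double the table. Old entries are discarded (assumed acceptable for a
   cache), so the use count has to go back to zero as well. */
static int cache_expand(cache_t *c) {
  size_t new_log = c->log_size + 1;
  slot_t *fresh = calloc((size_t)1 << new_log, sizeof(slot_t));
  if (!fresh) return -1;
  free(c->slots);
  c->slots = fresh;
  c->log_size = new_log;
  c->used = 0;      /* the reset this sketch is about */
  return 0;
}

/* Insert or update a key, expanding first if the table is half full. */
static int cache_put(cache_t *c, uintptr_t key, int value) {
  if (c->used >= ((size_t)1 << c->log_size) / 2 && cache_expand(c) < 0)
    return -1;
  size_t mask = ((size_t)1 << c->log_size) - 1;
  size_t i = (key * 0x9e3779b9u) & mask;
  while (c->slots[i].key != 0 && c->slots[i].key != key)
    i = (i + 1) & mask;           /* linear probing */
  if (c->slots[i].key == 0)
    c->used++;                    /* only new entries bump the count */
  c->slots[i].key = key;
  c->slots[i].value = value;
  return 0;
}
```

With the reset in place, the use count never exceeds half the table size, so
the expansion check stays meaningful after every resize.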

With these patches, plus compiling libunwind with -O3 instead of -O2, the fast
trace now spends ~55 clock cycles per stack trace level in my tests. Here are
the new test results (BZ) in the same format I used before, compared against
the bare application (NI) and the previous patch (BT).

|      Time  %     VSIZE  RSS   Walks     Depth     Walk Time   Per Stack Level
|-/NI   405s 100%  1548   1251  -         - / -       - / -         - / -
|
|P/BT   410s 101%  1593   1273  68154  29.1 / 7.6   65k / 105k   3111 / 7838
|P/BZ   409s 101%  1593   1272  67867  29.1 / 7.7   61k / 101k   2941 / 7479
|
|M/BT  1251s 309%  2815   2501  387M   28.0 / 6.3  2313 / 2650   84.5 / 114.3
|M/BZ  1088s 269%  2812   2498  387M   28.0 / 6.3  1502 / 2570   55.0 / 222.4

(B) is the result of hardware profiling with Intel PTU. If you like, you can
browse the source and assembly code with basic line-by-line PTU counts at (C).
The results are from a 12-core Intel Xeon L5640 2.27 GHz system with
hyper-threading enabled ("24 cores") and 24 GB RAM.

(B) adds static branch prediction predicates. I had hoped to avoid littering
the code with them by using -fprofile-{generate,use} instead, but got link
errors. It looks like the GCC I used (4.3.4) isn't happy combining profile
feedback with symbol visibility, and generates bad relocations. I manually
verified many of the static predicates: they do improve performance, at least
on my test systems.
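
For readers unfamiliar with static branch prediction predicates, here is a
minimal sketch of the usual GCC idiom, built on the __builtin_expect builtin.
The macro definitions and the count_frames function are illustrative
assumptions and are not copied from the patch.

```c
#include <stddef.h>

/* __builtin_expect is a GCC/Clang builtin; fall back to a no-op elsewhere.
   These definitions are an illustration, not the ones in the patch. */
#if defined(__GNUC__)
# define likely(x)   __builtin_expect(!!(x), 1)
# define unlikely(x) __builtin_expect(!!(x), 0)
#else
# define likely(x)   (x)
# define unlikely(x) (x)
#endif

/* Hypothetical hot loop: count return addresses until a NULL terminator.
   Hitting the terminator happens once per walk, so it is marked unlikely
   and the compiler can lay out the common path as the fall-through. */
static size_t count_frames(void *const *ips, size_t max) {
  size_t n = 0;
  while (n < max) {
    if (unlikely(ips[n] == NULL))
      break;
    n++;
  }
  return n;
}
```

The hints only change code layout and never the result, so a wrong guess
costs performance but not correctness.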

It's time for me to shift focus to improving the internals of our own profiler
now... Let me know how you feel about these libunwind changes.

Regards,
Lassi

(A) Attached 01-reset-used-to-zero.patch
(B) Attached 02-performance-optimisations.patch
(C) PTU results for (B)

      http://cern.ch/lat/cmssw/ptuview/igprof-bz/all/basic_sampling

    Earlier progressive trials; click on functions for links to the code used.
    The runs differ from each other in length but are still useful for
    comparisons.

      http://cern.ch/lat/cmssw/ptuview/igprof-by/all/basic_sampling
      http://cern.ch/lat/cmssw/ptuview/igprof-bx/all/basic_sampling
      http://cern.ch/lat/cmssw/ptuview/igprof-bw/all/basic_sampling


