From: Lassi Tuura
Subject: Re: [Libunwind-devel] Updated fast trace patch with initial performance results
Date: Thu, 24 Mar 2011 12:56:14 +0100
Hey there,
I have additional patches to amend the previous ones for your review.
(A) Resets hash table use count to zero on expansion - bug fix.
(B) Performance optimisations to speed up x86_64 fast trace.
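To illustrate the kind of bug (A) addresses, here is a minimal sketch. It
assumes a cache-style hash table whose expansion discards the old entries
rather than rehashing them, so the use count has to drop back to zero. The
names cache_t, cache_init, cache_expand and cache_insert are hypothetical and
are not taken from the patch or from libunwind.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical address cache; not libunwind's actual data structure. */
typedef struct {
    uintptr_t *slots;   /* 0 marks an empty slot */
    size_t log_size;    /* the table holds 1 << log_size slots */
    size_t used;        /* number of occupied slots */
} cache_t;

static int cache_init(cache_t *c, size_t log_size)
{
    c->slots = calloc((size_t)1 << log_size, sizeof *c->slots);
    if (!c->slots)
        return -1;
    c->log_size = log_size;
    c->used = 0;
    return 0;
}

/* Grow the table. Old entries are dropped (assumed cache semantics),
   so the use count must be reset along with the storage. */
static int cache_expand(cache_t *c)
{
    size_t new_log = c->log_size + 2;
    uintptr_t *fresh = calloc((size_t)1 << new_log, sizeof *fresh);
    if (!fresh)
        return -1;
    free(c->slots);
    c->slots = fresh;
    c->log_size = new_log;
    c->used = 0;        /* without this, the stale count triggers
                           another expansion on the next insert */
    return 0;
}

/* Insert a non-zero key using linear probing; expand when half full. */
static int cache_insert(cache_t *c, uintptr_t key)
{
    assert(key != 0);
    if (c->used >= ((size_t)1 << c->log_size) / 2 && cache_expand(c) != 0)
        return -1;

    size_t mask = ((size_t)1 << c->log_size) - 1;
    for (size_t i = (size_t)(key >> 3) & mask; ; i = (i + 1) & mask) {
        if (c->slots[i] == key)
            return 0;               /* already cached */
        if (c->slots[i] == 0) {
            c->slots[i] = key;
            c->used++;
            return 0;
        }
    }
}
```

If the reset is left out of this sketch, the count stays above the threshold
after the first expansion, so every later insert expands the table again.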
With these two patches, plus compiling libunwind with -O3 instead of -O2, the
fast trace now spends ~55 clock cycles per stack trace level in my tests. Here
are the new BZ test results in the same format as before, compared against the
bare application (NI) and the previous patch (BT).
|      Time   %     VSIZE  RSS   Walks  Depth       Walk Time    Per Stack Level
|-/NI  405s   100%  1548   1251  -      - / -       - / -        - / -
|
|P/BT  410s   101%  1593   1273  68154  29.1 / 7.6  65k / 105k   3111 / 7838
|P/BZ  409s   101%  1593   1272  67867  29.1 / 7.7  61k / 101k   2941 / 7479
|
|M/BT  1251s  309%  2815   2501  387M   28.0 / 6.3  2313 / 2650  84.5 / 114.3
|M/BZ  1088s  269%  2812   2498  387M   28.0 / 6.3  1502 / 2570  55.0 / 222.4
(B) resulted from hardware profiling with Intel PTU. If you want, you can
browse the source and assembly code with basic line-by-line PTU counts at (C).
The results are from a 12-core Intel Xeon L5640 2.27 GHz with hyper-threading
on ("24 cores") and 24 GB RAM.
(B) adds static branch prediction predicates. I hoped to avoid littering the
code by using -fprofile-{generate,use} instead, but got link errors. It looks
like the GCC I used (4.3.4) isn't happy combining profile feedback and symbol
visibility, and generates bad relocations. I manually verified many of the
static predicates - they do improve performance, at least on my test systems.
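For readers unfamiliar with static predicates, here is a minimal sketch. It
assumes they take the usual GCC form built on __builtin_expect. The function
find_index is a hypothetical example and is not code from the patch.

```c
#include <stddef.h>

/* Branch hints: tell GCC which outcome to lay out as the fall-through
   path. On other compilers the macros expand to the plain condition. */
#if defined(__GNUC__)
# define likely(x)   __builtin_expect(!!(x), 1)
# define unlikely(x) __builtin_expect(!!(x), 0)
#else
# define likely(x)   (x)
# define unlikely(x) (x)
#endif

/* Hypothetical lookup: the NULL check is marked as the rare case so the
   search loop stays on the hot path. Returns the index or -1. */
static int find_index(const unsigned long *keys, size_t n, unsigned long key)
{
    if (unlikely(keys == NULL))
        return -1;

    for (size_t i = 0; i < n; i++)
        if (keys[i] == key)
            return (int)i;

    return -1;
}
```

The hints only affect code layout and never change results, so a wrong hint
costs performance and nothing else. That is why each one is worth checking
against measurements.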
It's time I shift focus to improve the internals of our own profiler now... Let
me know how you feel about these libunwind changes.
Regards,
Lassi
(A) Attached 01-reset-used-to-zero.patch
(B) Attached 02-performance-optimisations.patch
(C) PTU results for (B)
http://cern.ch/lat/cmssw/ptuview/igprof-bz/all/basic_sampling
Earlier progressive trials; click on functions for links to the code used.
The runs differ from each other in length, but are still useful for comparisons.
http://cern.ch/lat/cmssw/ptuview/igprof-by/all/basic_sampling
http://cern.ch/lat/cmssw/ptuview/igprof-bx/all/basic_sampling
http://cern.ch/lat/cmssw/ptuview/igprof-bw/all/basic_sampling