From: Robert Schöne
Subject: [Libunwind-devel] Question about performance of threaded access in libunwind
Date: Thu, 06 Oct 2016 12:55:52 +0200

Hello,

Could it be that unwinding does not work well with threading?

I run an Intel dual-core system with Hyperthreading on Ubuntu 16.04 and
patched tests/Gperf-trace.c so that this part

  unw_set_caching_policy (unw_local_addr_space, UNW_CACHE_NONE);
  doit ("no cache        ");

  unw_set_caching_policy (unw_local_addr_space, UNW_CACHE_GLOBAL);
  doit ("global cache    ");

  unw_set_caching_policy (unw_local_addr_space, UNW_CACHE_PER_THREAD);
  doit ("per-thread cache");

is executed in parallel by multiple threads:

  unw_set_caching_policy (unw_local_addr_space, UNW_CACHE_NONE);
#pragma omp parallel
{
  doit ("no cache        ");
}
  unw_set_caching_policy (unw_local_addr_space, UNW_CACHE_GLOBAL);
#pragma omp parallel
{
  doit ("global cache    ");
}
  unw_set_caching_policy (unw_local_addr_space, UNW_CACHE_PER_THREAD);
#pragma omp parallel
{
  doit ("per-thread cache");
}
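
(Gperf-trace's doit() is not shown above; as a rough mental model, each
thread repeatedly walks its own call stack with the local unwinder. The
following is only my own simplified stand-in, not the actual test code:)

#define UNW_LOCAL_ONLY          /* local (same-process) unwinding only */
#include <libunwind.h>
#include <omp.h>
#include <stdio.h>

/* Simplified stand-in for doit(): walk the calling thread's own stack. */
static int
walk_stack (void)
{
  unw_context_t uc;
  unw_cursor_t cursor;
  int depth = 0;

  unw_getcontext (&uc);            /* capture current register state */
  unw_init_local (&cursor, &uc);   /* cursor for the local address space */
  while (unw_step (&cursor) > 0)   /* step outward, one frame at a time */
    ++depth;
  return depth;
}

int
main (void)
{
  unw_set_caching_policy (unw_local_addr_space, UNW_CACHE_PER_THREAD);

#pragma omp parallel
  {
    int i, depth = 0;

    for (i = 0; i < 100000; i++)   /* repeat to get stable timings */
      depth = walk_stack ();
    printf ("thread %d walked %d frames\n", omp_get_thread_num (), depth);
  }
  return 0;
}

(Built e.g. with: gcc sketch.c -fopenmp -lunwind -o sketch)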


With this modification, the benchmark prints one line per active
thread. The number of threads can be set with the environment variable
OMP_NUM_THREADS. I compile the test with:

  gcc Gperf-trace.c -I../include/ -lunwind-x86_64 -lunwind -fopenmp \
      -o Gperf-trace_omp

The original result, which I also get with OMP_NUM_THREADS set to 1,
is:

unw_getcontext : cold avg=   50.068 nsec, warm avg=   28.610 nsec
unw_init_local : cold avg=  138.283 nsec, warm avg=   38.147 nsec
no cache        : unw_step : 1st= 3589.648 min=  354.286 avg=  368.953 nsec
global cache    : unw_step : 1st=  523.630 min=  354.286 avg=  365.754 nsec
per-thread cache: unw_step : 1st=  532.542 min=  354.286 avg=  364.794 nsec

For every thread that I add, the reported times increase, regardless of
the selected caching policy.

2 threads:
no cache        : unw_step : 1st= 5454.929 min=  582.801 avg=  660.841 nsec
no cache        : unw_step : 1st= 5551.434 min=  393.719 avg=  656.303 nsec
global cache    : unw_step : 1st=  843.295 min=  582.801 avg=  652.083 nsec
global cache    : unw_step : 1st=  761.190 min=  393.719 avg=  648.359 nsec
per-thread cache: unw_step : 1st=  860.956 min=  593.839 avg=  658.977 nsec
per-thread cache: unw_step : 1st=  763.377 min=  402.468 avg=  654.147 nsec

3 threads:
no cache        : unw_step : 1st= 6794.930 min=  501.121 avg= 1096.294 nsec
no cache        : unw_step : 1st= 7426.297 min=  631.368 avg= 1098.641 nsec
no cache        : unw_step : 1st= 6835.395 min=  393.719 avg= 1089.486 nsec
global cache    : unw_step : 1st= 1028.732 min=  702.010 avg= 1077.695 nsec
global cache    : unw_step : 1st= 1046.393 min=  399.572 avg= 1092.139 nsec
global cache    : unw_step : 1st= 1347.393 min=  393.719 avg= 1088.214 nsec
per-thread cache: unw_step : 1st= 2194.334 min=  554.102 avg= 1092.061 nsec
per-thread cache: unw_step : 1st= 1907.349 min=  565.140 avg= 1093.666 nsec
per-thread cache: unw_step : 1st= 1852.665 min=  393.719 avg= 1088.304 nsec

4 threads:
no cache        : unw_step : 1st= 7991.438 min=  788.106 avg= 1282.086 nsec
no cache        : unw_step : 1st= 7962.739 min=  684.350 avg= 1291.780 nsec
no cache        : unw_step : 1st= 8368.934 min=  582.801 avg= 1299.171 nsec
no cache        : unw_step : 1st= 7705.951 min=  448.402 avg= 1289.624 nsec
global cache    : unw_step : 1st= 2055.256 min=  602.669 avg= 1276.634 nsec
global cache    : unw_step : 1st= 1165.602 min=  582.801 avg= 1285.326 nsec
global cache    : unw_step : 1st= 1582.834 min=  509.951 avg= 1297.648 nsec
global cache    : unw_step : 1st= 1054.291 min=  393.719 avg= 1286.160 nsec
per-thread cache: unw_step : 1st= 2203.164 min=  860.956 avg= 1284.144 nsec
per-thread cache: unw_step : 1st= 1249.490 min=  843.295 avg= 1292.000 nsec
per-thread cache: unw_step : 1st= 1119.243 min=  620.330 avg= 1297.951 nsec
per-thread cache: unw_step : 1st= 1559.564 min=  393.719 avg= 1288.791 nsec

Is this intended? We would like to use libunwind to gather stack traces
in a multithreaded application, but with this behavior I expect that
this won't work without a significant performance impact (e.g., when
128 threads are used).

According to perf and strace, a significant amount of time is spent in
the kernel, specifically in sigprocmask.
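
If I read the libunwind sources correctly (this is only my guess,
depending on configure options, so please correct me), the internal
locking blocks all signals around a mutex, which would mean two
sigprocmask system calls per lock/unlock pair. A small stand-alone demo
of what such a pair costs on its own (my own code, not taken from
libunwind):

#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <time.h>

int
main (void)
{
  pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  sigset_t full, saved;
  struct timespec t0, t1;
  long i, iters = 1000000;
  double nsec;

  sigfillset (&full);                          /* mask with every signal set */

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (i = 0; i < iters; i++)
    {
      sigprocmask (SIG_SETMASK, &full, &saved);  /* block all signals */
      pthread_mutex_lock (&lock);
      pthread_mutex_unlock (&lock);
      sigprocmask (SIG_SETMASK, &saved, NULL);   /* restore previous mask */
    }
  clock_gettime (CLOCK_MONOTONIC, &t1);

  nsec = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  printf ("%.1f nsec per lock/unlock pair incl. sigprocmask\n",
          nsec / iters);
  return 0;
}

(Built e.g. with: gcc sigmask-cost.c -pthread -o sigmask-cost)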

Robert


