From: Lassi Tuura
Subject: Re: [Libunwind-devel] [PATCH 0/1] Fast back-trace for x86_64 for only collecting the call stack
Date: Thu, 27 May 2010 22:03:21 +0200
User-agent: SquirrelMail/1.4.19

Hi,

> Paul and I looked into this some more.
>
> What do you think about hiding the API details behind the backtrace()
> implementation in
>
> src/mi/backtrace.c
>
> It could attempt a fast backtrace and then fall back to a slower, but
> more general, backtrace.

This is fine in itself. How do you suggest we technically handle a frame
cache that survives across calls, such that the caller has enough control
to make the cache big enough that it hits almost always after a warm-up
period?

For example, I played around with the idea of extra unw_set_foo() calls
after unw_init_local() but before starting the trace itself. I wanted to
get this patch out for discussion first, so I didn't fully explore that.
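
To make this concrete, here is a minimal sketch of how backtrace() could
hide both the cache and the fast/slow split. unw_set_frame_cache_size()
and unw_backtrace_fast() are purely hypothetical names, just to
illustrate the shape of the API:

#include <libunwind.h>

/* Sketch only: unw_set_frame_cache_size() and unw_backtrace_fast() are
   hypothetical names, not existing libunwind entry points.  */
int
backtrace (void **buffer, int size)
{
  unw_context_t ctx;
  unw_cursor_t cursor;
  int n;

  unw_getcontext (&ctx);
  unw_init_local (&cursor, &ctx);

  /* Try the fast trace first; it would use a per-thread frame cache
     sized earlier via unw_set_frame_cache_size().  */
  n = unw_backtrace_fast (&cursor, buffer, size);
  if (n >= 0)
    return n;

  /* The fast path may have clobbered the cursor, so re-initialise
     before falling back to the general unwinder.  */
  unw_getcontext (&ctx);
  unw_init_local (&cursor, &ctx);
  n = 0;
  while (n < size && unw_step (&cursor) > 0)
    {
      unw_word_t ip;
      unw_get_reg (&cursor, UNW_REG_IP, &ip);
      buffer[n++] = (void *) ip;
    }
  return n;
}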

To give you an idea of the problem I am trying to solve: our applications
load 250-300 MB from about 600 shared libraries and have somewhere around
600'000 symbols present in the process image. Any given profile of ours is
likely to include at least 50'000 unique call sites. So let's round to
O(1M) FDEs with a 10-20% hit rate; we attempt a backtrace on average 1M
times per second, and our applications run for anywhere from a minute to
a day. Most of our profiled apps have just one worker thread. Some have
up to 100 active threads, but those tend to load much less code and have
fewer call sites.

I am basically happy with any API which lets us eliminate redundant work
across backtrace calls, and which others find agreeable. Shaving 100 clock
cycles off one trace already produces a visible improvement for us.
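
(For a rough sense of scale: at 1M traces per second, 100 cycles saved
per trace is 10^8 cycles per second, i.e. on the order of 3% of a 3 GHz
core for the entire run.)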

> It may make sense to expose backtrace_{fast,slow} for users who want
> more control over which implementation gets called.
>
> Paul did some profiling of this code a few months ago and noticed that
> memcpy and unw_init_local() showed up high. Is it possible to avoid
> them? Perhaps unw_init_local() could be made cheaper by initializing
> only the subset of the state necessary for backtrace_fast().

I am all ears for any ideas for improvement. Obviously the whole idea of
the fast trace is to use a minimum amount of state. Because it calls
dwarf_step(), the context must remain valid, and it tends to get
clobbered. I am not sure it is possible to return the context to the
caller clean again.

However, since we want the fast path to work essentially always -
otherwise our profiler experiences a horrible slowdown very much visible
to the user - we could in fact remove the memcpy() and simply force the
caller to call unw_init_local() again in the slow path. That would
optimise the fast path on the assumption it is very likely to succeed.
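
Concretely, the contract at the call site could look like this, with
unw_backtrace_fast() again being a hypothetical name:

/* Sketch of the proposed contract: no defensive memcpy() of the context
   inside the fast path; on failure the caller re-initialises and takes
   the slow path.  */
unw_getcontext (&ctx);
unw_init_local (&cursor, &ctx);
if (unw_backtrace_fast (&cursor, buffer, size) < 0)
  {
    unw_getcontext (&ctx);           /* context was clobbered */
    unw_init_local (&cursor, &ctx);
    /* ... general dwarf_step()-based walk here ... */
  }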

Yes, the fast trace would indeed benefit from a lighter-weight local
init. On x86_64 it really only needs RBP, RSP and RIP. Folding this into
backtrace() would be nice in that it could hide the distinction between
light and heavy initialisation, provided we find a way to transmit the
frame cache info.
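
For what it's worth, a minimal sketch of such a light-weight capture on
x86_64 might look like the following; light_ctx and light_getcontext()
are made-up names, and this is only the register-grabbing part, not a
drop-in replacement for unw_init_local():

/* Illustrative only: capture just the three registers the fast trace
   needs on x86_64, rather than filling in a full unw_context_t.  */
struct light_ctx
{
  unsigned long rbp, rsp, rip;
};

static inline void
light_getcontext (struct light_ctx *c)
{
  __asm__ volatile ("movq %%rbp, %0\n\t"
                    "movq %%rsp, %1\n\t"
                    "leaq 0(%%rip), %2"
                    : "=r" (c->rbp), "=r" (c->rsp), "=r" (c->rip));
}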

Regards,
Lassi


