
From: Lassi Tuura
Subject: Re: [Libunwind-devel] libunwind with LD_PRELOAD option
Date: Mon, 5 Sep 2011 20:40:59 +0200

Hi,

> Here is the malloc example
> void *malloc(size_t size)
> {
>     void *p;
> 
>     if (!origMallocFp)
>         getInstance();  /* this is done with pthread_once */

If I understand your quote correctly, this may end up calling dlsym(), which
may internally call malloc(). You are not really pasting enough code here to
tell for sure that your code is problem-free; it's hard to reason about it
based on the information at hand. You might want to review your code with a
very critical eye on every call it makes.
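
For what it's worth, one common way to break that bootstrap cycle is to serve
allocations from a small static arena while dlsym() is still resolving the
real malloc. The following is only a sketch of the general technique, not a
drop-in fix (and it is different from your pthread_once arrangement);
origMallocFp is borrowed from your excerpt, everything else is made up:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

static void *(*origMallocFp)(size_t);
static char bootstrap_heap[4096]
    __attribute__((aligned(16)));      /* feeds dlsym()'s own mallocs */
static size_t bootstrap_used;
static __thread int resolving;         /* flags re-entry from dlsym() */

void *malloc(size_t size)
{
    if (!origMallocFp) {
        if (resolving) {
            /* dlsym() called malloc() while we were inside dlsym():
               hand out bump-allocated bootstrap memory instead */
            size = (size + 15) & ~(size_t)15;
            if (bootstrap_used + size > sizeof bootstrap_heap)
                return NULL;           /* bootstrap arena exhausted */
            void *p = bootstrap_heap + bootstrap_used;
            bootstrap_used += size;
            return p;
        }
        resolving = 1;
        origMallocFp = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
        resolving = 0;
    }
    /* ... backtrace/profiling work goes here ... */
    return origMallocFp(size);
}

A real interposer needs the same treatment for calloc and realloc (some glibc
versions call calloc from within dlsym), and free() has to recognise and
ignore pointers that came out of the bootstrap arena.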

>         formStackPacket (packet+pktHdrLen, (unsigned int *)bt, numEntries);
>         rssSend (packet, sizeof (unsigned int) * (numEntries + pktHdrLen));

These are black boxes, so it's hard to say what they do. Could they allocate
memory or otherwise get into trouble?

It could be something as simple as these bits having an error path which gets
triggered when you send the larger packets carrying a full stack trace, and
that error path doing some memory allocation. Without stack tracing you might
never hit the error path.

> Yes, I experienced this when I first tried with glibc backtrace() and also
> with printfs when I first started.
> Hence I removed all that, and this works fine. I can profile the app for
> days on end and get the stats.
> Without stack traces, this is only half the job done, and teams take longer
> to find the exact place of the leak :-(

Unfortunately that doesn't say much. You could just be lucky and not call
anything which triggers problems. For example, if you add stacks to your
network traffic, maybe it exceeds some threshold and does an allocation, or
hits an error path you don't otherwise trigger, or ...?

> Also, when I don't link with -lunwind, the code is stable. I have tried
> with different versions of the app and it is consistent.
> So there is no recursive malloc hazard without unwind, for sure.

It's a data point, but it could just be circumstantial. It's hard to say for
sure from that data alone.

> Great. Did you use the LD_PRELOAD trick? It is so appealing because of its
> ease of instrumentation.

No, we inject a hook into functions by rewriting the function prologue on the 
fly.
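
To give you an idea of its shape, here is a deliberately naive sketch of the
simplest form of prologue patching on x86-64 Linux. install_hook is a made-up
helper, not our actual tool: it clobbers %rax, it is not safe against threads
executing the prologue mid-rewrite, W^X policies may refuse the mprotect(),
and a real implementation relocates the overwritten instructions into a
trampoline so the original function stays callable:

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static int install_hook(void *target, void *hook)
{
    /* movabs $hook, %rax ; jmp *%rax -- 12 bytes in total */
    unsigned char stub[12] = { 0x48, 0xB8 };
    memcpy(stub + 2, &hook, sizeof hook);
    stub[10] = 0xFF;
    stub[11] = 0xE0;

    long pagesz = sysconf(_SC_PAGESIZE);
    void *page = (void *)((uintptr_t)target & ~(uintptr_t)(pagesz - 1));

    /* make the code writable, patch the prologue, restore protection;
       two pages in case the stub straddles a page boundary */
    if (mprotect(page, 2 * pagesz, PROT_READ | PROT_WRITE | PROT_EXEC) != 0)
        return -1;
    memcpy(target, stub, sizeof stub);
    return mprotect(page, 2 * pagesz, PROT_READ | PROT_EXEC);
}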

> That's why I am not giving up yet on getting the backtrace. The target is a
> small device with a NAND-based filesystem and cannot hold huge amounts of
> data. Hence I send the data to a host for post-processing.

Let me throw a few ideas out here, though extensive follow-up would probably
be better taken off the libunwind list.

On x86-64 we use libunwind to capture a stack trace on every allocation (ia32
uses something else). Each allocation is associated with its full stack
trace, and we can dump this "heap snapshot" at any time while the app is
running, or at the end as a final profile result. We use these for leak
checking, identifying peak use, general allocation profiling, correlating
performance and allocation behaviour, looking for churn, delta comparisons
between runs/versions, fragmentation and locality studies, etc. The heap
snapshots are many orders of magnitude smaller than the entire stream of
per-allocation stack traces would be.
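
The capture itself is just the standard local unwinding loop from the
libunwind API; the pcs buffer and the depth limit below are the caller's
choice, not something libunwind dictates:

#define UNW_LOCAL_ONLY
#include <libunwind.h>

static int capture_stack(void **pcs, int max_depth)
{
    unw_context_t ctx;
    unw_cursor_t cursor;
    int depth = 0;

    unw_getcontext(&ctx);
    unw_init_local(&cursor, &ctx);

    /* walk the caller frames, recording each instruction pointer */
    while (depth < max_depth && unw_step(&cursor) > 0) {
        unw_word_t ip;
        unw_get_reg(&cursor, UNW_REG_IP, &ip);
        pcs[depth++] = (void *)ip;
    }
    return depth;
}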

The applications we profile generate a prodigious number of allocation
samples: stacks on average 40 levels deep, from 700 or so shared libraries,
1-3 million times a second. It's not unusual for us to track ~7-10 million
concurrently live allocations. The apps run anywhere from ~15 minutes to 24
hours.

A long, long time ago we used to generate a serialised stream of stack
traces, as you appear to do, then absorb it in a collector into a summary. We
moved away from that because there was no way to deal with the data stream at
the rate it was produced, even when the consumer was multi-threaded and used
numerous tricks to speed up consuming the stack trace data. But maybe your
data rate isn't as high as ours...

We've settled on a data structure which is moderate enough in extra size (it
needs <100% extra virtual memory) and fast enough to update (~140% run-time
increase at a 1 MHz allocation rate, vs. x10-20 for valgrind), and which
handles multi-threaded apps too. If the allocation rate is less frantic, the
overhead is less, much less. The heap snapshots are a very manageable size,
about 30 MB compressed per 1-2 GB of VSIZE.
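
I can't paste our exact structure, but the core idea is simple enough to
sketch: intern each unique stack once in a hash table and keep per-stack
counters, so the snapshot costs one small record per distinct call path
instead of one per allocation. Every name and size below is illustrative, and
a real version needs locking plus a node allocator that cannot recurse into
the hooked malloc:

#include <stdint.h>
#include <string.h>

#define MAX_DEPTH 40
#define NBUCKETS  (1u << 18)

struct stack_node {
    struct stack_node *next;            /* hash chain */
    uint64_t live_bytes;                /* += size on malloc, -= on free */
    uint64_t live_allocs;
    int depth;
    void *pcs[MAX_DEPTH];
};

static struct stack_node *buckets[NBUCKETS];
static struct stack_node pool[1 << 16]; /* fixed pool: never calls malloc */
static size_t pool_used;

static struct stack_node *intern_stack(void **pcs, int depth)
{
    /* FNV-1a over the return addresses picks the hash bucket */
    uint64_t h = 0xcbf29ce484222325ull;
    for (int i = 0; i < depth; ++i) {
        h ^= (uintptr_t)pcs[i];
        h *= 0x100000001b3ull;
    }
    h &= NBUCKETS - 1;

    for (struct stack_node *n = buckets[h]; n; n = n->next)
        if (n->depth == depth && !memcmp(n->pcs, pcs, depth * sizeof *pcs))
            return n;

    if (pool_used == sizeof pool / sizeof pool[0])
        return NULL;                    /* pool exhausted */
    struct stack_node *n = &pool[pool_used++];
    n->depth = depth;
    memcpy(n->pcs, pcs, depth * sizeof *pcs);
    n->next = buckets[h];
    buckets[h] = n;
    return n;
}

Each live allocation then carries just a pointer to its stack_node, and
dumping the snapshot is a single walk over the pool.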

I don't know what sort of constraints you have on your target device, or what
your target app's behaviour is, but my experience was that summarising the
allocation data in-process, in virtual memory, was by far the winner. YMMV;
much depends on how much extra RAM you can spend, what sort of allocation
rate you experience, and other factors.

Regards,
Lassi



