If I understand your quote correctly, this may end up calling dlsym(), which may internally call malloc(). You haven't really pasted enough code here to tell for sure that your code is problem free; it's hard to reason about the code based on the information at hand. You might want to review your code with a very critical eye on every call it makes.
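To make the hazard concrete, here is a minimal sketch of a per-thread reentrancy guard. It is not your code and not ours: traced_malloc(), record_allocation() and the counters are hypothetical names, and the recorder merely stands in for whatever captures and sends the stack. The recorder allocates on purpose, to show that the nested call is served without being traced again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Per-thread state, so one thread's hook never blocks tracing in another. */
static _Thread_local int in_hook;
static _Thread_local unsigned long traced_calls;  /* allocations recorded */
static _Thread_local unsigned long nested_calls;  /* allocations made inside the hook */

void *traced_malloc(size_t size);

/* Hypothetical recorder: stands in for stack capture plus sending.
 * It allocates deliberately, as an error path or dlsym() might. */
static void record_allocation(void *p, size_t size)
{
    (void)p;
    (void)size;
    assert(in_hook);                    /* must only run under the guard */
    void *scratch = traced_malloc(64);  /* re-enters the wrapper */
    free(scratch);
    traced_calls++;
}

/* Hypothetical wrapper: a real interposer would be named malloc() and
 * forward to the next malloc in the link chain instead of calling libc. */
void *traced_malloc(size_t size)
{
    void *p = malloc(size);
    if (in_hook) {          /* called from inside the hook: do not trace */
        nested_calls++;
        return p;
    }
    in_hook = 1;
    if (p != NULL)
        record_allocation(p, size);
    in_hook = 0;
    return p;
}
```

A guard like this only protects the tracing path. It does not help if the very first call has to allocate before the real malloc() has been located, so that case still needs separate handling.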
> Here is the malloc example
> void *malloc(size_t size)
> void *p;
> getInstance(); => This is done with pthread_once.
Black boxes, hard to say what they do. Could they allocate memory or otherwise end in trouble?
> formStackPacket (packet+pktHdrLen, (unsigned int *)bt, numEntries);
> rssSend (packet, sizeof (unsigned int) * (numEntries + pktHdrLen));
It could be something as simple as these bits having an error path which fires when you send more data with the full stack trace, and that error path doing some memory allocation. Without stack tracing you might never hit the error path.
Unfortunately that doesn't say much. You could just be lucky and not call anything which triggers problems. For example if you add stacks to your network stuff, maybe it exceeds some threshold and does some allocation, or hits an error path you don't otherwise trigger, or ...?
> Yes, i experienced this when i first tried with glibc backtrace() and also printfs when i first started.
> Hence i removed all that and this works fine. For days together i can profile the app and get the stats.
> Without stack trace, this is only half the job done and teams take longer to find the exact place of leak :-(
It's a data point, but it could just be circumstantial. It's hard to say for sure from that data alone.
> Also, when i don't link with -lunwind, the code is stable. I have tried with different versions of the app and it is consistent.
> So there is no recursive malloc hazard without unwind for sure.
No, we inject a hook into functions by rewriting the function prologue on the fly.
> Great. Did you use LD_PRELOAD trick? It is so appealing because of it's ease of instrumentation.
Let me throw out a few ideas here, though any extensive follow-up would probably be better taken off the libunwind list.
> That's why i am not giving up yet to get the backtrace. The target is a small device with nand based filesystem and cannot hold huge data. Hence i send it to host for post processing.
On x86-64 we use libunwind to capture a stack trace on every allocation (ia32 uses something else). Each allocation is associated with its full stack trace, and we can dump this "heap snapshot" at any time during the run, or at the end as a final profile result. We use these for leak checking, identifying peak use, general allocation profiling, correlating performance and allocation behaviour, looking for churn, delta comparisons between runs/versions, fragmentation and locality studies, etc. The heap snapshots are many orders of magnitude smaller than the entire stream of stack traces on allocation would be.
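As an illustration of why a snapshot is so much smaller than the stream, here is a minimal sketch of per-stack aggregation. This is not our actual data structure; struct site, note_alloc(), note_free() and the table sizes are hypothetical. The stack is passed in as an array of already-captured addresses, the table is a fixed-size static array so that updating it never allocates, and the sketch is not thread-safe as written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_DEPTH  64
#define TABLE_SIZE 1024   /* fixed size: the table itself never calls malloc */

static_assert((TABLE_SIZE & (TABLE_SIZE - 1)) == 0, "TABLE_SIZE must be a power of two");

struct site {
    uintptr_t frames[MAX_DEPTH];
    int       depth;        /* 0 marks an empty slot */
    size_t    live_bytes;   /* bytes currently allocated from this stack */
    size_t    live_count;   /* allocations currently live from this stack */
};

static struct site table[TABLE_SIZE];

/* FNV-1a style mixing, applied to whole addresses rather than bytes. */
static uint64_t hash_stack(const uintptr_t *frames, int depth)
{
    uint64_t h = 1469598103934665603ULL;
    for (int i = 0; i < depth; i++) {
        h ^= (uint64_t)frames[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Find or create the slot for a stack, using linear probing. */
static struct site *find_site(const uintptr_t *frames, int depth)
{
    if (depth <= 0 || depth > MAX_DEPTH)
        return NULL;
    size_t bytes = (size_t)depth * sizeof frames[0];
    size_t idx = (size_t)hash_stack(frames, depth) & (TABLE_SIZE - 1);
    for (size_t probe = 0; probe < TABLE_SIZE; probe++) {
        struct site *s = &table[(idx + probe) & (TABLE_SIZE - 1)];
        if (s->depth == 0) {                     /* empty: claim it */
            memcpy(s->frames, frames, bytes);
            s->depth = depth;
            return s;
        }
        if (s->depth == depth && memcmp(s->frames, frames, bytes) == 0)
            return s;                            /* same stack seen before */
    }
    return NULL;                                 /* table full */
}

/* Record an allocation; the caller keeps the returned site with the block. */
struct site *note_alloc(const uintptr_t *frames, int depth, size_t size)
{
    struct site *s = find_site(frames, depth);
    if (s != NULL) {
        s->live_bytes += size;
        s->live_count++;
    }
    return s;
}

/* Record the matching free against the site returned by note_alloc(). */
void note_free(struct site *s, size_t size)
{
    if (s != NULL && s->live_count > 0) {
        s->live_bytes -= size;
        s->live_count--;
    }
}
```

However many allocations come from one call path, they cost one table entry, so dumping the table at any moment gives a snapshot whose size depends on the number of distinct stacks and not on the number of allocations.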
The applications we profile generate a prodigious number of allocation samples: stacks on average 40 levels deep, from 700 or so shared libraries, 1-3 million times a second. It's not unusual for us to track ~7-10 million concurrently live allocations. The apps run anywhere from ~15 minutes to 24 hours.
A long time ago we used to generate a serialised stream of stack traces, like you appear to do, and then absorb it in a collector to summarise it. We moved away from doing that because there was no way to deal with the data stream at the rate it was produced, even if the consumer was multi-threaded and used numerous tricks to speed up consuming the stack trace data. But maybe your data rate isn't as high as ours...
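For a rough sense of scale, here is a back-of-the-envelope sketch of the raw stream size. The function name is hypothetical, and it assumes one 8-byte address per frame on x86-64 with no packet headers or compression, so treat the result as a lower bound on the uncompressed stream.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second of raw stack data, ignoring headers and compression. */
uint64_t raw_stream_bytes_per_sec(uint64_t frames_per_stack,
                                  uint64_t bytes_per_frame,
                                  uint64_t allocs_per_sec)
{
    assert(frames_per_stack > 0 && bytes_per_frame > 0);
    return frames_per_stack * bytes_per_frame * allocs_per_sec;
}
```

With 40 frames of 8 bytes each, 1 million allocations a second comes to 320,000,000 bytes a second and 3 million to 960,000,000, which is the kind of rate a collector would have to keep up with.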
We've settled on a data structure which is moderate enough in extra size (= needs <100% extra virtual memory), is fast enough to update (~140% run time increase at a 1 MHz allocation rate, vs. x10-20 for valgrind), and handles multi-threaded apps too. If the allocation rate is less frantic, the overhead is less, much less. The heap snapshots are a very manageable size, about 30MB compressed per 1-2 GB of VSIZE.
I don't know what sort of constraints you have on your target device, or what your target app's behaviour is, but my experience was that summarising the allocation data in-process, in virtual memory, was by far the winner. YMMV: much depends on how much extra RAM you can expend, what sort of allocation rate you experience, and other factors.