Re: [Libunwind-devel] Another optimisation for x86-64 fast trace
From: Arun Sharma
Subject: Re: [Libunwind-devel] Another optimisation for x86-64 fast trace
Date: Wed, 30 Mar 2011 11:51:16 -0700
On Wed, Mar 30, 2011 at 8:05 AM, Lassi Tuura <address@hidden> wrote:
> For completeness, perhaps I should mention that I also tested with ".p2align
> 2" and ".p2align 4" right before ".global _Ux86_64_getcontext_trace". The
> results started to be slightly sporadic, but curiously all the aligned
> versions were slightly but systematically slower than the unaligned one (by
> ~1-2%).
>
> The function is definitely unaligned with the patch, at offset 0x4e09 into
> the shared library in my case.
>
Effects like this are usually related to how the x86 instruction fetch
and decode stages work on your CPU. On the Nehalem/Westmere generation
the front end fetches code in 16-byte chunks and decodes up to four
instructions per cycle (three simple, one complex). There are a lot of
interesting stories about how inserting or removing a nop from a hot
loop changes throughput significantly.
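
As a rough illustration of the arithmetic only (not a model of the
decoder), the sketch below shows where the entry point from the quoted
mail, 0x4e09, falls inside a 16-byte chunk, and where ".p2align 2" or
".p2align 4" would move it. The helper names are made up for this
example, and it assumes the library is mapped at an address that is a
multiple of 16, so that the file offset and the run-time address agree
modulo 16.

```python
FETCH_BLOCK = 16  # bytes per fetch chunk (Nehalem/Westmere figure from the text)


def align_up(addr, p2):
    """Round addr up to a 2**p2 boundary, as '.p2align p2' does by padding."""
    mask = (1 << p2) - 1
    return (addr + mask) & ~mask


def bytes_left_in_block(addr, block=FETCH_BLOCK):
    """Bytes from addr to the end of the fetch chunk that contains it."""
    return block - (addr % block)


entry = 0x4E09  # offset reported in the quoted mail (assumed == address mod 16)

for p2 in (0, 2, 4):  # 0 means no alignment directive
    start = align_up(entry, p2)
    print(
        f".p2align {p2}: entry at {start:#x}, "
        f"offset {start % FETCH_BLOCK} in chunk, "
        f"{bytes_left_in_block(start)} bytes before the next chunk"
    )
```

This only shows that each directive changes how the first instructions
are split across fetch chunks. It does not say which layout is faster;
that depends on the instruction lengths and has to be measured.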
-Arun