tinycc-devel

Re: [Tinycc-devel] vectorize the current hash implementation


From: KHMan
Subject: Re: [Tinycc-devel] vectorize the current hash implementation
Date: Sat, 23 Apr 2016 19:24:56 +0800
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0

On 4/23/2016 6:43 PM, Vladimir Vissoultchev wrote:
Rolling hashes have a nice property -- composability -- i.e. hash(concat(a,
b)) = [hash(a) * TOK_HASH_PRIME^len(b) + hash(b)] (mod TOK_HASH_SIZE)

What I tried based on this was to pre-hash all of file->buffer and calc hash
on current (pos, len) with no loop like this:

     h = (ph[len] - (ph[0] - TOK_HASH_INIT) * hash_prime_powers[len])
         & (TOK_HASH_SIZE - 1)

... where ph is pointing to current pos in pre-hash buffer.

Unfortunately complete file->buffer pre-hashing is slow. I'm currently
working on implementing the pre-hash part in SSE2, but it looks like it's
going to be too slow again, as there is no performant way to calculate all
the intermediate hashes.

Good luck, I would love to hear about the results. It's a difficult tradeoff.

These days one can sometimes see a big performance swing between runs of the same executable, when one run happens to be accidentally good for the caches. Change an env variable, memory moves a bit, performance swings wildly...

cheers,
</wqw>

-----Original Message-----
From: Tinycc-devel On Behalf Of KHMan
Sent: Saturday, April 23, 2016 12:56 PM
To: address@hidden
Subject: Re: [Tinycc-devel] vectorize the curent hash implementation

On 4/23/2016 4:26 PM, Vladimir Vissoultchev wrote:
[snip snip snip]
(Note: the SSE calc loop oversteps the input string, so the file buffer has
to be overallocated by 8 bytes in `tcc_open_bf`)

Unfortunately there was no noticeable improvement, which leads me to
believe that hash calculation is not the main culprit for the
performance of `next_nomacro1`.

IIRC there are fast LZ compressors that are trying to use vector registers
to speed up hashing. But those have nice and small inner loops.

But compilers branch all over the place: there are no nice inner loops to
let the L1 caches go hot-rodding, and there are many data dependencies that
limit the CPU's ability to schedule multiple ops.
Branch and cache mispredictions are like train crashes these days.

Perhaps someone can test the limits on one of those academic CPU simulators
that has an 'oracle' cache setting. How much speed can we further squeeze?
I'm not optimistic there is a lot of performance left to squeeze...

--
Cheers,
Kein-Hong Man (esq.)
Kuala Lumpur, Malaysia



