gnustep-dev

Re: Hash computation and TFB


From: Luboš Doležel
Subject: Re: Hash computation and TFB
Date: Tue, 06 Aug 2013 15:36:52 +0200
User-agent: Roundcube Webmail/0.5

Yes, I've just noticed that once I force the use of UTF-16 in CFStringHash(), -hash and CFStringHash() give the same value. The question is whether this holds for all other bridged types.

Until a better/permanent solution is found, do you think the changes forcing UTF-16 in CFStringHash() are acceptable? I'm currently having problems implementing IOKit, because CFDictionary doesn't return the values for keys I give to it :-(

Luboš

On Tue, 6 Aug 2013 08:30:10 -0500, Stefan Bidi wrote:
I copied the hash algorithm straight out of -base, so they should
match.  I remember a few months ago Richard was playing around with
hash functions and this might be causing some issues, now.

I just looked it up: the changes were made in rev 36344.

There is another issue... -base allows UTF-8 strings, which will not
be hashed to the same UTF-16 value.  In my opinion, allowing UTF-8
string literals is not a good idea and base should revert back to
Latin1 as the default C string encoding.  I'm actually debating
adding a UTF-16 string literals configure option for corebase.  I
believe using UTF-16 internally is the only sane solution to non-ASCII
encodings.

I've tried experimenting with other hash functions that are not
one-at-a-time, but unfortunately have not found anything that will
work on both ASCII and Unicode strings consistently.  It would be
really nice to be able to work with 32- or 64-bit integers directly
instead of 8- or 16-bit characters.  If we could use UTF-16 across the
board, this wouldn't be a problem.

Anyway, those are my thoughts.

On Tue, Aug 6, 2013 at 8:14 AM, Luboš Doležel wrote:

Hello,

hash computation with Toll-Free Bridging is a tricky subject. Do
it wrong and you'll get all sorts of trouble, especially with
dictionaries, which use hashes a lot.

The code in corebase currently dispatches all CFHash() calls on
ObjC objects to -hash, which is bad. The following expectation
breaks due to this dispatch:

CFHash(@"string") == CFHash(CFSTR("string"))

because NSString uses a different hashing algorithm than CFString.
My suggestion is to do away with the ObjC dispatch in CFHash() and
alter all the CF*Hash() functions to support ObjC types.

While looking at CFStringHash(), I've also noticed that either
8-bit or 16-bit raw character data is used for hashing, based on
what is available. I believe this breaks the following case:

===
CFStringRef str1 = CFSTR("str");
const UniChar chars[] = { 's', 't', 'r' }; // "str" in UTF-16
CFStringRef str2 = CFStringCreateWithCharacters(NULL, chars, 3);

CFHash(str1) == CFHash(str2); // expected to hold
===

While the two strings are obviously identical, different bytes are
used to generate the hash in each case.

This problem can be solved by converting the character data to
Unicode first, which has a performance impact, but only once for
every CFString.
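
The mismatch and the conversion-first idea can be sketched in standalone C. This is only an illustration under stated assumptions: it does not use corebase, the names hash_raw_storage() and hash_widened_to_utf16() are hypothetical, and Jenkins' one-at-a-time mixing is used as a stand-in since the actual CFStringHash() code may differ. Widening each char to a 16-bit unit is only valid for ASCII/Latin-1 input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One mixing step of Jenkins' one-at-a-time hash (a stand-in, not corebase code). */
static uint32_t oaat_step(uint32_t h, uint8_t byte)
{
    h += byte;
    h += h << 10;
    h ^= h >> 6;
    return h;
}

/* Final avalanche of the one-at-a-time hash. */
static uint32_t oaat_finish(uint32_t h)
{
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}

/* Hashes whatever bytes the string happens to be stored as (8-bit or 16-bit). */
static uint32_t hash_raw_storage(const void *data, size_t nbytes)
{
    const uint8_t *p = data;
    uint32_t h = 0;
    assert(data != NULL || nbytes == 0);
    for (size_t i = 0; i < nbytes; i++)
        h = oaat_step(h, p[i]);
    return oaat_finish(h);
}

/* Hashes 8-bit ASCII/Latin-1 text as if it were stored as UTF-16:
 * each char is widened to a 16-bit unit and its bytes are fed to the
 * hash in native byte order, matching an in-memory uint16_t array. */
static uint32_t hash_widened_to_utf16(const char *text, size_t nchars)
{
    uint32_t h = 0;
    assert(text != NULL || nchars == 0);
    for (size_t i = 0; i < nchars; i++) {
        uint16_t unit = (uint8_t)text[i];
        uint8_t bytes[sizeof unit];
        memcpy(bytes, &unit, sizeof unit);
        for (size_t j = 0; j < sizeof unit; j++)
            h = oaat_step(h, bytes[j]);
    }
    return oaat_finish(h);
}
```

For "str" stored once as three chars and once as three 16-bit units, hash_raw_storage() is fed 3 and 6 bytes respectively, so the two results will generally differ. hash_widened_to_utf16() on the 8-bit copy feeds the same bytes as hash_raw_storage() on the 16-bit copy, so those two agree.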

The situation with CFHash() calls on NSStrings is worse, since
corebase has nowhere to save the calculated hash, so it must be
recalculated every time. But I think it's better to be slow than to
be wrong. Please review the attached patch and let me know if you
have any observations.
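
The cost difference between the two cases can be sketched as follows. This is a hypothetical illustration, not corebase code: SketchString, sketch_string_hash() and foreign_string_hash() are made-up names, the hash is a one-at-a-time stand-in over 16-bit units, and reserving 0 as the "not computed yet" marker is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical string object that has room to cache its hash. */
typedef struct {
    const uint16_t *chars;   /* UTF-16 code units */
    size_t length;           /* number of code units */
    uint32_t cached_hash;    /* 0 means "not computed yet" */
} SketchString;

/* Counts how often the full hash is actually computed. */
static unsigned hash_computations = 0;

/* Stand-in hash: one-at-a-time mixing over 16-bit units. */
static uint32_t compute_hash(const uint16_t *chars, size_t length)
{
    uint32_t h = 0;
    hash_computations++;
    for (size_t i = 0; i < length; i++) {
        h += chars[i];
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return (h != 0) ? h : 1;   /* keep 0 free as the "unset" marker */
}

/* Object with a cache slot: the hash is computed on first use only. */
static uint32_t sketch_string_hash(SketchString *s)
{
    assert(s != NULL);
    if (s->cached_hash == 0)
        s->cached_hash = compute_hash(s->chars, s->length);
    return s->cached_hash;
}

/* Foreign object with no cache slot: the hash is recomputed on every call. */
static uint32_t foreign_string_hash(const uint16_t *chars, size_t length)
{
    return compute_hash(chars, length);
}
```

Both paths return the same value for the same characters. Only the number of full computations differs, which is the "slow but not wrong" trade-off described above.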

--
Luboš Doležel



--
Luboš Doležel


