Re: New ABI NSConstantString
From: David Chisnall
Subject: Re: New ABI NSConstantString
Date: Thu, 5 Apr 2018 18:41:20 +0100
On 5 Apr 2018, at 17:27, Stefan Bidigaray <address@hidden> wrote:
>
> Hi David,
> I forgot to make a comment when you originally posted the idea, and I think
> this would be a great time to add my 2 cents.
>
> Regarding the structure:
> * Would it not be better to add the flags bit field immediately after the isa
> pointer? My thought here is that it can be checked for if different versions
> of the structure exist. This is important for CoreBase since it does not have
> the luxury of real classes.
I’m concerned with structure padding here. Even on a 64-bit platform, we
either need an 8-byte flags field (which is wasteful) or end up with 4 bytes of
padding. With 128-bit pointers (which are probably coming sooner than you
expect) we will end up with 12 bytes of padding if we have a 32-bit flags field
followed by a pointer.
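
A rough sketch of the padding argument, in plain C. The struct and field names here are hypothetical and are not the proposed NSConstantString layout; they only show how a lone 32-bit field in front of a pointer forces the compiler to insert padding, while two 32-bit fields together fill the gap on a 64-bit target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: a lone 32-bit flags field directly before a pointer. */
struct flags_then_pointer {
    void       *isa;
    uint32_t    flags;  /* the compiler pads after this so `data` is aligned */
    const char *data;
};

/* Hypothetical layout: two 32-bit fields packed together before the pointer. */
struct paired_then_pointer {
    void       *isa;
    uint32_t    flags;
    uint32_t    hash;
    const char *data;
};

/* Bytes of padding between `flags` and `data` in the first layout. */
static size_t padding_lone_flags(void)
{
    size_t end_of_flags = offsetof(struct flags_then_pointer, flags) + sizeof(uint32_t);
    size_t start_of_data = offsetof(struct flags_then_pointer, data);
    assert(start_of_data >= end_of_flags);
    return start_of_data - end_of_flags;
}

/* Bytes of padding between `hash` and `data` in the second layout. */
static size_t padding_paired_fields(void)
{
    size_t end_of_hash = offsetof(struct paired_then_pointer, hash) + sizeof(uint32_t);
    size_t start_of_data = offsetof(struct paired_then_pointer, data);
    assert(start_of_data >= end_of_hash);
    return start_of_data - end_of_hash;
}
```

On a typical 64-bit target with 8-byte pointer alignment, the first function reports 4 bytes of padding and the second reports none. With 16-byte pointers the same arithmetic gives 12 bytes after a lone 32-bit field.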
> * Would it be possible to make the hash variable an NSUInteger? The output of
> -hash is an NSUInteger, and that would allow the value to be expanded in the
> future.
We can, though that would again increase the size quite noticeably. I think
I’m happy with a 32-bit hash because, as rfm points out, with a decent hash
algorithm that basically gives us unique hashes.
> * Why have both count and length? Would it not make more sense to keep a
> single variable here called count and define it as, "The count/number of code
> units"? For ASCII and UTF-8 this would be # of bytes, and for UTF-16 it would
> be the # of 16-bit codes. The Apple documentation states "The number of
> UTF-16 code units in the receiver", making at least the ASCII and UTF-16
> numbers correct. The way I understand the current implementation, the value
> for length would return the UTF-32 # of characters, which is inconsistent
> with the docs.
If a UTF-8 string contains multi-byte sequences, then the length of the buffer
and the number of UTF-16 code units will be different. If we know the number
of bytes, then we can use more efficient C standard library functions for
things like comparisons, though that may not be important.
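
To make the difference concrete, here is a minimal sketch in C. The function names are hypothetical and the counting loop assumes the input is already valid UTF-8; it is an illustration, not the runtime's implementation. It counts UTF-16 code units by skipping continuation bytes and counting two units for each 4-byte sequence, and it shows how a known byte length lets equality fall through to `memcmp`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Number of UTF-16 code units needed for a UTF-8 buffer.
 * Assumes valid UTF-8: no validation is done here.
 */
static size_t utf16_units_in_utf8(const unsigned char *bytes, size_t byte_len)
{
    size_t units = 0;
    for (size_t i = 0; i < byte_len; i++) {
        unsigned char b = bytes[i];
        if ((b & 0xC0) == 0x80)
            continue;                  /* continuation byte: no new code point */
        units += (b >= 0xF0) ? 2 : 1;  /* 4-byte sequences need a surrogate pair */
    }
    return units;
}

/* With the byte length stored, equality is a length check plus memcmp. */
static int utf8_buffers_equal(const void *a, size_t a_bytes,
                              const void *b, size_t b_bytes)
{
    assert(a != NULL && b != NULL);
    return a_bytes == b_bytes && memcmp(a, b, a_bytes) == 0;
}
```

For example, the 6-byte buffer `"h\xc3\xa9llo"` holds 5 UTF-16 code units, and the 4-byte encoding of U+1F600 holds 2, so neither number can stand in for the other.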
> * I would also think that it makes more sense to have the length/count
> variable before the data pointer. I don't have a strong opinion about this
> one, but it just makes more sense in my head.
Again, this gives us more padding in the structure.
>
> Regarding the hash function:
> Why are we using Murmur3 hash? I know it is significantly more efficient than
> our current one-at-a-time approach, but how much better is it than competing
> hash functions? Is there a benchmark out there comparing some of the major
> ones? For example, how does it compare with lookup3 or SpookyHash? If we are
> storing the hash in the string structure, the speed of calculating the hash
> is not as important as the spread. Additionally, Murmur3 seems ill suited if
> NSUInteger is used to store the hash value since, as far as I could tell, it
> only outputs 32-bit and 128-bit hashes. Lookup3 and SpookyHash, for example,
> output 64-bit values (2 32-bit words in the case of lookup3), as well.
The size of the type doesn’t necessarily give us the range. We are completely
free to use only a 32-bit or even 28-bit range within an NSUInteger (which is
what we do now), as long as we have good coverage. A good hash function has even
distribution of entropy across all bits, so taking a 32-bit or 128-bit hash and
truncating it is fine. That said, I’m happy to make the hash value 8 bytes on
64-bit platforms if this seems like a good use of bits.
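
As a small sketch of the truncation point, in C: the type and function names below are hypothetical, with `uintptr_t` standing in for NSUInteger, and the wide hash value is assumed to come from whichever hash function ends up being used. Keeping only the low bits of a wider hash gives a 28-bit or 32-bit range stored in a word-sized field.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for NSUInteger (an assumption for this sketch). */
typedef uintptr_t word_hash_t;

/*
 * Keep only the low `bits` bits of a wider hash value.
 * `bits` must be between 1 and 32 so the result fits a 32-bit word too.
 */
static word_hash_t hash_low_bits(uint64_t wide_hash, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    uint64_t mask = (UINT64_C(1) << bits) - 1;
    return (word_hash_t)(wide_hash & mask);
}
```

If the hash spreads entropy evenly across its output bits, the low 28 or 32 bits are as well distributed as any other slice, which is why the storage type and the range used can differ.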
I’m not wedded to the idea of Murmur3. We do need to use the same hash for
constant and non-constant strings, so execution speed is important. I’m
somewhat tempted to suggest SHA256, because it’s fairly easy to accelerate with
SSE and newer CPUs have full hardware offload for it. That said, the goal is
not to mandate the use of the compiler-generated hash for constant strings,
it’s to provide a space to store one that the compiler initialises to something
sensible.
Given the analysis I’ve done in the reply to Ivan, I think it’s worth consuming
space to improve performance.
David