Re: New ABI NSConstantString


From: David Chisnall
Subject: Re: New ABI NSConstantString
Date: Thu, 5 Apr 2018 20:33:47 +0100

This might be slightly confusing, because your mail client doesn’t seem to do 
anything sane for quoting:

On 5 Apr 2018, at 20:09, Stefan Bidigaray <address@hidden> wrote:
> 
> On Thu, Apr 5, 2018 at 1:41 PM, David Chisnall <address@hidden> wrote:
> On 5 Apr 2018, at 17:27, Stefan Bidigaray <address@hidden> wrote:
> >
> > Hi David,
> > I forgot to make a comment when you originally posted the idea, and I think 
> > this would be a great time to add my 2 cents.
> >
> > Regarding the structure:
> > * Would it not be better to add the flags bit field immediately after the 
> > isa pointer? My thought here is that it can be checked for if different 
> > versions of the structure exist. This is important for CoreBase since it 
> > does not have the luxury of real classes.
> 
> I’m concerned with structure padding here.  Even on a 64-bit platform, we 
> either need an 8-byte flags field (which is wasteful) or end up with 4 bytes 
> of padding.  With 128-bit pointers (which are probably coming sooner than you 
> expect) we will end up with 12 bytes of padding if we have a 32-bit flags 
> field followed by a pointer.
> 
> Well, I was hoping there is a way we can define this structure so that it can 
> be used directly in CoreBase, without having to call the toll-free bridging 
> mechanism. If a 32-bit hash is used, could it be combined with the "flags" 
> variable (see the structure I included at the end of this email)? I'm hoping 
> to be able to use the same constant strings without having to call the 
> bridging mechanism. It's pretty slow and cumbersome.

Can you explain why CoreBase needs to store the hash as anything other than a 
32-bit value that it can zero-extend when returning a 64-bit value?  If the 
CoreFoundation and Foundation implementations of hash are compatible, then it 
will currently be returning a 28-bit value in a 64-bit register, so I don’t 
understand the issue here.
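
For what it’s worth, the conversion costs nothing.  A minimal sketch (the
names here are illustrative, not the CoreBase API):

#include <stdint.h>

/* The literal stores a 32-bit hash; -hash / CFHash can hand it back as a
   pointer-sized value, and the conversion is just a zero extension. */
static uintptr_t stored_hash_as_word(uint32_t stored_hash)
{
        return stored_hash;   /* implicit zero extension on LP64/LLP64 */
}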

> 
> By the way, I noticed there was no uint32_t flags field in your original 
> structure, making it 24 bytes on 32-bit CPUs.
> 
> > * Would it be possible to make the hash variable an NSUInteger? The output 
> > of -hash is an NSUInteger, and that would allow the value to be expanded 
> > in the future.
> 
> We can, though that would again increase the size quite noticeably.  I think 
> I’m happy with a 32-bit hash, because as rfm points out with a decent hash 
> algorithm that basically gives us unique hashes.
> 
> Sounds reasonable.
>  
> > * Why have both count and length? Would it not make more sense to keep a 
> > single variable here called count and define it as, "The count/number of 
> > code units"? For ASCII and UTF-8 this would be # of bytes, and for UTF-16 
> > it would be the # of 16-bit codes. The Apple documentation states "The 
> > number of UTF-16 code units in the receiver", making at least the ASCII and 
> > UTF-16 numbers correct. The way I understand the current implementation, 
> > the value for length would return the UTF-32 # of characters, which is 
> > inconsistent with the docs.
> 
> If a UTF-8 string contains multi-byte sequences, then the length of the 
> buffer and the number of UTF-16 code units will be different.  If we know the 
> number of bytes, then we can use more efficient C standard library functions 
> for things like comparisons, though that may not be important.
> 
> I guess I'm still a bit confused about the meaning and/or difference of the 
> variables count and length.

One tells you the logical number of characters, the other the length of the 
buffer in bytes.  A lot of byte-scanning functions are far more efficient if 
they know the length up front, because they can then process one word at a time 
until the last word.
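
To make the distinction concrete: for the UTF-8 literal "café", length (bytes
in the buffer) is 5 while count (UTF-16 code units) is 4.  And here is a
sketch of the kind of scan that a known byte length enables (my own example,
not existing GNUstep code):

#include <stdint.h>
#include <string.h>

/* Returns nonzero if any byte in buf[0..len) has the high bit set, i.e. the
   buffer is not pure ASCII.  Because len is known up front, the loop can
   test a whole word at a time and only handle the tail byte by byte. */
static int contains_non_ascii(const char *buf, size_t len)
{
        size_t i = 0;
        for (; i + sizeof(uintptr_t) <= len; i += sizeof(uintptr_t)) {
                uintptr_t word;
                memcpy(&word, buf + i, sizeof word);
                if (word & (uintptr_t)0x8080808080808080ULL)
                        return 1;
        }
        for (; i < len; i++)
                if (buf[i] & 0x80)
                        return 1;
        return 0;
}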

> I know this is probably going to be rejected, but how about making constant 
> strings either ASCII or UTF-16 only, scrapping UTF-8 altogether? I know this 
> would increase the byte count for most European languages using Latin 
> characters, but I don't see the point of maintaining both UTF-8 and UTF-16 
> encodings. Everything that can be done with UTF-16 can be encoded in UTF-8 
> (and vice versa), so how would the compiler pick between the two? 
> Additionally, wouldn't sticking to just one of the two encodings simplify the 
> code significantly?

There’s also the issue that -UTF8String is one of the most commonly used 
methods on NSString, so if we represent something as UTF-16 internally then it 
needs converting and returning in an autoreleased buffer, whereas with a UTF-8 
string it can just return the pointer.  On non-Windows platforms, -UTF8String 
is the way of getting a string that you pass to pretty much any OS function.
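
Something like this for the constant case (a hypothetical sketch; the real
field names and flag layout are still to be decided):

#include <stdint.h>
#include <stddef.h>

enum { kEncodingMask = 0x3, kEncodingUTF16 = 1 };

struct literal {
        uint32_t    flags;   /* low bits record the buffer encoding */
        const char *data;    /* NUL-terminated, lives in rodata */
};

/* -UTF8String fast path: an ASCII/UTF-8 literal can return its buffer
   directly; a UTF-16 literal would need converting into an autoreleased
   buffer instead (not shown). */
static const char *literal_utf8(const struct literal *s)
{
        if ((s->flags & kEncodingMask) != kEncodingUTF16)
                return s->data;
        return NULL;   /* placeholder for the conversion slow path */
}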

> 
> > * I would also think that it makes more sense to have the length/count 
> > variable before the data pointer. I don't have a strong opinion about this 
> > one, but it just makes more sense in my head.
> 
> Again, this gives us more padding in the structure.
> 
> Would it? Isn't sizeof (long) == sizeof (void *) on all 32- and 64-bit 
> architectures (except Win64)? I thought a long would not be padded any more 
> than a pointer for most applications.

Not on Win64, nor on anything with larger than 64-bit pointers.
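
(Win64 is LLP64: long stays 32-bit while pointers are 64-bit.  A two-line
check makes the point:)

#include <stdio.h>

int main(void)
{
        /* Win64 (LLP64) prints 4 and 8; LP64 Unix prints 8 and 8. */
        printf("sizeof(long)=%zu  sizeof(void *)=%zu\n",
               sizeof(long), sizeof(void *));
        return 0;
}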

> >
> > Regarding the hash function:
> > Why are we using Murmur3 hash? I know it is significantly more efficient 
> > than our current one-at-a-time approach, but how much better is it to 
> competing hash functions? Is there a benchmark out there comparing some of 
> > the major ones? For example, how does it compare with lookup3 or 
> > SpookyHash. If we are storing the hash in the string structure, the speed 
> > of calculating the hash is not as important as the spread. Additionally, 
> > Murmur3 seems ill suited if NSUInteger is used to store the hash value 
> > since, as far as I could tell, it only outputs 32-bit and 128-bit hashes. 
> > Lookup3 and SpookyHash, for example, output 64-bit values (2 32-bit words 
> > in the case of lookup3), as well.
> 
> The size of the type doesn’t necessarily give us the range.  We are 
> completely free to give only a 32-bit or even 28-bit range within an 
> NSUInteger (which is what we do now), as long as we have good coverage.  A good 
> hash function has even distribution of entropy across all bits, so taking a 
> 32-bit or 128-bit hash and truncating it is fine.  That said, I’m happy to 
> make the hash value 8 bytes on 64-bit platforms if this seems like a good use 
> of bits.
> 
> I’m not wedded to the idea of Murmur3.  We do need to use the same hash for 
> constant and non-constant strings, so execution speed is important.  I’m 
> somewhat tempted to suggest SHA256, because it’s fairly easy to accelerate 
> with SSE and newer CPUs have full hardware offload for it.  That said, the 
> goal is not to mandate the use of the compiler-generated hash for constant 
> strings, it’s to provide a space to store one that the compiler initialises 
> to something sensible.
> 
> Given the analysis I’ve done in the reply to Ivan, I think it’s worth 
> consuming space to improve performance.
> 
> I agree.
> 
> So how about a structure like:
> 
> struct { 
>         id isa; /* Class pointer */ 
>         uint64_t flags;
>         /* Flags bitfield:
>            Low 2 bits, enum with values:
>            0: ASCII string
>            1: UTF-16 string
>            2 and 3: Reserved for future encodings
>            (1<<2) to (1<<3): 0 for one-at-a-time; 1 for murmur hash; 2 and 3 
> reserved for future hashes
>            (1<<4) to (1<<15): Reserved for future compiler-defined flags
>            (1<<16) to (1<<31): Reserved for use by the constant string class 
> (I'm hoping this could hold the CFTypeID of a constant string so it can be 
> identified by corebase)
>            (1<<32) to (1<<63): hash
>         */
>         const char *data; /* Pointer to the buffer.  ro_data section, so 
> immutable.  NULL-terminated */
>         long count;  /* Number of UTF-16 code units, not including the null 
> terminator */
> }

I don’t see why we’d use a single uint64_t rather than a pair of uint32_ts and 
I don’t like the ordering (it will be annoying to have to order the fields 
differently on 128-bit pointer platforms).  I’m not convinced that it’s worth 
omitting the length to save 8 bytes per string.  It’s probably also not 
actually worth using longs for the length on 64-bit platforms, so both of these 
should probably be 32 bits.  4GB of string literal seems a bit excessive (for 
one thing, I doubt the compiler will be entirely happy with it, and I don’t 
know how happy linkers are with 4GB symbols…).
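
For concreteness, the shape I have in mind is roughly the following (a sketch
only; the field names and exact flag layout are still open):

#include <stdint.h>

struct constant_string {
        void       *isa;     /* class pointer */
        uint32_t    flags;   /* encoding, hash kind, reserved bits */
        uint32_t    hash;    /* 32-bit hash, zero-extended by -hash */
        const char *data;    /* NUL-terminated buffer in rodata */
        uint32_t    length;  /* size of the buffer in bytes */
        uint32_t    count;   /* number of UTF-16 code units */
};

/* 32 bytes with 64-bit pointers and 24 bytes with 32-bit pointers, with no
   internal padding in either case: each pair of uint32_t fields packs into
   one pointer-sized slot on 64-bit platforms. */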

David



