gnustep-dev
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSeq


From: David Chisnall
Subject: Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSequenceAtIndex:
Date: Sun, 8 Apr 2018 12:33:05 +0100

On 8 Apr 2018, at 09:38, Fred Kiefer <address@hidden> wrote:
> 
> 
> 
>> Am 07.04.2018 um 20:51 schrieb David Chisnall <address@hidden>:
>> 
>> I am testing out a new version of the compiler / runtime that is producing 
>> NSConstantString instances with UTF-16 data.  I have currently disabled a 
>> lot of the NSConstantString optimisations, on the basis of ‘make it work 
>> then make it fast’ and I’m still seeing quite a lot of test failures.  The 
>> most recent ones seem to come from the fact that GSUnicodeString’s 
>> implementation of rangeOfComposedCharacterSequenceAtIndex: calls 
>> rangeOfSequence_u(), which returns a different range to NSString’s 
>> implementation.
>> 
>> I have ls (an GSUnicodeString) and indianLong (a UTF-16 NSConstantString) 
>> from the NSString/test00.m. If I call -getCharacters:range: on both, then I 
>> get the same set of characters for [indianLong length] characters.  This is 
>> as expected.  When searching for indianLong in ls, it is not found.  
>> Sticking in a lot of debugging code, I eventually tracked it down to this 
>> disagreement and when I comment out GSUnicodeString’s implementation of 
>> rangeOfComposedCharacterSequenceAtIndex: so that it uses the superclass 
>> implementation then this test passes.
>> 
>> Please can someone who understands these bits of exciting unicode logic take 
>> a look and see if there’s any reason for the disagreement?
> 
> I am surely no expert here, but I had a quick look at the code and the two 
> algorithms seem to be very similar. The only difference is the set of code 
> points that the characters get compared to. NSString uses [NSCharacterSet 
> nonBaseCharacterSet], which looks correct to me. On the other hand GSString 
> uses uni_isnonsp(), which I would read as "non spacing“ but is never 
> explained. The code here is as follows: 
> 
> BOOL
> uni_isnonsp(unichar u)
> {
>  /*
>   * Treating upper surrogates as non-spacing is a convenient solution
>   * to a number of issues with UTF-16
>   */
>  if ((u >= 0xdc00) && (u <= 0xdfff))
>    return YES;
> 
> // FIXME check is uni_cop good for this
>  if (GSPrivateUniCop(u))
>    return YES;
>  else
>    return NO;
> }
> 
> As a side effect this should handle the upper surrogates correctly, but not 
> the lower and I have no idea what GSPrivateUniCop does, even after looking at 
> the code various times. OK, it is a binary search on uni_cop_table, but what 
> is in that table?

Thanks.  I co-mentored a GSoC student a couple of years ago working on unicode 
collation and discovered quite how little I understand about unicode. I am 
reminded of that now:

[NSCharacterSet nonBaseCharacterSet] contains the characters in the Mark sets, 
which are mark sets, which appear to be the combining diacritics.  I don’t 
believe that the current code correctly handles other combining sets, such as 
the emoji flag sequences, which are not in the Mark sets.  A simple test 
demonstrates this:

```objc
#import <Foundation/Foundation.h>

int main(void)
{
        @autoreleasepool {
        uint32_t utf32[] = {0x1F1E7, 0x1F1EA};
        NSString *str = [[NSString alloc] initWithBytes: utf32 length: 
sizeof(utf32) encoding: NSUTF32LittleEndianStringEncoding];
        NSLog(@"%@, %ld", str, (unsigned long)[str 
rangeOfComposedCharacterSequenceAtIndex: 0].length);
        }
}
```

On macOS, this prints a Belgian flag and tells me that it is 4 UTF-16 code 
units long.  On GNUstep for me it prints the flag but tells me that the length 
is 1 with the version from NSString and 2 with the version from GSString.

For NSRegularExpression, I wrote a pair of adaptors classes that allow you to 
expose either the NSString or ICU UText interfaces using the other’s 
representation.  I wonder if we should consider moving to something like that 
on platforms that support ICU?  I’m not sure where the UText interface comes 
from, but it seems to now be increasingly widely used inside ICU and could have 
been designed specifically to support the abstractions that NSString exposes 
(it requires that you be able to access the string as UTF-16 code units and 
works by providing a buffer that is filled using a mechanism very similar to 
-getCharacters:range:).  I have implemented a simple function that uses ICU to 
DTRT:

```objc
NSRange rangeOfComposedCharacterSequenceAtIndex(NSString *str, NSUInteger idx)
{
        UText *txt = UTextInitWithNSString(NULL, str);
        UErrorCode e = 0;
        UBreakIterator *it = ubrk_open(UBRK_CHARACTER, NULL, NULL, 0, &e);
        ubrk_setUText(it, txt, &e);
        if (U_FAILURE(e))
        {
                NSLog(@"Error handling goes here");
        }
        ubrk_preceding(it, idx);
        uint32_t origin = ubrk_current(it);
        uint32_t next = ubrk_next(it);
        if (next == idx)
        {
                origin = next;
                next = ubrk_next(it);
        }
        if (next == UBRK_DONE)
        {
                [NSException raise: NSRangeException format:@"Invalid 
location."];
        }
        return NSMakeRange(origin, next - origin);
}
```

Creating a new UBreakIterator each call is expensive, but we could cache it in 
TLS.  Similarly, we can allocate the UText object (and its associated buffer) 
on the stack.  I’d expect that to make this operation fairly cheap, but I don’t 
have any Objective-C code where unicode string rendering is on the critical 
path so I can’t usefully measure it.  I’ve tested this with a flag and with an 
a + ogonek + acute (U+0061, U+0328, U+0301) and it gives the correct range in 
both cases.

The existing implementation appears to have another bug, in that it always 
gives you a range that starts from the index that you give it, even if that’s 
in the middle of a composed sequence.  This version does the right thing in all 
cases.

I’m also wondering if we should consider replacing GSString with an extended 
version of GSICUString on platforms where ICU is available?

David

Full test program:

#import <Foundation/Foundation.h>
#import "Source/GSICUString.h"
#include <unicode/ubrk.h>

NSRange rangeOfComposedCharacterSequenceAtIndex(NSString *str, NSUInteger idx)
{
        UText *txt = UTextInitWithNSString(NULL, str);
        UErrorCode e = 0;
        UBreakIterator *it = ubrk_open(UBRK_CHARACTER, NULL, NULL, 0, &e);
        ubrk_setUText(it, txt, &e);
        if (U_FAILURE(e))
        {
                NSLog(@"Error handling goes here");
        }
        ubrk_preceding(it, idx);
        uint32_t origin = ubrk_current(it);
        uint32_t next = ubrk_next(it);
        if (next == idx)
        {
                origin = next;
                next = ubrk_next(it);
        }
        if (next == UBRK_DONE)
        {
                [NSException raise: NSRangeException format:@"Invalid 
location."];
        }
        return NSMakeRange(origin, next - origin);
}

int main(void)
{
        @autoreleasepool {
                uint32_t utf32[] = {0x1F1E7, 0x1F1EA, 0x20, 0x061, 0x0328, 
0x0301};
                NSString *str = [[NSString alloc] initWithBytes: utf32 length: 
sizeof(utf32) encoding: NSUTF32LittleEndianStringEncoding];
                NSLog(@"%@, %ld", str, (unsigned long)[str 
rangeOfComposedCharacterSequenceAtIndex: 0].length);
                str = [str stringByAppendingString: @"foo bar wibble!"];
                for (int i=0 ; i<=[str length] ; i++)
                {
                        NSLog(@"new: %d: %@", i, 
NSStringFromRange(rangeOfComposedCharacterSequenceAtIndex(str, i)));
                        NSLog(@"old: %d: %@", i, NSStringFromRange([str 
rangeOfComposedCharacterSequenceAtIndex: i]));
                }
        }
}




reply via email to

[Prev in Thread] Current Thread [Next in Thread]