bug-libunistring

Re: [bug-libunistring] _wordbreaks/_grapheme_breaks and break count?


From: Andrew Boling
Subject: Re: [bug-libunistring] _wordbreaks/_grapheme_breaks and break count?
Date: Tue, 2 Sep 2014 15:18:14 -0400

> I wrote the grapheme break functions.  It didn't occur to me that it would be
> useful to return anything, because usually the breakpoints are scanned to
> find good places to break, and usually those are pretty common.

It's probably not a common use case (otherwise someone would have said the same thing about the _wordbreaks series already), but I'll elaborate a little bit to help demonstrate an applicable scenario.

The strings my functions operate on are arrays in memory with associated link counts. The original code used random access to perform string manipulation, but that's not a valid approach once the number of bytes no longer matches the number of characters (any non-ASCII text). The new approach I'm using is to pre-generate the grapheme breaks when the string is instantiated (u8_grapheme_breaks). This way the break positions are only calculated once across the life of that string. Knowing the grapheme count is beneficial here because a split at an out-of-range position can be rejected immediately, without an additional scan of the break array.
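
To make the second pass concrete, here is a rough sketch of what I end up writing today. It assumes a break array of the kind u8_grapheme_breaks fills in (one char per unit, nonzero where a grapheme cluster starts, which is how I read the documentation). The names count_breaks and cluster_offset are made up for this mail; they are not part of libunistring.

```c
#include <stddef.h>

/* Count the marked positions in a break array: p[i] is nonzero
   when a grapheme cluster starts at unit i.  This is the extra
   pass that a returned count would make unnecessary. */
static size_t count_breaks(const char *p, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (p[i])
            count++;
    return count;
}

/* Return the unit offset at which cluster number k (0-based)
   starts, or n if k is out of range.  With the cluster count
   cached, an out-of-range k is rejected before any scanning. */
static size_t cluster_offset(const char *p, size_t n,
                             size_t n_clusters, size_t k)
{
    if (k >= n || k >= n_clusters)
        return n;               /* rejected up front */
    size_t seen = 0;
    for (size_t i = 0; i < n; i++)
        if (p[i] && seen++ == k)
            return i;
    return n;                   /* not reached if n_clusters is accurate */
}
```

For the four units of "e" + U+0301 + "x" the break array would be {1, 0, 0, 1}, so the count is 2 and cluster 1 starts at offset 3.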

If the string is modified, that instantiates a completely new string and reduces the link count of the string that was operated on by one (potentially freeing the old string and its associated grapheme breaks array).
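
Roughly, the lifecycle looks like the sketch below. This is a simplified illustration, not my actual code: struct gstr, gstr_new and gstr_unref are hypothetical names, and the break array is passed in by the caller here so the example stands alone (in the real thing it would come from u8_grapheme_breaks at instantiation).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical link-counted string that owns its units and a
   break array computed once, when the string is created. */
struct gstr {
    size_t refs;        /* link count */
    size_t n_units;     /* length in UTF-8 units */
    size_t n_clusters;  /* cached grapheme cluster count */
    uint8_t *units;
    char *breaks;       /* nonzero where a cluster starts */
};

static struct gstr *gstr_new(const uint8_t *units, const char *breaks,
                             size_t n)
{
    struct gstr *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    /* Allocate at least one byte so an empty string is not
       mistaken for an allocation failure. */
    s->units = malloc(n ? n : 1);
    s->breaks = malloc(n ? n : 1);
    if (s->units == NULL || s->breaks == NULL) {
        free(s->units);
        free(s->breaks);
        free(s);
        return NULL;
    }
    if (n > 0) {
        memcpy(s->units, units, n);
        memcpy(s->breaks, breaks, n);
    }
    s->refs = 1;
    s->n_units = n;
    s->n_clusters = 0;
    for (size_t i = 0; i < n; i++)   /* the pass a return value would save */
        if (breaks[i])
            s->n_clusters++;
    return s;
}

/* Drop one link.  Frees the string together with its break array
   when the last link goes away, and returns NULL in that case. */
static struct gstr *gstr_unref(struct gstr *s)
{
    if (s == NULL || --s->refs > 0)
        return s;
    free(s->units);
    free(s->breaks);
    free(s);
    return NULL;
}
```

A modifying operation would build its result with gstr_new and then call gstr_unref on the string it operated on.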



On Tue, Sep 2, 2014 at 2:09 PM, Ben Pfaff <address@hidden> wrote:
On Mon, Sep 1, 2014 at 2:24 PM, Andrew Boling <address@hidden> wrote:
> The _wordbreaks and _grapheme_breaks functions, while useful, currently
> return void instead of the number of breaks written to the output array. Is
> there a reason why it would be inappropriate to return the number of breaks
> (or number of clusters) in this context? I'm not opposed to scanning the
> result buffer to determine this information, but the second pass strikes me
> as unnecessary.

I wrote the grapheme break functions.  It didn't occur to me that it would be
useful to return anything, because usually the breakpoints are scanned to
find good places to break, and usually those are pretty common.

> In my particular case I need to split strings at grapheme boundaries based
> on user supplied integers, and it would make sense to skip the operation
> entirely if (n >= array_units || n >= grapheme_clusters).

I guess that if this is a common need (I do not really understand your
application) then returning the number of breaks would make sense.

