bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: uniq i18n implementation


From: Paul Eggert
Subject: Re: uniq i18n implementation
Date: Thu, 10 Aug 2006 10:38:47 -0700
User-agent: Gnus/5.1008 (Gnus v5.10.8) Emacs/21.4 (gnu/linux)

>>>Using strcoll is inefficient anyway
>> 
>> Don't we know it!  If we can avoid it, we'd like to.
>
> Well, the mbstowcs+wcscoll solution I presented
> should be equivalent to strcoll on any platform,
> and it's much faster in my tests.

That's good to know, though I'm puzzled as to why it's true.  For a
single comparison, can't strcoll typically return an answer without
examining all the input, and wouldn't that be faster than
mbstowc+wcscoll?

But if it is true, perhaps we should rewrite memcoll to use the
mbstowc+wcscoll combination as well.

>>>but it probably is possible in ICU?
>> 
>> Sorry, don't know.
>
> I wonder could we add this as a dependency?

You mean, ship ICU code?  Or depend on it already being installed?

Sorry, I'm not familiar with the ICU code.  Is it free software and is
it well maintained?  Where else is it being used, outside ICU itself?

> Also I don't agree with splitting entities into
> valid multibyte ranges and "C" for the rest.
> That is probably not what the user wants the data interpreted as,
> and I think (at least for uniq which I've thought about),
> that it's just best to treat the whole entity as "C"
> if there are invalid multibyte sequences in the entity.

We can't adopt this approach in general, since it would mean that our
comparison operation could return inconsistent answers.  Suppose "Y"
has an invalid byte sequence but "X" and "Z" are valid.  Then we might
have "X" < "Y" < "Z" (using C-locale comparison), but "Z" < "X" (using
some other locale's comparison).  This will lead to inconsistencies,
which will be hard to document and will confuse users.  Worse, it can
even lead to buffer overruns: e.g., qsort has undefined behavior if
you pass it a comparison function that is not a total order.

In contrast, the approach that I suggested, though perhaps slower, is
consistent: it always results in a total order.  Perhaps it could be
improved on, though -- I haven't thought about performance as much as
you have.




reply via email to

[Prev in Thread] Current Thread [Next in Thread]