bug-coreutils

Re: coreutils and i18n


From: Pádraig Brady
Subject: Re: coreutils and i18n
Date: Mon, 21 Apr 2008 12:53:54 +0100
User-agent: Thunderbird 2.0.0.6 (X11/20071008)

Bruno Haible wrote:
> Jim Meyering wrote:
>>>   - Processing in unibyte locales should not become significantly slower
>>>     than before.
>>>   - Code duplication should be avoided, for maintainability.
>>>   - Macros which expand to one thing in the multibyte case and to another
>>>     thing for the unibyte case are not acceptable.
>>>
>>> How will this students' project solve this dilemma?
>> There's no guarantee, but Paul and I will be supervising.
> 
> I mean, what is technically the solution to the dilemma? The typical idiom
> for keeping the speed of the unibyte case is - see e.g.
> gnulib/lib/mbscasecmp.c as an example -
> 
>   #if HAVE_MBRTOWC
>     if (MB_CUR_MAX > 1)
>       ... multibyte case ...
>     else
>   #endif
>       ... unibyte case ...
> 
> but it does have code duplication.

That's the obvious solution, but it is not really required or desired.

If I were being paid to do it (unfortunately I have very little free time),
then I would do something like this:

1. Identify filters that require multibyte handling.
2. Refactor line input processing etc. into shared code.
3. Intelligently apply multibyte processing.
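
As a rough illustration of those three steps (a sketch only, written in Python
for brevity rather than the C that coreutils uses), the idea is to decide once,
up front, whether character-aware processing is needed, and otherwise stay on
a plain byte comparison. The names `is_multibyte`, `make_key`, `read_lines`
and `uniq`, and the options `ignore_case` and `skip_chars`, are hypothetical
and chosen for this example; they are not taken from coreutils or the i18n
patch.

```python
def is_multibyte(encoding):
    """Rough check (assumption): does a non-ASCII character need >1 byte?"""
    try:
        return len("\u00e9".encode(encoding)) > 1
    except UnicodeEncodeError:
        return False  # e.g. ASCII cannot encode it; treat as unibyte


def read_lines(stream):
    """Step 2: shared line input. Yields raw bytes without the newline."""
    for raw in stream:
        yield raw[:-1] if raw.endswith(b"\n") else raw


def make_key(encoding, ignore_case=False, skip_chars=0):
    """Steps 1 and 3: pick the comparison key once, not per line."""
    needs_chars = ignore_case or skip_chars > 0
    if not (needs_chars and is_multibyte(encoding)):
        # Fast path: compare bytes directly. Note that bytes.lower() only
        # folds ASCII letters; real unibyte code would use the locale.
        def key(line):
            line = line[skip_chars:]
            return line.lower() if ignore_case else line
        return key

    # Slow path: only taken when an option needs character semantics.
    def key(line):
        text = line.decode(encoding, errors="surrogateescape")[skip_chars:]
        return text.casefold() if ignore_case else text
    return key


def uniq(lines, key):
    """Yield each line whose key differs from the previous line's key."""
    previous = object()  # sentinel that equals no real key
    for line in lines:
        k = key(line)
        if k != previous:
            yield line
            previous = k
```

With this shape, a plain `uniq` in a UTF-8 locale never decodes anything, since
equal lines are equal byte strings; only an option such as case folding or
skipping N characters would pay for the decode.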

For illustration, look at the current performance of various `uniq`
implementations:

$ rpm -q coreutils
coreutils-6.9-9.fc8

$ echo $LANG
en_IE.UTF-8

# The default one uses the existing i18n patch
$ time uniq < lines.test > /dev/null
real    0m27.724s

$ time LC_CTYPE=C uniq < lines.test > /dev/null
real    0m1.314s

$ time ~/git/coreutils/src/uniq < lines.test > /dev/null
real    0m1.187s

$ time ~/myuniq < lines.test > /dev/null
real    0m0.827s

$ time ~/uniq.py < lines.test > /dev/null
real    0m2.657s

Yes, the Python version (which I nearly wrote in the same time
as the default uniq took to complete the test) is much better!

`myuniq` is a version I implemented from scratch
to understand some of the issues involved:
http://lists.gnu.org/archive/html/bug-coreutils/2006-07/msg00153.html

It's not just performance. The i18n patch for uniq is also functionally
buggy, for example in the presence of NUL characters:

$ for i in 1 2 3; do echo -e "1234\x0056789"; done | uniq
123456789
123456789
123456789

$ for i in 1 2 3; do echo -e "1234\x0056789"; done | LANG=C uniq
123456789
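
For comparison, here is a minimal sketch (in Python, and not the `uniq.py`
timed above) of a byte-oriented adjacent-duplicate filter. It says nothing
about why the i18n patch misbehaves; it only shows that comparing whole byte
strings, with their full lengths, collapses the three NUL-containing lines to
one, matching the `LANG=C` result. The function name `uniq_bytes` is made up
for this example.

```python
import io
from itertools import groupby


def uniq_bytes(stream):
    """Collapse adjacent duplicate lines, comparing complete byte strings."""
    # groupby groups consecutive equal items; NUL is just another byte here.
    return [line for line, _group in groupby(stream)]


# Same input as the shell loop above: three identical lines with an embedded NUL.
data = io.BytesIO(b"1234\x0056789\n" * 3)
result = uniq_bytes(data)
print(result)  # [b'1234\x0056789\n']
```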

It's great that Paul & Jim are looking at this interesting project,
as it really is important, as I've mentioned before.

cheers,
Pádraig.
