bug-gnu-utils

Re: Bug report


From: Bob Proulx
Subject: Re: Bug report
Date: Sat, 3 Dec 2005 12:31:19 -0700
User-agent: Mutt/1.5.9i

Aharon Robbins wrote:
> I think this has become the #1 gawk FAQ.

It has been a very frequent FAQ for all of the utilities.  I have been
answering this one in coreutils for years.

> The answer is that it's not a bug. It has to do with the locale you're
> using.  If you set LC_ALL=C in your environment, gawk's behavior will
> be what you expect.

However, setting LC_ALL overrides *all* of the locale settings.
Unfortunately this has undesirable side effects, such as disabling the
UTF-8 character handling that most environments enable by default.
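
As a rough illustration of that override, here is a small sketch using
the POSIX `locale` utility.  It assumes `locale` is installed; the
variable name ctype_line is only for illustration.  Even though LANG
asks for UTF-8, LC_ALL=C wins for every category, including the
character type.

```shell
# LC_ALL takes precedence over LANG and over each individual LC_* variable.
# Ask for a UTF-8 locale via LANG, then override everything with LC_ALL=C
# and look at what the character-type category ends up as.
ctype_line=$(LANG=en_US.UTF-8 LC_ALL=C locale | grep '^LC_CTYPE=')
printf '%s\n' "$ctype_line"
```

The reported LC_CTYPE is "C" and not en_US.UTF-8, which is the loss of
UTF-8 handling described above.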

Personally I use the following settings in my environment.  I want
UTF-8 but with a standard sort ordering for US-ASCII characters.

  export LANG=en_US.UTF-8
  export LC_COLLATE=C
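
A quick way to check the effect, as a sketch only: the sample data and
the variable names data and c_order are made up for illustration, and
it assumes LC_ALL is not otherwise needed in your session, since a set
LC_ALL would override LC_COLLATE.

```shell
# Sample data mixing upper and lower case.
data=$(printf 'b\nA\na\nB\n')

# LC_ALL would override LC_COLLATE, so make sure it is not set.
unset LC_ALL
export LANG=en_US.UTF-8
export LC_COLLATE=C

# With LC_COLLATE=C, sort compares by byte value, so all of the
# uppercase ASCII letters come before all of the lowercase ones.
c_order=$(printf '%s\n' "$data" | sort)
printf '%s\n' "$c_order"
# prints:
# A
# B
# a
# b
```

Without the LC_COLLATE=C line, an en_US.UTF-8 collation (where that
locale is installed) would typically interleave the upper and lower
case letters instead.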

<rant>

Personally I believe that the locale tables are broken.  By design
they fold case and ignore whitespace and punctuation.  This is
commonly called "dictionary" sort ordering.  It is not what people
sorting data would expect, and sorting data is what people working on
computers normally do.  Having this behavior be the default without
the user's knowledge leads to confusion and problems.

But this ordering is specified by international standards.  It was
apparently considered the desirable behavior by people thinking about
natural languages and not computer languages.  In natural languages
sorting is done in dictionary sort ordering.  If you were preparing a
dictionary then you would want this sort ordering.

This means that the sort ordering is behaving exactly as it was
designed to behave when that locale is selected.  It was explicitly
implemented with this behavior.  Therefore it is not a bug.  The only
way to change this is to select a different locale setting which
behaves differently.  Unfortunately no single locale exists which
provides both UTF-8 and data-style sort ordering, leaving users
working with data on a computer without a reasonable configuration
choice.

</rant>

Bob



