bug-coreutils

Re: sort --ignore-case option changes underscore sort position


From: Bob Proulx
Subject: Re: sort --ignore-case option changes underscore sort position
Date: Thu, 21 Aug 2008 23:27:13 -0600
User-agent: Mutt/1.5.13 (2006-08-11)

I am CC'ing address@hidden because that is actually the home
mailing list for the sort command.  Followups should go there and I
have set Mail-Followup-To: appropriately.

jrw32982 wrote:
> I couldn't find this previously reported.  I didn't see anything in
> the documentation which indicates that this is expected behavior.

Thank you for the report.  However, even though this is perhaps
surprising, I think it does count as expected behavior.

> To replicate the bug:

Thank you for that very nice test case!  That was excellent.

> $ sort --version
> sort (coreutils) 5.2.1
> ...
> $ export LC_ALL=C
> $ { echo a_; echo ax; } | sort
> a_
> ax
> $ { echo a_; echo ax; } | sort --ignore-case
> ax
> a_

The documentation for --ignore-case explains what is happening.
In the man page for sort:

       -f, --ignore-case
              fold lower case to upper case characters

And of course the info manual has the full, authoritative
documentation.

  `-f'
  `--ignore-case'
       Fold lowercase characters into the equivalent uppercase characters
       when comparing so that, for example, `b' and `B' sort as equal.
       The `LC_CTYPE' locale determines character types.

Therefore your test case:

  { echo a_; echo ax; } | sort --ignore-case

is really the same as sorting the uppercased input.  Compare:

  $ { echo a_; echo ax; } | sort
  a_
  ax

  $ { echo A_; echo AX; } | sort
  AX
  A_

  $ { echo A_; echo AX; } | sort --ignore-case
  AX
  A_

With upper case input you can see that the result is the same with or
without the --ignore-case option.  Perhaps the option would have been
more accurately named --convert-to-upper-case-before-sorting.
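
To make the folding concrete, here is a small sketch.  The helper
name fold_then_sort is made up for illustration and is not part of
coreutils.  It uppercases its input with tr and then does a plain
sort in the C locale, which for this test case should give the same
line order as sort -f (the short form of --ignore-case).

```shell
# Hypothetical helper (not part of coreutils): uppercase stdin, then
# do a plain byte-order sort in the C locale.
fold_then_sort() {
    LC_ALL=C tr '[:lower:]' '[:upper:]' | LC_ALL=C sort
}

# Both pipelines put the "x" line before the "_" line.
printf 'a_\nax\n' | LC_ALL=C sort -f    # prints ax, then a_
printf 'a_\nax\n' | fold_then_sort      # prints AX, then A_
```

The only difference is that sort -f folds for comparison only and
prints the original lines, while the helper prints the folded text.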

The surprising part might be realizing that underscore collates
between the upper and lower case letters when using the C/POSIX
standard sort ordering.  That is the standard legacy behavior.  The
same is true of [ \ ] ^ and `, which together with _ occupy codes 91
through 96, between Z and a in the US-ASCII code table.  To ignore
these characters look at the --dictionary-order option.

  `-d'
  `--dictionary-order'
       Sort in "phone directory" order: ignore all characters except
       letters, digits and blanks when sorting.  The `LC_CTYPE' locale
       determines character types.
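
Both points can be checked from the shell.  This is only a sketch:
the helper name codepoint is made up for illustration, and it relies
on the POSIX printf rule that an argument starting with a single
quote is converted to the numeric value of the character after it.

```shell
# Hypothetical helper: print the numeric code of a single character.
codepoint() {
    printf '%d\n' "'$1"
}

codepoint Z    # 90
codepoint _    # 95
codepoint a    # 97

# With -d the underscore is ignored, so "a_" compares as "a" and
# sorts before "ax" even when -f is also given.
printf 'ax\na_\n' | LC_ALL=C sort -d -f    # prints a_, then ax
```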

And of course an alternative sort ordering is provided by, for
example, the en_US.UTF-8 locale, which orders in what amounts to
dictionary ordering.
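
To compare locales side by side, something like the following sketch
may help.  The helper name sort_in_locale is hypothetical.  I am
assuming en_US.UTF-8 may or may not be installed on a given system,
so that part is guarded, and I make no claim here about the exact
order it prints.

```shell
# Hypothetical helper: sort stdin under the locale named in $1.
sort_in_locale() {
    LC_ALL="$1" sort
}

# C locale: plain byte order, so this prints AX, A_, a_, ax.
printf 'ax\na_\nAX\nA_\n' | sort_in_locale C

# Only try en_US.UTF-8 if the system lists it; the spelling in
# "locale -a" output varies (for example en_US.utf8).
if locale -a 2>/dev/null | grep -qi '^en_US\.utf-\{0,1\}8$'; then
    printf 'ax\na_\nAX\nA_\n' | sort_in_locale en_US.UTF-8
fi
```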

Bob



