bug#22155: Wrong char count with UTF8 in sort -k

bug-coreutils



From:	Pádraig Brady
Subject:	bug#22155: Wrong char count with UTF8 in sort -k
Date:	Sun, 13 Dec 2015 01:32:47 +0000
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0

On 12/12/15 22:53, Holger Klene wrote:
> Hello!
>
> Given a text file "sort.bug.txt" with find output like this:
> 
> 07. Feb 2015 15:57 ./mess.jpg
> 05. Mär 2015 13:30 ./mess.jpg
>
> Basically two columns: a date and a filename
> 
> I want sort to discard the duplicate lines for the same file, using -u to keep
> only the first and -k to skip over the date column.
> 
>> sort sort.bug.txt -u -s -k 1.20 --debug

Note that -s is implied by -u.
Ideally the above should just work, and it does
on Fedora/RHEL/SUSE with the i18n patch applied.
Details on that patch are at
http://www.pixelbeat.org/docs/coreutils_i18n/

> sort: using ‘de_DE.UTF-8’ sorting rules
> sort: leading blanks are significant in key 1; consider also specifying 'b'
> 05. Mär 2015 13:30 ./mess.jpg
>                   ___________
> 07. Feb 2015 15:57 ./mess.jpg
>                    __________
> 
> As the underlines in debug mode show, the key's start position depends on
> whether the month name is pure ASCII or contains the German umlaut ä.
> 
> The hint suggests applying option -b, which might overcome this
> one-character offset thanks to the whitespace separating the columns.
> 
>> sort sort.bug.txt -u -s -k 1.20 -b --debug
> 
> sort: using ‘de_DE.UTF-8’ sorting rules
> 05. Mär 2015 13:30 ./mess.jpg
>                    __________
> 07. Feb 2015 15:57 ./mess.jpg
>                    __________
> 
> In fact, it does correct the underlines, but still -u gives both lines, 
> though I want it to discard the second line. You can add more lines for the 
> same file, but sort insists on keeping exactly two: one with Umlaut and the 
> other without.

That's a bug in --debug: its implementation is separate from the
actual processing done during the sort (for performance reasons),
so the two can disagree. We'll need to fix --debug to show
what is actually being done, which is...

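-b is applied _before_ the -k offsets are determined:
blanks at the start of the field are skipped first, and only then
is the character offset counted. Field 1 has no leading blanks,
so -b is ineffective in your case.
That is confirmed with: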

$ ltrace -e strcoll sort sort.bug.txt -u -k 1.20b
sort->strcoll("./mess.jpg", " ./mess.jpg") = 15
05. Mär 2015 13:30 ./mess.jpg
sort->strcoll("./mess.jpg", " ./mess.jpg") = 15
07. Feb 2015 15:57 ./mess.jpg
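
To see why the two keys differ, it can help to count bytes by hand.
The following is only a minimal sketch, assuming a sort without the
i18n patch that counts -k character offsets in bytes (which the
strcoll arguments above suggest). key_from_byte is a hypothetical
helper name, and cut -b stands in for sort's offset handling rather
than reproducing it. The ä is written as the octal escape \303\244
so the example does not depend on the encoding of the script itself.

```shell
#!/bin/sh
# Hypothetical helper: print the part of a line starting at byte N (1-based).
# cut -b counts bytes, which mimics a byte-based "-k 1.N" start position.
key_from_byte() {
    printf '%s\n' "$2" | cut -b "$1"-
}

# The two sample lines from the report; \303\244 is "ä" in UTF-8 (two bytes).
feb_line=$(printf '07. Feb 2015 15:57 ./mess.jpg')
maer_line=$(printf '05. M\303\244r 2015 13:30 ./mess.jpg')

# Byte 20 is the "." in the ASCII-only line, but the preceding space
# in the line containing the two-byte "ä". Brackets make the blank visible.
printf '[%s]\n' "$(key_from_byte 20 "$feb_line")"
printf '[%s]\n' "$(key_from_byte 20 "$maer_line")"
```

The second key keeps its leading space, so the two keys compare as
different and -u retains both lines.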

Perhaps it would be better in your case to operate
directly on the fifth field?

$ sort sort.bug.txt -u -k5b,5 --debug
sort: using ‘en_IE.utf8’ sorting rules
07. Feb 2015 15:57 ./mess.jpg
                   __________
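
The field-based key can be tried without the original file. This is
a rough sketch, assuming the two sample lines from the report and
forcing LC_ALL=C so the result does not depend on which locales are
installed. dedupe_by_field5 and sample are hypothetical names used
only for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: keep one line per distinct fifth field.
# -k5b,5 restricts the key to field 5 and ignores its leading blanks;
# -u keeps a single line from each run of equal keys.
dedupe_by_field5() {
    LC_ALL=C sort -u -k5b,5
}

# Sample input from the report; \303\244 is "ä" in UTF-8.
sample() {
    printf '07. Feb 2015 15:57 ./mess.jpg\n'
    printf '05. M\303\244r 2015 13:30 ./mess.jpg\n'
}

sample | dedupe_by_field5
```

Because fields are delimited by blanks rather than counted from the
start of the line, the key does not shift when an earlier field
contains multi-byte characters.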

thanks,
Pádraig


