bug#22155: Wrong char count with UTF8 in sort -k

bug-coreutils

From:	Holger Klene
Subject:	bug#22155: Wrong char count with UTF8 in sort -k
Date:	Sat, 12 Dec 2015 23:53:40 +0100
User-agent:	KMail/4.14.6 (Linux/3.19.0-39-generic; KDE/4.14.6; x86_64; ; )

Hello!

Given a text file "sort.bug.txt" containing find output like this:

07. Feb 2015 15:57 ./mess.jpg

05. Mär 2015 13:30 ./mess.jpg

Basically two columns: a date and a filename.

I want sort to discard the duplicate lines for the same file, using -u to keep only the first line and -k to skip over the date column:

> sort sort.bug.txt -u -s -k 1.20 --debug

sort: using ‘de_DE.UTF-8’ sorting rules

sort: leading blanks are significant in key 1; consider also specifying 'b'

05. Mär 2015 13:30 ./mess.jpg

___________

07. Feb 2015 15:57 ./mess.jpg

__________

As the underlines in debug mode show, the key's start position depends on whether the month name is pure ASCII or contains the German umlaut ä.
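
The one-character shift is what you would expect if the offset in -k 1.20 were counted in bytes rather than characters, since ä occupies two bytes in UTF-8. That is an assumption, not something confirmed here. A minimal sketch with the two sample lines, using cut -b purely as a stand-in for byte-based counting (it is not how sort is implemented):

```shell
# The two sample lines from the report. The ä is written as its UTF-8
# bytes (octal 303 244) so the script does not depend on file encoding.
feb='07. Feb 2015 15:57 ./mess.jpg'
mar=$(printf '05. M\303\244r 2015 13:30 ./mess.jpg')

# cut -b selects by byte; here it stands in for a byte-based -k 1.20.
feb_key=$(printf '%s\n' "$feb" | cut -b 20-)
mar_key=$(printf '%s\n' "$mar" | cut -b 20-)

# Brackets make a leading blank visible.
printf '[%s]\n' "$feb_key"
printf '[%s]\n' "$mar_key"
```

This prints [./mess.jpg] for the Feb line and [ ./mess.jpg] for the Mär line: byte 20 of the Mär line is the blank before the filename, which matches the longer underline in the --debug output.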

sort also prints a hint to add option -b. That might work around the one-character offset, thanks to the whitespace separating the columns.

> sort sort.bug.txt -u -s -k 1.20 -b --debug

sort: using ‘de_DE.UTF-8’ sorting rules

05. Mär 2015 13:30 ./mess.jpg

__________

07. Feb 2015 15:57 ./mess.jpg

__________

In fact, this does correct the underlines, but -u still outputs both lines, although I want it to discard the second one. You can add more lines for the same file, but sort insists on keeping exactly two: one with the umlaut and one without.
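
One way to sidestep character offsets altogether might be to key on a field rather than a position. A sketch, assuming the date always consists of exactly four blank-separated fields so that the filename starts at field 5 (an assumption about the find output, not something stated above). LC_ALL=C is set only to make the example reproducible:

```shell
# Sample input in the format shown above (ä given as UTF-8 octal bytes).
input=$(printf '07. Feb 2015 15:57 ./mess.jpg\n05. M\303\244r 2015 13:30 ./mess.jpg\n')

# -k 5 keys from field 5 to the end of the line; -b skips the blanks
# that precede the field; -s and -u keep one line per distinct key.
result=$(printf '%s\n' "$input" | LC_ALL=C sort -s -u -b -k 5)

printf '%s\n' "$result"
```

With the two sample lines this prints a single line. Whether sort 8.23 behaves the same under de_DE.UTF-8 was not tested here.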

This is: sort (GNU coreutils) 8.23

Thanks for the great utilities.

Holger

|_|/ MfG

| |\ Holger Klene

PGP Key ID: 0x22FFE57E
