bug-coreutils

bug#8598: Bug in uniq?


From: Eric Blake
Subject: bug#8598: Bug in uniq?
Date: Sat, 30 Apr 2011 11:56:55 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.15) Gecko/20110307 Fedora/3.1.9-0.39.b3pre.fc14 Lightning/1.0b3pre Mnenhy/0.8.3 Thunderbird/3.1.9

On 04/30/2011 06:03 AM, emijrp wrote:
> Hi all;
> 
> I'm not sure if this is a bug.

Most likely not a bug, but a function of your locale.

> 
> If I download this file[1], unzip and do:
> 
> grep "<title>" wikiindexorg-20110409-history.xml | sort | uniq -D
> 
> It shows:
> 
>     <title>Felix Pleşoianu Wiki</title>
>     <title>Felix Pleșoianu Wiki</title>

Identical.  (How, you ask? Read on.)

>     <title>ᐧᐃᑭᐱᑎᔭ</title>
>     <title>위키낱말사전</title>

Identical.

>     <title>ウィクショナリー</title>
>     <title>언사이클로피디어</title>

Identical.

>     <title>ไทย Wikipedia</title>
>     <title>한국어 Wikipedia</title>

Identical.

> 
> But obviously, they are all different lines. Why?

That depends on your locale.  In the C locale, all of those lines are
distinct except for the first two.  In other locales, the comparison
goes through strcoll(), whose result depends on the collation rules of
your current locale.  If that locale punts and collates all non-ASCII
characters as the same collation symbol, then those lines compare
equal.
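
A minimal sketch for checking this without downloading the file,
assuming GNU sort and uniq (-D is a GNU extension).  The helper names
titles and dups_in_locale are made up for illustration; the two titles
are copied from your output.

```shell
# Two of the reported titles; they are made of entirely different bytes.
titles() {
    printf '%s\n' '    <title>ᐧᐃᑭᐱᑎᔭ</title>' \
                  '    <title>위키낱말사전</title>'
}

# Hypothetical helper: sort stdin and list repeated lines, with both
# tools running under the locale named in $1.
dups_in_locale() {
    LC_ALL="$1" sort | LC_ALL="$1" uniq -D
}

# Byte-wise comparison in the C locale: prints nothing, because the
# two lines are not byte-for-byte identical.
titles | dups_in_locale C
```

Running "titles | dups_in_locale en_US.UTF-8" on a system where that
locale is installed shows whether its collation data treats the two
lines as equal; the answer depends on the libc version.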

I was able to reproduce your results with the en_US.UTF-8 locale that
ships with Fedora 14.  To see the difference, try again with:

$ grep "<title>" wikiindexorg-20110409-history.xml | sort \
    | LC_ALL=C uniq --all-repeated=separate
$ grep "<title>" wikiindexorg-20110409-history.xml | sort \
    | LC_ALL=en_US.UTF-8 uniq --all-repeated=separate

    <title>Felix Pleşoianu Wiki</title>
    <title>Felix Pleșoianu Wiki</title>

    <title>ᐧᐃᑭᐱᑎᔭ</title>
    <title>위키낱말사전</title>

    <title>ウィクショナリー</title>
    <title>언사이클로피디어</title>

    <title>ไทย Wikipedia</title>
    <title>한국어 Wikipedia</title>

This is because that particular locale does not define distinct
collation weights for these non-English characters, so strcoll()
cannot tell them apart.
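
In case the long option is unfamiliar: --all-repeated=separate is the
GNU uniq form of -D that also puts a blank line between groups of
repeated lines, which is why the pairs above are separated.  A small
sketch with plain ASCII input, so that the result does not depend on
collation data (show_groups is a made-up name):

```shell
# Hypothetical helper: feed five sorted lines to uniq in the C locale.
# "a" and "c" are repeated, "b" is not.
show_groups() {
    printf '%s\n' a a b c c | LC_ALL=C uniq --all-repeated=separate
}

show_groups
```

With GNU uniq this prints the two "a" lines, a blank line, then the
two "c" lines; the unrepeated "b" is dropped.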

-- 
Eric Blake   address@hidden    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
