tr not respecting UTF-8 locale ?
From: Michał Kosmulski
Subject: tr not respecting UTF-8 locale ?
Date: Mon, 11 Oct 2004 15:41:44 +0200
User-agent: Mozilla Thunderbird 0.8 (X11/20040913)
Hello,
I am using a UTF-8 locale, and all of the coreutils except tr seem to
respect it. tr, however, behaves differently: it seems to always assume
that 1 byte == 1 character, even in a UTF-8 locale. Consider this:
address@hidden:~$ locale
LANG=en_US.UTF-8
LC_CTYPE=pl_PL.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=pl_PL.UTF-8
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=pl_PL.UTF-8
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT=pl_PL.UTF-8
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
address@hidden:~$ tr äöü aou
xxäyyözzütt
xxuoyyuuzzuutt
address@hidden:~$ tr ä ab
xäy
xaby
IMO, this clearly indicates that tr treats each byte as a separate
character. In the case of "tr äöü aou", all three umlauts share the
same first byte (0xC3 in UTF-8), so only the last substitution for that
byte takes effect. That is why ä becomes uo rather than ao.
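The byte-wise behaviour described above can be sketched in Python. This is
only an illustration, not tr's actual code: bytewise_tr is a hypothetical
helper, and it assumes that the second set is padded by repeating its last
character when it is shorter than the first (as GNU tr does) and that a
later mapping for the same byte overrides an earlier one.

```python
def bytewise_tr(set1: str, set2: str, text: str) -> str:
    """Emulate a tr that treats each byte of its UTF-8 arguments as a character.

    Hypothetical helper for illustration only; not part of coreutils.
    """
    src = set1.encode("utf-8")
    dst = set2.encode("utf-8")

    # Assumption: a shorter set2 is padded by repeating its last byte.
    if len(dst) < len(src):
        dst += dst[-1:] * (len(src) - len(dst))

    # Build the byte -> byte table; a repeated source byte keeps the
    # last mapping assigned to it.
    table = {}
    for s, d in zip(src, dst):
        table[s] = d

    out = bytes(table.get(b, b) for b in text.encode("utf-8"))
    return out.decode("utf-8", errors="replace")


print(bytewise_tr("äöü", "aou", "xxäyyözzütt"))  # prints xxuoyyuuzzuutt
print(bytewise_tr("ä", "ab", "xäy"))             # prints xaby
```

Both calls reproduce the output of the session above, which is consistent
with tr mapping the bytes 0xC3, 0xA4, 0xB6 and 0xBC individually.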
Michal Kosmulski
--
Michal Kosmulski
http://hektor.umcs.lublin.pl/~mikosmul/