From: Eric Blake
Subject: Re: tr broken with accented chars
Date: Fri, 21 Apr 2006 15:57:24 +0000
> I have a problem with tr, version 5.94 :
> I'm using debian with a 100% utf-8 system. It is not an x-term related
> problem (this also occurs in a vt). Quoting the arguments (tr "é" "e")
> does not help.
Thanks for the report. However, upstream coreutils does not yet
support multi-byte characters. The TODO file documents the need
for a nice patch that handles multibyte characters cleanly, while
not penalizing speed of strict single-byte locales; and so far, while
several vendors have provided add-on patches that attempt
this, none of them have been considered clean enough to apply
upstream.
> address@hidden:~$ echo hello | tr o a # no problem here
> hella
Even in utf-8, all these characters are single bytes.
>
> address@hidden:~$ echo hé | tr é e # why do I get 2 'e' ?
> hee
In utf-8, é occupies 2 bytes (0xC3 0xA9), but e occupies one, and
single-byte translation is occurring, so this bit from the info pages
is relevant:
"On the other hand, making SET1 longer than SET2 is not portable;
POSIX says that the result is undefined. In this situation, BSD `tr'
pads SET2 to the length of SET1 by repeating the last character of SET2
as many times as necessary. System V `tr' truncates SET1 to the length
of SET2."
Thus, both utf-8 bytes of é are being translated into the
expanded SET2 of ee.
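To make the byte-level view concrete, here is a minimal Python sketch that models single-byte translation with BSD-style padding of SET2. It is not how tr is implemented; the function name bytewise_tr is made up for illustration, and the sketch assumes SET2 is non-empty.

```python
def bytewise_tr(text: str, set1: str, set2: str) -> bytes:
    """Rough model of a single-byte tr (hypothetical helper, not tr's real code)."""
    s1 = set1.encode("utf-8")
    s2 = set2.encode("utf-8")
    # BSD-style padding: repeat the last byte of SET2 until it is as long as SET1.
    # Assumes SET2 is non-empty.
    if len(s2) < len(s1):
        s2 = s2 + s2[-1:] * (len(s1) - len(s2))
    # Map the i-th byte of SET1 to the i-th byte of SET2; other bytes pass through.
    table = bytes.maketrans(s1, s2[:len(s1)])
    return text.encode("utf-8").translate(table)


print(bytewise_tr("hello", "o", "a"))  # every character is one byte
print(bytewise_tr("hé", "é", "e"))     # both bytes of é are mapped to e
```

In this model, SET1 for é is the two bytes 0xC3 0xA9 and SET2 becomes ee after padding, which is why two e characters come out.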
>
> address@hidden:~$ echo hé | tr à a # here tr should do nothing...
> ha©
>
Again, é (0xC3 0xA9) and à (0xC3 0xA0) are multibyte, and share a
common lead byte (0xC3), so with single-byte translation, the common
byte is translated to a, and the remaining byte (0xA9) is passed
through unchanged but now forms an invalid utf-8 sequence.
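The shared byte can be checked with a short Python sketch. This only models the byte-wise behavior described above; the variable names are illustrative, and the padding of SET2 to aa is the BSD-style behavior quoted from the info pages.

```python
e_acute = "é".encode("utf-8")   # two bytes: 0xC3 0xA9
a_grave = "à".encode("utf-8")   # two bytes: 0xC3 0xA0

# The bytes the two characters have in common (only the lead byte 0xC3).
shared = set(e_acute) & set(a_grave)

# Model `tr à a`: SET2 padded to "aa", so 0xC3 -> a and 0xA0 -> a.
table = bytes.maketrans(a_grave, b"a" * len(a_grave))
out = "hé".encode("utf-8").translate(table)

# The leftover 0xA9 is a continuation byte with no lead byte before it,
# so the result no longer decodes as UTF-8.
try:
    out.decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False

print(shared, out, valid)
```

The stray 0xA9 byte is what a Latin-1 display would render as ©, even though the output is not valid utf-8.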
--
Eric Blake