bug#20751: wc -m doesn't count UTF-8 characters properly
From: Stephane Chazelas
Subject: bug#20751: wc -m doesn't count UTF-8 characters properly
Date: Sun, 7 Jun 2015 22:47:29 +0100
User-agent: Mutt/1.5.21 (2010-09-15)
2015-06-06 21:49:16 +0300, Valdis Vītoliņš:
> Note, that UTF-8 characters can be counted by counting bytes with bit
> patterns 0xxxxxxx or 11xxxxxx:
> https://en.wikipedia.org/wiki/UTF-8#Description
>
> So, general logic should be, that, if:
> a) locale setting is utf-8 (e.g. LANG=xx_XX.UTF-8), or
> b) first two bytes of file are 0xFE 0xFF
> https://en.wikipedia.org/wiki/Byte_order_mark
>
> then count bytes with bits 0xxxxxxx and 11xxxxxx.
[...]
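The byte-pattern rule quoted above can be sketched as follows (a minimal Python sketch; the function name is mine, not anything from coreutils):

```python
def naive_count(data: bytes) -> int:
    # Count every byte that is NOT a UTF-8 continuation byte
    # (10xxxxxx), i.e. bytes matching 0xxxxxxx or 11xxxxxx.
    return sum(1 for b in data if (b & 0xC0) != 0x80)

# "é" is two bytes (0xC3 0xA9) but one character.
print(naive_count("é".encode("utf-8")))
```

Note that this rule happily "counts" a lone lead byte such as 0xC0 as a character, which is exactly the objection raised below.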
Except that only valid characters should be counted, and the
definition of a valid character is not always clear.
At the very least, an incorrect UTF-8 sequence can't count as a
valid character.
So
printf '\300' | wc -m
should return 0, as the byte 11000000 (0xC0) on its own is not a
valid character, so we can't use your algorithm without first
verifying the validity of the input.
Then the UTF-8 encodings of the UTF-16 surrogate code points
(0xD800 to 0xDFFF) should probably be excluded as well:
printf '\355\240\200' | wc -m
should return 0, for instance.
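Both examples come out as argued once the input is validated before counting. A minimal Python sketch (this is not how GNU wc is implemented, which goes through the C mbrtowc() interface, but the effect is similar):

```python
def count_valid_chars(data: bytes) -> int:
    # Decode as UTF-8, dropping any ill-formed sequences (lone lead
    # bytes, truncated sequences, encoded surrogates), then count
    # the characters that survive. Python's UTF-8 decoder rejects
    # surrogate encodings such as 0xED 0xA0 0x80 (U+D800).
    return len(data.decode("utf-8", errors="ignore"))

print(count_valid_chars(b"\xc0"))          # lone lead byte
print(count_valid_chars(b"\xed\xa0\x80"))  # encoded surrogate
```

With this approach both printf examples above count 0 characters, while well-formed input counts one per decoded character.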
And maybe code points above 0x10FFFF too, since Unicode seems to
have given up on ever defining characters above that (probably
because of the UTF-16 limitation).
Now even in the ranges 0 -> 0xD7FF and 0xE000 -> 0x10FFFF, there
are still thousands of code points that are not defined yet in the
latest Unicode version. I suppose we can imagine locale
definitions where each of the known characters is listed and
the rest rejected...
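Whether a given code point is assigned can be probed, for instance, with Python's unicodedata module (which reflects whatever Unicode version the interpreter ships, so this is only an approximation; assigned control characters also lack names):

```python
import unicodedata

def is_assigned(cp: int) -> bool:
    # unicodedata.name() raises ValueError for code points that
    # have no character name in the bundled Unicode version.
    try:
        unicodedata.name(chr(cp))
        return True
    except ValueError:
        return False

print(is_assigned(0x41))    # LATIN CAPITAL LETTER A
print(is_assigned(0x0378))  # unassigned gap in the Greek block
```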
--
Stephane