bug#34524: wc: word count incorrect when words separated only by no-brea

bug-coreutils

From:	Bob Proulx
Subject:	bug#34524: wc: word count incorrect when words separated only by no-break space
Date:	Fri, 22 Feb 2019 16:34:04 -0700
User-agent:	Mutt/1.10.1 (2018-07-13)

address@hidden wrote:
> The man page for wc states: "A word is a... sequence of characters delimited 
> by white space."
> 
> But its concept of white space only seems to include ASCII white
> space.  U+00A0 NO-BREAK SPACE, for instance, is not recognized.

Indeed.  This is because wc, like the other coreutils programs and many
other programs, relies on the libc locale definition of the character
classes.

  $ printf '\xC2\xA0\n' | env LC_ALL=en_US.UTF-8 od -tx1 -c
  0000000  c2  a0  0a
          302 240  \n
  0000003

  $ printf '\xC2\xA0\n' | env LC_ALL=en_US.UTF-8 grep '[[:space:]]' | wc -l
  0
  $ printf '\xC2\xA0 \n' | env LC_ALL=en_US.UTF-8 grep '[[:space:]]' | wc -l
  1

This shows that grep does not recognize \xC2\xA0 as a character in the
class of space characters either.

  $ printf '\xC2\xA0\n' | env LC_ALL=en_US.UTF-8 tr '[[:space:]]' x | od -tx1 -c
  0000000  c2  a0  78
          302 240   x
  0000003

Here the newline, which is in the space class, matches and is translated
to x, while the \xC2\xA0 bytes pass through unchanged.

Since character classes are defined as part of the locale table, there
isn't really anything we can do about it on the coreutils wc side of
things.  The space class would need to be redefined upstream, in the
libc locale data.
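
In the meantime, one possible workaround is to rewrite the no-break
space as an ordinary space before the data reaches wc.  The following
is only a sketch, assuming UTF-8 input in which U+00A0 appears as the
two bytes C2 A0.  The function name nbsp_wc is made up for
illustration, and other Unicode space characters are not handled.

```shell
# Hypothetical helper: count words on stdin, treating UTF-8
# NO-BREAK SPACE (bytes C2 A0) as a word separator.
nbsp_wc() {
  # Build the two-byte sequence with portable octal escapes.
  nbsp=$(printf '\302\240')
  # LC_ALL=C makes sed match the raw bytes rather than decoded characters.
  LC_ALL=C sed "s/$nbsp/ /g" | wc -w
}

# "foo" and "bar" separated only by a no-break space.
printf 'foo\302\240bar\n' | nbsp_wc
```

After the substitution wc sees "foo bar" and reports 2 words, whatever
the locale says about U+00A0.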

Bob
