bug-coreutils

bug#53145: "cut" can't segment Chinese characters correctly?


From: Bob Proulx
Subject: bug#53145: "cut" can't segment Chinese characters correctly?
Date: Sun, 9 Jan 2022 12:40:20 -0700

zendas wrote:
> Hello, I need to get Chinese characters from the string. I googled a
> lot of documents, it seems that the -c parameter of cut should be
> able to meet my needs, but I even directly execute the instructions
> on the web page, and the result is different from the
> demonstration. I have searched dozens of pages but the results are
> not the same as the demo, maybe this is a bug?

Unfortunately the example was attached as images instead of as plain
text.  In the future, please copy and paste examples as text rather
than attaching images.  Text in an image cannot be copied and pasted
to reproduce the problem, and it cannot be searched.

The images were also lost somewhere along the mailing list pipeline
for this message.  It was first classified as spam by the anti-spam
robot (SpamAssassin-Bogofilter-CRM114).  I caught it in review and
re-sent the message.  That may be what caused the problem with the
images specifically.

> For example:
> https://blog.csdn.net/xuzhangze/article/details/80930714
> [20180705173450701.png]
> the result of my attempt:
> [螢幕快照 2022-01-10 02:49:46.png]

One of the two images:

    
    https://debbugs.gnu.org/cgi/bugreport.cgi?msg=5;bug=53145;att=3;filename=20180705173450701.png

A second problem is that the first image shows up as corrupted.  I
can view the original, however.  To my eye the two images are similar
enough that the one above is sufficient, and I do not need to re-send
the corrupted image.

As to the problem you have reported, it is due to the lack of
internationalization support for characters in cut.  At this moment
-c is the same as -b.

    
    https://www.gnu.org/software/coreutils/manual/html_node/cut-invocation.html#cut-invocation

    ‘-c CHARACTER-LIST’
    ‘--characters=CHARACTER-LIST’
         Select for printing only the characters in positions listed in
         CHARACTER-LIST.  The same as ‘-b’ for now, but internationalization
         will change that.  Tabs and backspaces are treated like any other
         character; they take up 1 character.  If an output delimiter is
         specified, (see the description of ‘--output-delimiter’), then
         output that string between ranges of selected bytes.

As of the current version the -c option operates on bytes, the same
as the -b option, and so it is not suitable for dealing with
multi-byte UTF-8 characters.

    $ echo '螢幕快照'
    螢幕快照
    $ echo '螢幕快照' | cut -c 1
    ?
    $ echo '螢幕快照' | cut -c 1-3
    螢
    $ echo '螢幕快照' | cut -b 1-3
    螢
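Counting bytes may help show why bytes 1-3 select exactly one
character in the example above.  This is only a small sketch,
assuming UTF-8 encoded input and a POSIX shell.  The helper name
byte_len is made up for illustration and is not part of coreutils.

```shell
# byte_len is a hypothetical helper: print the number of bytes in its argument.
byte_len() {
    # wc -c counts bytes, not characters.  The arithmetic expansion
    # strips any padding that some wc implementations add.
    echo $(( $(printf '%s' "$1" | wc -c) ))
}

byte_len '螢'          # one character, three bytes in UTF-8
byte_len '螢幕快照'    # four characters, twelve bytes in UTF-8
```

With each character taking three bytes, "cut -c 1" returns only the
first byte of the first character, which is not valid UTF-8 on its
own and so is displayed as "?".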

If the characters are all known to be 3-byte multi-byte characters
then I might suggest using -b to work around the problem, assuming
3 bytes per character.  Eventually, when -c is coded to handle
multi-byte characters, its current handling of input as bytes will
change.  Using -b would avoid being affected by that change.
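One possible way to apply the -b workaround is to convert character
positions into byte ranges.  This is a minimal sketch, not a general
solution.  It assumes every character on the line is exactly 3 bytes
in UTF-8, and the function name cut_chars3 is hypothetical.

```shell
# cut_chars3 is a hypothetical helper, not part of coreutils.
# Usage: cut_chars3 FIRST LAST   (1-based character positions, input on stdin)
# Assumption: every character on the line is exactly 3 bytes in UTF-8.
cut_chars3() {
    # Character N occupies bytes 3*(N-1)+1 through 3*N.
    first_byte=$(( ($1 - 1) * 3 + 1 ))
    last_byte=$(( $2 * 3 ))
    cut -b "${first_byte}-${last_byte}"
}

echo '螢幕快照' | cut_chars3 1 1    # selects the first character
echo '螢幕快照' | cut_chars3 2 3    # selects the second and third characters
```

If the line mixes in ASCII or other characters of a different byte
length, the arithmetic above no longer lines up with character
boundaries and the output will be wrong.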

Some operating systems have patched their version of the utilities
locally to add multi-byte character handling.  But those patches have
not been found acceptable for inclusion upstream.  That is why the
behavior differs between operating systems.

Bob




