bug-coreutils

bug#53145: "cut" can't segment Chinese characters correctly?


From: zendas
Subject: bug#53145: "cut" can't segment Chinese characters correctly?
Date: Sun, 09 Jan 2022 19:51:33 +0000

Create a new file, test2.txt, with the following content:
星期一
星期二
星期三
星期四
星期五
星期六
星期日
=============================
zendas@Backup-Server:/tmp$ cat test2.txt
星期一
星期二
星期三
星期四
星期五
星期六
星期日
zendas@Backup-Server:/tmp$
=============================
zendas@Backup-Server:/tmp$ cut -c 1 test2.txt
�
�
�
�
�
�
�
zendas@Backup-Server:/tmp$ cut -c 2 test2.txt
�
�
�
�
�
�
�
zendas@Backup-Server:/tmp$ cut -c 1-3 test2.txt
星
星
星
星
星
星
星
zendas@Backup-Server:/tmp$
=============================
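
The replacement characters above are what one would expect if cut -c is selecting single bytes rather than whole characters. A minimal check, assuming the data is UTF-8 encoded (the variable names are illustrative and not part of the original report), counts the bytes in one character and in one line:

```shell
# wc -c counts bytes regardless of locale.
# $(( ... )) strips any padding that some wc implementations add.
one_char_bytes=$(( $(printf '星' | wc -c) ))
one_line_bytes=$(( $(printf '星期一' | wc -c) ))
echo "bytes in one character: $one_char_bytes"
echo "bytes in one line: $one_line_bytes"
```

If each character occupies three bytes, then byte 1 or byte 2 alone is an incomplete UTF-8 sequence, while bytes 1-3 together form the first character.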
Reference source:
https://blog.csdn.net/m0_38110132/article/details/79883827

My environment is:
zendas@Backup-Server:~$ cat /etc/debian_version
11.1
zendas@Backup-Server:~$ cut --version
cut (GNU coreutils) 8.32
Copyright (C) 2020 Free Software Foundation, Inc.
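License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by David M. Ihnat, David MacKenzie, and Jim Meyering.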

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Monday, January 10, 2022 at 3:40 AM, Bob Proulx <bob@proulx.com> wrote:

> zendas wrote:
>
> > Hello, I need to extract Chinese characters from a string. I searched
> > many documents, and it seems that the -c option of cut should meet
> > my needs. However, even when I run the commands from the web page
> > exactly as shown, the result differs from the demonstration. I have
> > tried dozens of pages and the results never match the demos. Maybe
> > this is a bug?
>
> Unfortunately the example was attached as images instead of as plain
> text. In the future, please copy and paste the example as text rather
> than as an image. An image cannot be copied and pasted to reproduce
> the problem, and its strings cannot be searched.
>
> The images were also lost somewhere in the mailing list pipeline for
> this message. First it was classified as spam by the anti-spam robot
> (SpamAssassin-Bogofilter-CRM114). I caught it in review and re-sent
> the message. That may have been the problem specifically with images.
>
> > For example:
> >
> > https://blog.csdn.net/xuzhangze/article/details/80930714
> >
> > [20180705173450701.png]
> >
> > The result of my attempt:
> >
> > [螢幕快照 2022-01-10 02:49:46.png]
>
> One of the two images:
>
> https://debbugs.gnu.org/cgi/bugreport.cgi?msg=5;bug=53145;att=3;filename=20180705173450701.png
>
> A second problem is that the first image shows as corrupted. I can
> view the original, however. To my eye they are similar enough that
> the one above is sufficient and I do not need to re-send the corrupted
> image.
>
> As to the problem you have reported, it is due to the lack of
> internationalization support for characters. -c is the same as -b at
> this moment.
>
> https://www.gnu.org/software/coreutils/manual/html_node/cut-invocation.html#cut-invocation
>
> ‘-c CHARACTER-LIST’
> ‘--characters=CHARACTER-LIST’
>     Select for printing only the characters in positions listed in
>     CHARACTER-LIST. The same as ‘-b’ for now, but internationalization
>     will change that. Tabs and backspaces are treated like any other
>     character; they take up 1 character. If an output delimiter is
>     specified, (see the description of ‘--output-delimiter’), then
>     output that string between ranges of selected bytes.
>
> For multi-byte UTF-8 characters the -c option currently operates the
> same as the -b option and is not suitable for dealing with multi-byte
> characters.
>
> $ echo '螢幕快照'
> 螢幕快照
> $ echo '螢幕快照' | cut -c 1
> ?
> $ echo '螢幕快照' | cut -c 1-3
> 螢
> $ echo '螢幕快照' | cut -b 1-3
> 螢
>
> If the characters are known to be 3-byte multi-byte characters then I
> might suggest using -b to work around the problem, assuming 3-byte
> characters. Eventually, when -c is coded to handle multi-byte
> characters, its byte-wise handling will change. Using -b would avoid
> being affected by that change.
>
> Some operating systems have patched that specific version of the
> utilities locally to add multi-byte character handling. But the
> patches have not been found acceptable for inclusion. That is why
> there are differences between different operating systems.
>
> Bob
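
The -b workaround suggested above can be sketched as a small shell function. This is only a sketch: it assumes every character on the line is exactly three bytes in UTF-8, which holds for the sample data in this report but not for ASCII text or for characters encoded in four bytes. The function name cut_char3 is hypothetical and not part of coreutils.

```shell
# cut_char3 FIRST [LAST]: select characters FIRST..LAST from each input line,
# ASSUMING every character is exactly 3 bytes in UTF-8.
# Character N then occupies bytes 3N-2 through 3N, so cut -b can select it.
cut_char3() {
    first=$1
    last=${2:-$1}          # a single argument selects one character
    cut -b "$(( 3 * first - 2 ))-$(( 3 * last ))"
}

printf '星期一\n星期二\n' | cut_char3 3    # third character of each line
printf '螢幕快照\n' | cut_char3 1 2        # first two characters
```

Because the function works purely on byte offsets, its result should not change if a later cut release makes -c character-aware. On mixed-width input it would split characters just as cut -c 1 does in the report.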




