bug#56352: UTF-8 LC_CTYPE bug esp when a certain range of Korean charact

bug-grep

From:	김태엽
Subject:	bug#56352: UTF-8 LC_CTYPE bug esp when a certain range of Korean characters
Date:	Sat, 2 Jul 2022 14:41:41 +0900

Grep (and also Sed) cannot match a certain range of Korean characters when
running under LC_CTYPE=C.UTF-8 (or any other locale with UTF-8 encoding,
such as en_US.UTF-8, ko_KR.UTF-8, or ja_JP.UTF-8).

Reproduce the bug:
$ export LC_CTYPE=C.UTF-8
$ echo 폿 | grep .
폿                   <-- a character in the range [가-폿] (<UAC00>~<UD3FF>)
                         is matched without any issue
$ echo 퐀 | grep .
$                    <-- but a character in the range [퐀-힣] (<UD400>~<UD7A3>)
                         CANNOT be matched but it IS SUPPOSED TO be matched.
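
The check above can be wrapped in a small helper so the same test is easy to repeat across locales. This is only a sketch: `dot_matches` is a hypothetical name (not part of Grep), it sets LC_ALL rather than LC_CTYPE so that any LC_ALL already in the environment does not mask the setting, and it assumes the C.UTF-8 locale is installed on the system.

```shell
# Hypothetical helper (not part of grep): report whether grep's "."
# matches the given text under the given locale.
#   $1 = locale name (applied via LC_ALL, which overrides LC_CTYPE)
#   $2 = text to test
dot_matches() {
    if printf '%s\n' "$2" | LC_ALL="$1" grep -q .; then
        printf '%s: match\n' "$2"
    else
        printf '%s: NO match\n' "$2"
    fi
}

# The two characters on either side of the reported boundary
# (U+D3FF and U+D400). Assumes C.UTF-8 is available.
for ch in 폿 퐀; do
    dot_matches C.UTF-8 "$ch"
done
```

Swapping C.UTF-8 for en_US.UTF-8 or ko_KR.UTF-8 in the loop gives the same test for the other locales mentioned above, provided they are installed.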

Sed has the same issue with the period regex.

Example with Sed:
$ export LC_CTYPE=C.UTF-8
$ echo "폿" | sed -e 's/./a/'
a                             <-- matched and replaced without an issue
$ echo "퐀" | sed -e 's/./a/'
퐀                            <-- FAILED to match, so nothing is replaced
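
To rule out an input-encoding problem, it may help to confirm the exact bytes being piped into Grep and Sed. The following is a minimal sketch; `utf8_hex` is a hypothetical helper name, and it assumes the script (or terminal) is using UTF-8 and that `od` and `tr` are available.

```shell
# Hypothetical helper: print the bytes of the argument as lowercase hex,
# with no separators, followed by a newline.
utf8_hex() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    printf '\n'
}

# The two boundary characters from the report, when encoded as UTF-8:
utf8_hex 폿   # U+D3FF -> ed8fbf
utf8_hex 퐀   # U+D400 -> ed9080
```

Both characters are ordinary three-byte UTF-8 sequences that differ only in their second and third bytes.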

I suspect this is related to <regex.h> or <iconv.h> in Glibc, but I could
not find a way to reproduce the bug with those directly, so I am reporting
it against Grep instead.

P.S. For some reason, my other email address (not this one) seems to have
been rejected by the server, and I don't know why, so I am posting this
again from a Gmail account.
