emacs-bug-tracker

bug#56352: closed (UTF-8 LC_CTYPE bug esp when a certain range of Korean characters)


From: GNU bug Tracking System
Subject: bug#56352: closed (UTF-8 LC_CTYPE bug esp when a certain range of Korean characters)
Date: Sat, 02 Jul 2022 21:29:02 +0000

Your message dated Sat, 2 Jul 2022 16:28:40 -0500
with message-id <6dc73457-0b41-ce63-c4c1-9c329848c766@cs.ucla.edu>
and subject line Re: bug#56350: UTF-8 LC_CTYPE bug esp when a certain range of 
Korean characters
has caused the debbugs.gnu.org bug report #56352,
regarding UTF-8 LC_CTYPE bug esp when a certain range of Korean characters
to be marked as done.

(If you believe you have received this mail in error, please contact
help-debbugs@gnu.org.)


-- 
56352: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=56352
GNU Bug Tracking System
Contact help-debbugs@gnu.org with problems
--- Begin Message ---
Subject: UTF-8 LC_CTYPE bug esp when a certain range of Korean characters
Date: Sat, 2 Jul 2022 14:41:41 +0900

Grep (and also Sed) cannot match a certain range of Korean characters when running under LC_CTYPE=C.UTF-8 (or any other locale with UTF-8 encoding, such as en_US.UTF-8, ko_KR.UTF-8, or ja_JP.UTF-8).

Reproduce the bug:
$ export LC_CTYPE=C.UTF-8
$ echo 폿 | grep .
폿                   <-- a character that is in the range [가-폿] (<UAC00>~<UD3FF>)
                         is matched without any issue
$ echo 퐀 | grep .
$                    <-- but a character in the range [퐀-힣] (<UD400>~<UD7A3>)
                         CANNOT be matched but it IS SUPPOSED TO be matched.

Sed has the same issue with the period regex.

Sed example:
$ export LC_CTYPE=C.UTF-8
$ echo "폿" | sed -e 's/./a/'
a                             <-- matched and replaced without an issue
$ echo "퐀" | sed -e 's/./a/'
퐀                            <-- FAILED to match, so nothing is replaced

I think it is related to <regex.h> or <iconv.h> in Glibc, but I couldn't find a way to reproduce the bug with those directly, so I am reporting it against Grep instead.
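
For reference when narrowing this down, the two test characters can be dumped as raw UTF-8 bytes. The sketch below assumes a POSIX shell with `od` available; `utf8_hex` is a hypothetical helper name, not part of Grep or Sed. It only shows how the characters are encoded (U+D3FF and U+D400 differ in the second byte) and does not by itself establish the cause of the failure.

```shell
# Hypothetical helper: print the UTF-8 bytes of its argument as hex.
utf8_hex() {
  # od prints the bytes as space-padded hex; unquoted expansion
  # (word splitting) drops the padding, and "$*" rejoins with spaces.
  set -- $(printf '%s' "$1" | od -An -tx1)
  echo "$*"
}

utf8_hex 폿   # U+D3FF, the last character that still matches
utf8_hex 퐀   # U+D400, the first character that fails to match
```

The first call prints `ed 8f bf` and the second prints `ed 90 80`.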

P.S. My usual email address (not this one) seems to have been rejected by the server, and I don't know why, so I am posting this again from a Gmail account.

--- End Message ---
--- Begin Message ---
Subject: Re: bug#56350: UTF-8 LC_CTYPE bug esp when a certain range of Korean characters
Date: Sat, 2 Jul 2022 16:28:40 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1

Thanks, that's a Gnulib bug that was fixed here:

https://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=b19a10775e54f8ed17e3a8c08a72d261d8c26244

This has been propagated to GNU Grep and the fix should appear in the next Grep release. I plan to reply separately about GNU Sed.
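
To check whether a particular grep build already includes the fix, the reproduction from the report can be wrapped in a small script. This is a minimal sketch, not an official test: `dot_matches` is a hypothetical helper name, and it assumes the C.UTF-8 locale is installed (it sets LC_ALL rather than LC_CTYPE so that an inherited LC_ALL cannot override the locale).

```shell
# Hypothetical helper: succeed if `grep .` matches the given string
# when run under a UTF-8 locale (assumes C.UTF-8 is installed).
dot_matches() {
  printf '%s\n' "$1" | LC_ALL=C.UTF-8 grep -q .
}

# The two characters from the report: U+D3FF and U+D400.
for ch in 폿 퐀; do
  if dot_matches "$ch"; then
    echo "$ch: matched"
  else
    echo "$ch: NOT matched"
  fi
done
```

On an affected build the second character is expected to be reported as not matched, as in the original report; on a build with the Gnulib fix both should match.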



--- End Message ---
