range of characters doesn't match as expected if IGNORECASE is set and l


From: James Troup
Subject: range of characters doesn't match as expected if IGNORECASE is set and locale's mb_cur_max >
Date: Fri, 26 Nov 2004 18:42:04 +0000
User-agent: Gnus/5.1006 (Gnus v5.10.6) Emacs/21.3 (gnu/linux)

Hi,

A Debian developer, Fumitoshi UKAI <address@hidden>, reported [1]
the following bug in gawk 3.1.4:

| In all locales where mb_cur_max > 1, such as CJK or UTF-8 locales,
| [a-a] doesn't match A as expected if IGNORECASE is set.
| 
| For example,
|  % echo A | LANG=C gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
|  A              
| 
|  % echo A | LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
|  %              
|  # wrong, A should match [a-a] when IGNORECASE=1
| 
| If GAWK_NO_DFA=1 is set, it works correctly, just as it does with LANG=C.
|  % echo A | GAWK_NO_DFA=1 LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
|  A
|  %
| 
| Note that [a-z] does match A, but not because IGNORECASE works: the
| collation order in UTF-8 is "a A b B .. z", so A falls inside the range.
| Z sorts after z, so [a-z] won't match Z even if IGNORECASE=1.
| 
|  % echo Z | LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-z]+/{print}'
|  %
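
To illustrate the collation point in the report, here is a minimal C sketch.
It assumes the "a A b B .. z" order quoted above and hard-codes it as a
string; the names collation_order, collation_pos and in_collated_range are
hypothetical and are not taken from gawk or from the locale data.

```c
#include <assert.h>
#include <string.h>

/* Assumed collation sequence, following the order quoted in the report.
   A real locale's collation table is far larger than this. */
static const char collation_order[] =
    "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ";

/* Position of c in the assumed sequence, or -1 if c is not listed. */
static int collation_pos(char c)
{
    const char *p;

    if (c == '\0')
        return -1;
    p = strchr(collation_order, c);
    return p ? (int)(p - collation_order) : -1;
}

/* Nonzero if c lies inside [lo-hi] when compared by collation position. */
int in_collated_range(char lo, char hi, char c)
{
    int pl = collation_pos(lo);
    int ph = collation_pos(hi);
    int pc = collation_pos(c);

    if (pl < 0 || ph < 0 || pc < 0)
        return 0;
    return pl <= pc && pc <= ph;
}

/* The three cases from the report, with no case folding involved at all. */
void check_report_examples(void)
{
    assert(in_collated_range('a', 'z', 'A'));   /* A sorts between a and z */
    assert(!in_collated_range('a', 'z', 'Z'));  /* Z sorts after z */
    assert(!in_collated_range('a', 'a', 'A'));  /* [a-a] holds only a */
}
```

Under that assumed order, A lands in [a-z] purely by position, which is why
[a-z] appears to work while [a-a] and Z expose the missing case folding.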

| I think this patch fixes this problem:
| 
| --- dfa.c.orig  2004-10-13 02:27:29.000000000 +0900
| +++ dfa.c       2004-10-13 02:27:54.000000000 +0900
| @@ -682,6 +682,28 @@
|           REALLOC_IF_NECESSARY(work_mbc->range_ends, wchar_t,
|                                range_ends_al, work_mbc->nranges + 1);
|           work_mbc->range_ends[work_mbc->nranges++] = (wchar_t)wc2;
| +         if (case_fold && (iswlower((wint_t)wc) || iswupper((wint_t)wc))
| +                         && (iswlower((wint_t)wc2) || iswupper((wint_t)wc2))) {
| +                 wint_t altcase;
| +                 altcase = wc;
| +                 if (iswlower((wint_t)wc))
| +                         altcase = towupper((wint_t)wc);
| +                 else
| +                         altcase = towlower((wint_t)wc);
| +                 REALLOC_IF_NECESSARY(work_mbc->range_sts, wchar_t,
| +                                 range_sts_al, work_mbc->nranges + 1);
| +                 work_mbc->range_sts[work_mbc->nranges] = (wchar_t)altcase;
| +
| +                 altcase = wc2;
| +                 if (iswlower((wint_t)wc2))
| +                         altcase = towupper((wint_t)wc2);
| +                 else
| +                         altcase = towlower((wint_t)wc2);
| +                 REALLOC_IF_NECESSARY(work_mbc->range_ends, wchar_t,
| +                                 range_ends_al, work_mbc->nranges + 1);
| +                 work_mbc->range_ends[work_mbc->nranges++] = (wchar_t)altcase;
| +
| +         }
|         }
|        else if (wc != WEOF)
|         /* build normal characters.  */
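
For readers who want to try the idea outside dfa.c, the following is a
standalone sketch of what the patch appears to do: when case folding is on
and both range endpoints are cased letters, it records a second range with
each endpoint switched to its opposite case. The struct wrange type and the
alt_case and fold_range functions are hypothetical names for this
illustration; they do not exist in gawk.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical stand-in for one start/end pair of dfa.c's range arrays. */
struct wrange {
    wchar_t start;
    wchar_t end;
};

/* Opposite-case form of wc; wc is returned unchanged if it has no case. */
static wchar_t alt_case(wchar_t wc)
{
    if (iswlower((wint_t)wc))
        return (wchar_t)towupper((wint_t)wc);
    if (iswupper((wint_t)wc))
        return (wchar_t)towlower((wint_t)wc);
    return wc;
}

/* Store [lo-hi] in out[0].  If case_fold is set and both endpoints are
   cased letters, also store the opposite-case range in out[1].
   Returns the number of ranges written (1 or 2). */
size_t fold_range(wchar_t lo, wchar_t hi, int case_fold, struct wrange out[2])
{
    size_t n = 0;

    assert(out != NULL);

    out[n].start = lo;
    out[n].end = hi;
    n++;

    if (case_fold
        && (iswlower((wint_t)lo) || iswupper((wint_t)lo))
        && (iswlower((wint_t)hi) || iswupper((wint_t)hi))) {
        out[n].start = alt_case(lo);
        out[n].end = alt_case(hi);
        n++;
    }
    return n;
}
```

With case_fold set, fold_range(L'a', L'a', 1, r) yields both [a-a] and [A-A],
so A would match; with case_fold clear, or with uncased endpoints such as
[0-9], only the original range is stored.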

-- 
James

[1] http://bugs.debian.org/276206



