From: James Troup
Subject: range of characters doesn't match as expected if IGNORECASE is set and locale's mb_cur_max >
Date: Fri, 26 Nov 2004 18:42:04 +0000
User-agent: Gnus/5.1006 (Gnus v5.10.6) Emacs/21.3 (gnu/linux)
Hi,
A Debian developer, Fumitoshi UKAI <address@hidden>, reported[1]
the following bug in gawk 3.1.4:
| In all locales where mb_cur_max > 1, such as CJK or UTF-8 locales,
| [a-a] doesn't match A as expected when IGNORECASE is set.
|
| For example,
| % echo A | LANG=C gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
| A
|
| % echo A | LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
| %
| # wrong, A should match [a-a] when IGNORECASE=1
|
| With GAWK_NO_DFA=1 it works correctly, just as it does with LANG=C.
| % echo A | GAWK_NO_DFA=1 LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
| A
| %
|
| Note that [a-z] does match A, but not because IGNORECASE works:
| it matches because the collation order in this UTF-8 locale is
| "a A b B .. z". So [a-z] still won't match Z even with IGNORECASE=1.
|
| % echo Z | LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-z]+/{print}'
| %
| I think this patch fixes this problem:
|
| --- dfa.c.orig 2004-10-13 02:27:29.000000000 +0900
| +++ dfa.c 2004-10-13 02:27:54.000000000 +0900
| @@ -682,6 +682,28 @@
| REALLOC_IF_NECESSARY(work_mbc->range_ends, wchar_t,
| range_ends_al, work_mbc->nranges + 1);
| work_mbc->range_ends[work_mbc->nranges++] = (wchar_t)wc2;
| + if (case_fold && (iswlower((wint_t)wc) || iswupper((wint_t)wc))
| + && (iswlower((wint_t)wc2) || iswupper((wint_t)wc2))) {
| + wint_t altcase;
| + altcase = wc;
| + if (iswlower((wint_t)wc))
| + altcase = towupper((wint_t)wc);
| + else
| + altcase = towlower((wint_t)wc);
| + REALLOC_IF_NECESSARY(work_mbc->range_sts, wchar_t,
| + range_sts_al, work_mbc->nranges + 1);
| + work_mbc->range_sts[work_mbc->nranges] = (wchar_t)altcase;
| +
| + altcase = wc2;
| + if (iswlower((wint_t)wc2))
| + altcase = towupper((wint_t)wc2);
| + else
| + altcase = towlower((wint_t)wc2);
| + REALLOC_IF_NECESSARY(work_mbc->range_ends, wchar_t,
| + range_ends_al, work_mbc->nranges + 1);
| + work_mbc->range_ends[work_mbc->nranges++] = (wchar_t)altcase;
| +
| + }
| }
| else if (wc != WEOF)
| /* build normal characters. */
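
For readers who want to try the idea outside dfa.c, here is a minimal standalone sketch of what the patch appears to do: when case folding is on and both endpoints of a range are cased letters, a second range with the endpoints' case swapped is stored next to the original. The names `range_set`, `add_range`, `in_ranges`, and `MAX_RANGES` are hypothetical and are not gawk's. The fixed-size array stands in for `REALLOC_IF_NECESSARY`, and the membership test compares code points directly, which may differ from how dfa.c compares range endpoints.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

#define MAX_RANGES 16  /* hypothetical fixed capacity; dfa.c grows its arrays */

/* Hypothetical stand-in for the range_sts / range_ends arrays in the patch. */
struct range_set {
    wchar_t starts[MAX_RANGES];
    wchar_t ends[MAX_RANGES];
    size_t n;
};

/* True if wc is a lowercase or uppercase letter in the current locale. */
static bool is_cased(wint_t wc)
{
    return iswlower(wc) || iswupper(wc);
}

/* Same rule as the patch: lowercase goes to upper, anything else to lower. */
static wint_t swap_case(wint_t wc)
{
    return iswlower(wc) ? towupper(wc) : towlower(wc);
}

/* Store the range lo-hi. With case_fold set and both endpoints cased,
   also store the range with swapped-case endpoints.
   Returns false if the set has no room left. */
static bool add_range(struct range_set *rs, wchar_t lo, wchar_t hi,
                      bool case_fold)
{
    assert(rs != NULL);
    bool fold = case_fold && is_cased((wint_t)lo) && is_cased((wint_t)hi);
    size_t needed = fold ? 2 : 1;

    if (rs->n + needed > MAX_RANGES)
        return false;

    rs->starts[rs->n] = lo;
    rs->ends[rs->n] = hi;
    rs->n++;

    if (fold) {
        rs->starts[rs->n] = (wchar_t)swap_case((wint_t)lo);
        rs->ends[rs->n] = (wchar_t)swap_case((wint_t)hi);
        rs->n++;
    }
    return true;
}

/* Simplified membership test: plain code-point comparison, no collation. */
static bool in_ranges(const struct range_set *rs, wchar_t wc)
{
    for (size_t i = 0; i < rs->n; i++)
        if (rs->starts[i] <= wc && wc <= rs->ends[i])
            return true;
    return false;
}
```

In this sketch, adding the range a-a with case folding enabled stores two ranges, a-a and A-A, so a lookup for A succeeds. Without case folding only a-a is stored and A is rejected, which mirrors the behaviour reported above.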
--
James
[1] http://bugs.debian.org/276206