bug-gnu-utils

regular expression yields random result in UTF-8 locale


From: Bruno Haible
Subject: regular expression yields random result in UTF-8 locale
Date: Fri, 10 Dec 2004 17:52:04 +0100
User-agent: KMail/1.5

Hi,

In a UTF-8 locale, regular expression matching in gawk gives results
that do not depend only on the current input line, even when the input
is pure ASCII.

gawk version: 3.1.4
compiled on: Linux/x86, glibc-2.3.3
with: gcc-3.3.1
using regex: from the gawk package

$ nm `which gawk` | grep ' reg'
08070920 T regcomp
08070a20 T regerror
080747d0 T regexec
0805ecc0 t regexp
08070ca0 T regfree
0806ffd0 t register_state

Program:
======================== extract.awk ========================
/^[a-zA-Z_][a-zA-Z0-9_]*/ {
    printf("In rule 1: " $0 "\n");
}
/^\{ *$/ {
    printf("In rule 2: " $0 "\n");
}
{
    printf("In rule 3: " $0 "\n");
}
=============================================================

Input files:
========================= input1 =========================
intern void *pth_scheduler(void *dummy)
{
}
==========================================================
========================= input2 =========================
/* the heart of this library: the thread scheduler */
intern void *pth_scheduler(void *dummy)
{
}
==========================================================

Operation in C locale and de_DE (ISO-8859-1) locale:

$ LC_ALL=C gawk -f extract.awk < input1
In rule 1: intern void *pth_scheduler(void *dummy)
In rule 3: intern void *pth_scheduler(void *dummy)
In rule 2: {
In rule 3: {
In rule 3: }
$ LC_ALL=C gawk -f extract.awk < input2
In rule 3: /* the heart of this library: the thread scheduler */
In rule 1: intern void *pth_scheduler(void *dummy)
In rule 3: intern void *pth_scheduler(void *dummy)
In rule 2: {
In rule 3: {
In rule 3: }
$ LC_ALL=de_DE gawk -f extract.awk < input1
In rule 1: intern void *pth_scheduler(void *dummy)
In rule 3: intern void *pth_scheduler(void *dummy)
In rule 2: {
In rule 3: {
In rule 3: }
$ LC_ALL=de_DE gawk -f extract.awk < input2
In rule 3: /* the heart of this library: the thread scheduler */
In rule 1: intern void *pth_scheduler(void *dummy)
In rule 3: intern void *pth_scheduler(void *dummy)
In rule 2: {
In rule 3: {
In rule 3: }

This is all OK. Now the de_DE.UTF-8 locale:

$ LC_ALL=de_DE.UTF-8 gawk -f extract.awk < input1
In rule 1: intern void *pth_scheduler(void *dummy)
In rule 3: intern void *pth_scheduler(void *dummy)
In rule 1: {
In rule 2: {
In rule 3: {
In rule 1: }
In rule 3: }
$ LC_ALL=de_DE.UTF-8 gawk -f extract.awk < input2
In rule 3: /* the heart of this library: the thread scheduler */
In rule 3: intern void *pth_scheduler(void *dummy)
In rule 2: {
In rule 3: {
In rule 3: }

The expected output is the same as the one from the de_DE locale, because
the definition of ranges in regular expressions is the same in the de_DE
and de_DE.UTF-8 locales.

The output shows two bugs:
  - In the input1 case, rule 1 SHOULD NOT apply to the lines "{" and "}".
  - In the input2 case, rule 1 SHOULD apply to the line "intern ...".
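
The comparison against the C locale can be scripted. The following is only
a sketch, not part of the original report: it recreates extract.awk and the
input files in a temporary directory and reports whether each locale gives
the same output as the C locale. The AWK variable, the run_in_locale helper
and the temporary directory are assumptions made for illustration. The
script also assumes that the de_DE and de_DE.UTF-8 locales are installed;
if a locale is missing, the program typically falls back to the C locale
and the check passes trivially.

```shell
#!/bin/sh
# Sketch: compare the rule output in several locales against the C locale.
# AWK defaults to gawk if present, otherwise to whatever "awk" is installed;
# set AWK explicitly to test one particular binary.
AWK=${AWK:-$(command -v gawk || command -v awk)}
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# The program from the report, unchanged.
cat > "$workdir/extract.awk" <<'EOF'
/^[a-zA-Z_][a-zA-Z0-9_]*/ {
    printf("In rule 1: " $0 "\n");
}
/^\{ *$/ {
    printf("In rule 2: " $0 "\n");
}
{
    printf("In rule 3: " $0 "\n");
}
EOF

# The input files from the report.
printf '%s\n' 'intern void *pth_scheduler(void *dummy)' '{' '}' \
    > "$workdir/input1"
printf '%s\n' '/* the heart of this library: the thread scheduler */' \
    'intern void *pth_scheduler(void *dummy)' '{' '}' \
    > "$workdir/input2"
printf '%s\n' '/* the heart of this library: the thread scheduler' ' */' \
    'intern void *pth_scheduler(void *dummy)' '{' '}' \
    > "$workdir/input3"

# Hypothetical helper: $1 = locale name, $2 = input file name.
run_in_locale() {
    LC_ALL=$1 "$AWK" -f "$workdir/extract.awk" < "$workdir/$2"
}

for input in input1 input2 input3; do
    baseline=$(run_in_locale C "$input")
    for loc in de_DE de_DE.UTF-8; do
        actual=$(run_in_locale "$loc" "$input" 2>/dev/null)
        if [ "$actual" = "$baseline" ]; then
            echo "$input in $loc: same as C locale"
        else
            echo "$input in $loc: DIFFERS from C locale"
        fi
    done
done
```

Since all three inputs are pure ASCII, every line of the script's output
should read "same as C locale"; a "DIFFERS" line for de_DE.UTF-8 would
indicate the behaviour described here.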

One might guess that whether rule 1 matches depends only on the first
line of the input, but this is not the case. Consider:

========================= input3 =========================
/* the heart of this library: the thread scheduler
 */
intern void *pth_scheduler(void *dummy)
{
}
==========================================================
$ LC_ALL=de_DE.UTF-8 gawk -f extract.awk < input3
In rule 3: /* the heart of this library: the thread scheduler
In rule 3:  */
In rule 1: intern void *pth_scheduler(void *dummy)
In rule 3: intern void *pth_scheduler(void *dummy)
In rule 1: {
In rule 2: {
In rule 3: {
In rule 1: }
In rule 3: }

Bruno