bug-gnu-utils

Re: [IC/Bugs] Uninterpreted byte ranges in REs


From: Aharon Robbins
Subject: Re: [IC/Bugs] Uninterpreted byte ranges in REs
Date: Fri, 31 Oct 2008 14:19:55 +0200

Hi. Re this:

> Date: Thu, 30 Oct 2008 17:24:39 -0300
> From: Jorge Stolfi <address@hidden>
> To: address@hidden
> Subject: [IC/Bugs] Uninterpreted byte ranges in REs 
>
> Dear gawk Maintainers,
>
> Not long ago, the interpretation of ranges like '[A-Z]' in gawk regular
> expressions was changed from plain byte-code order to locale-sensitive
> collating sequence order.

This is per POSIX; I do understand your problem though.

> While this change was probably welcomed by many users, it unfortunately
> broke many existing scripts. Worse, there seems to be no decent way to get
> the old interpretation of RE ranges, in cases where it is needed.
>
> For example, here is some code from a script that used to work in
> 2005:
>
>   # Remove funny characters:
>   gsub(/[\001-\037\177-\240]/, " ", $0); # Controls, NBSP
>  
> The version of "gawk" that I am using now (GNU Awk 3.1.5)
> complains
>
>   gawk: myscript:12: fatal: Invalid collation character: /[-- ]/
>  
> Here is a minimal command line that triggers that error message:
>
>   gawk '/[\177-\240]/{ }' 
>   
> Here is a more meaningful example:
>   
>   echo "FOO @" | tr '@' '\203' | gawk '/[\177-\240]/{print;}' 
>
> The error message gets printed when LANG=C and LC_ALL=C, and also when
> LANG=POSIX and LC_ALL=POSIX. The "--traditional" switch makes no
> difference.

This is surprising. In particular, it works as expected for me
under Linux.  How are you setting LC_ALL?  If you're using Bash,
try

        export LC_ALL=C

as a standalone statement and then run your test.
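
Why the standalone export matters can be shown with a minimal sketch.
It assumes a POSIX shell and uses `sh -c` as a stand-in for any child
process such as gawk; the names `before` and `after` are illustrative
only. A plain assignment to a variable that is not already exported
stays inside the shell, so the child never sees it:

```shell
# Start from a clean slate so the example is reproducible.
unset LC_ALL

# Plain assignment: sets a shell variable, but does not export it.
LC_ALL=C
before=$(sh -c 'printf "%s" "${LC_ALL-unset}"')

# Standalone export: child processes now inherit the setting.
export LC_ALL=C
after=$(sh -c 'printf "%s" "${LC_ALL-unset}"')

echo "child sees before export: $before"
echo "child sees after export:  $after"
```

If the two lines differ, the earlier test was most likely run without
the C locale ever reaching gawk.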

What kind of system are you using? Linux, or some other Unix variant?
If so, how was gawk compiled?
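
The details asked for above could be collected roughly as follows. This
is only a sketch; it assumes `uname`, `head`, and (optionally) `locale`
are available, as on most POSIX systems. How gawk was configured at
build time is not reported by these commands.

```shell
# Operating system name and release.
uname -sr

# First line of gawk's version banner.
gawk --version | head -n 1

# Locale settings in effect for child processes.
locale 2>/dev/null || echo "locale utility not available"
```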

> In general, the only way to get the old semantics seems to be to list all 
> octal codes in the desired range:
>
>   ( echo "FOO @"; echo "BAR %" ) | tr '@%' '\177\203' | 
>     gawk '/[\177\200\201\202\203\204\205\206\207\210\211...\237\240]/{print;}' 

This works but it should not be necessary if you use LC_ALL=C.

I would prefer to find out why LC_ALL=C isn't working for you before trying
to modify gawk.

Thanks,

Arnold




