bug-gnu-utils

[IC/Bugs] Uninterpreted byte ranges in REs


From: Jorge Stolfi
Subject: [IC/Bugs] Uninterpreted byte ranges in REs
Date: Thu, 30 Oct 2008 17:24:39 -0300

Dear gawk Maintainers,

Not long ago, the interpretation of ranges like '[A-Z]' in gawk regular
expressions was changed from plain byte-code order to locale-sensitive
collating sequence order.

While this change was probably welcomed by many users, it unfortunately
broke many existing scripts. Worse, there seems to be no decent way to
get the old interpretation of RE ranges in cases where it is needed.

For example, here is some code from a script that used to work in
2005:

  # Remove funny characters:
  gsub(/[\001-\037\177-\240]/, " ", $0); # Controls, NBSP
 
The version of "gawk" that I am using now (GNU Awk 3.1.5)
complains

  gawk: myscript:12: fatal: Invalid collation character: /[-- ]/
 
Here is a minimal command line that triggers that error message:

  gawk '/[\177-\240]/{ }' 
  
Here is a more meaningful example:
  
  echo "FOO @" | tr '@' '\203' | gawk '/[\177-\240]/{print;}' 

The error message gets printed when LANG=C and LC_ALL=C, and also when
LANG=POSIX and LC_ALL=POSIX. The "--traditional" switch makes no
difference.

In general, the only way to get the old semantics seems to be to list
every octal code in the desired range explicitly:

  ( echo "FOO @"; echo "BAR %" ) | tr '@%' '\177\203' | 
    gawk '/[\177\200\201\202\203\204\205\206\207\210\211...\237\240]/{print;}' 
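
Until something better exists, the expanded list at least need not be
typed by hand. The following is only a minimal sketch, assuming a POSIX
shell; "byte_class" is a hypothetical helper name, not anything provided
by gawk. It prints a bracket expression that lists every byte from LO to
HI (given in decimal) as a three-digit octal escape:

```shell
# byte_class LO HI
# Hypothetical helper: print a bracket expression that lists every byte
# value from LO to HI (decimal, inclusive) as a 3-digit octal escape.
# The body runs in a subshell so its variables do not leak to the caller.
byte_class() (
  i=$1
  out='['
  while [ "$i" -le "$2" ]; do
    # '\\%03o' prints a literal backslash followed by the octal value
    out="$out$(printf '\\%03o' "$i")"
    i=$((i + 1))
  done
  printf '%s]\n' "$out"
)

byte_class 127 131   # prints [\177\200\201\202\203]
```

The result can then be spliced into the program text, which should give
the same hand-expanded form as above (127 and 160 are \177 and \240):

  gawk "/$(byte_class 127 160)/{print;}"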

Is there any way to get the old behavior of RE ranges, other than
expanding them by hand?

If not, would you consider implementing an "unsigned byte ranges mode"
where all RE ranges are interpreted using the unsigned 8-bit
byte ordering, \000 to \377, without regard to the locale?
      
I suggest that this mode be selected by setting a new built-in variable
UNSIGNED_BYTE_RANGES to a nonzero value.
  
This feature would make it easy to fix old scripts without interfering
with other locale-related semantics (such as character classes,
collating elements, and equivalence classes).

A new built-in variable seems much better than a new command-line
switch ("--unsigned-byte-ranges"). With the variable, one can easily
fix an old script by adding "UNSIGNED_BYTE_RANGES = 1" to the BEGIN
block. One may also set and reset the variable during the script's
execution, as needed.

With a command-line option, one must find and fix all *uses*
of the script, and the option affects every RE in the script.

All the best,

--stolfi

Jorge Stolfi
Institute of Computing
State University of Campinas (UNICAMP)



