Re: "^.*$" not matching in sed with certain characters?

bug-gnu-utils

From:	Eric Blake
Subject:	Re: "^.*$" not matching in sed with certain characters?
Date:	Mon, 26 Sep 2011 17:22:58 -0600
User-agent:	Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.22) Gecko/20110906 Fedora/3.1.14-1.fc14 Lightning/1.0b3pre Mnenhy/0.8.3 Thunderbird/3.1.14

[adding bug-grep]

On 09/26/2011 03:19 PM, Linus Lüssing wrote:

Hi,

I just wanted to use a regex in the following way:

$ ls tmp/*.zip | sed "s/^.*$/foo/"
tmp/Maniax Memori - L�%82�ktro Pöppe -- Jamendo - OGG Vorbis q7 - 2006.05.11 
[www.jamendo.com].zip

But I was a little startled by the output. If I remember
correctly, then "^.*$" should always match, shouldn't it? I would
have expected to get a "foo" back in this case.


Your problem is related to locales and multi-byte characters.

Regular expression operation is explicitly left undefined by POSIX when you are outside of the C single-byte locale, if you pass input that is not a valid character sequence.

It looks like sed is interpreting '.' as the regular expression for any valid character. When your file name (which is NOT valid UTF-8) is passed through in a UTF-8 locale, the '.' cannot match the invalid byte sequences, and thus '^.*$' does not match that particular file name.
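One way to confirm that a name is not valid UTF-8 is to round-trip it through iconv, which fails on an illegal input sequence. This is only a sketch: the helper name is_valid_utf8 and the sample strings are made up for illustration (a lone 0x82 byte stands in for whatever bytes the real file name contains), and it assumes an iconv utility is installed.

```shell
# Hypothetical helper: succeeds if stdin is valid UTF-8, fails otherwise.
# iconv exits non-zero when it meets an illegal input sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# \202 is the lone byte 0x82, which cannot stand alone in UTF-8.
if printf 'L\202ktro\n' | is_valid_utf8; then
    check_bad=valid
else
    check_bad=invalid
fi

# \303\266 is the two-byte UTF-8 encoding of 'ö'.
if printf 'P\303\266ppe\n' | is_valid_utf8; then
    check_good=valid
else
    check_good=invalid
fi

printf 'lone 0x82 byte: %s\n' "$check_bad"
printf 'encoded o-umlaut: %s\n' "$check_good"
```

To check real file names, the same helper could be fed each line of the ls output in turn.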


grep seems to match fine though:
$ ls tmp/*.zip | grep "^.*$"
tmp/Maniax Memori - L�%82�ktro Pöppe -- Jamendo - OGG Vorbis q7 - 2006.05.11 
[www.jamendo.com].zip

Which means grep has a different interpretation of regex than sed, in that it is treating '.' as the regular expression that matches any character _or invalid byte sequence_. POSIX permits both interpretations (since invalid byte sequences tend to be a corner case that no one wants to standardize), so you don't have an actual bug here. But on the other hand, it would indeed be nice if GNU software would present a consistent front.

Unfortunately, I'm 50-50 on which behavior is better (letting '.' match invalid byte sequences, vs. matching only valid characters, when in a multibyte locale). So I don't know whether sed or grep (or both!) should be patched.

In the meantime, you can always work around the issue by using LC_ALL=C, to force behavior into a single-byte locale where the behavior is both well-defined by POSIX and consistent between the two tools.
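A minimal sketch of that workaround, using a made-up name with a lone 0x82 byte rather than the original file name (the variable names are only for illustration):

```shell
# Illustrative input: \202 is the byte 0x82, which is invalid on its own in UTF-8.
name=$(printf 'tmp/L\202ktro.zip')

# With LC_ALL=C every byte is a single character, so '.' matches 0x82
# and '^.*$' covers the whole line.
result=$(printf '%s\n' "$name" | LC_ALL=C sed 's/^.*$/foo/')

printf '%s\n' "$result"   # prints: foo
```

Setting LC_ALL only on the sed command, rather than exporting it, keeps the rest of the pipeline in your normal locale.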


--
Eric Blake   address@hidden    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
