bug-gnu-utils

Re: "^.*$" not matching in sed with certain characters?


From: Eric Blake
Subject: Re: "^.*$" not matching in sed with certain characters?
Date: Mon, 26 Sep 2011 17:22:58 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.22) Gecko/20110906 Fedora/3.1.14-1.fc14 Lightning/1.0b3pre Mnenhy/0.8.3 Thunderbird/3.1.14

[adding bug-grep]

On 09/26/2011 03:19 PM, Linus Lüssing wrote:
Hi,

I just wanted to use a regex in the following way:

$ ls tmp/*.zip | sed "s/^.*$/foo/"
tmp/Maniax Memori - L�%82�ktro Pöppe -- Jamendo - OGG Vorbis q7 - 2006.05.11 
[www.jamendo.com].zip

But I was a little startled by the output. If I remember
correctly, then "^.*$" should always match, shouldn't it? I would
have expected to get a "foo" back in this case.

Your problem is related to locales and multi-byte characters.

POSIX explicitly leaves regular expression behavior undefined when you are outside the single-byte C locale and pass input that is not a valid character sequence.

It looks like sed interprets '.' as the regular expression for any valid character. When your file name (which is NOT valid UTF-8) is passed through in a UTF-8 locale, '.' cannot match the invalid byte sequences, and thus '^.*$' does not match that particular file name.
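
One way to check whether a name contains bytes that are not valid UTF-8 is to round-trip it through iconv, which fails on invalid input. The following is only a sketch: the byte \202 (0x82) is an assumed stand-in, since the exact bytes of the file name above are not known here, and is_valid_utf8 is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds only if everything on stdin is valid UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# A lone 0x82 byte (octal \202) is a continuation byte with no lead byte,
# so it cannot be part of a valid UTF-8 sequence. The byte is an assumed
# stand-in for whatever is in the real file name.
if printf 'L\202ktro.zip\n' | is_valid_utf8; then
    echo "valid UTF-8"
else
    echo "invalid UTF-8"
fi
```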


grep seems to match fine though:
$ ls tmp/*.zip | grep "^.*$"
tmp/Maniax Memori - L�%82�ktro Pöppe -- Jamendo - OGG Vorbis q7 - 2006.05.11 
[www.jamendo.com].zip

Which means grep has a different interpretation of regex than sed, in that it is treating '.' as the regular expression that matches any character _or invalid byte sequence_. POSIX permits both interpretations (since invalid byte sequences tend to be a corner case that no one wants to standardize), so you don't have an actual bug here. But on the other hand, it would indeed be nice if GNU software would present a consistent front.

Unfortunately, I'm 50-50 on which behavior is better (letting '.' match invalid byte sequences, vs. matching only valid characters, when in a multibyte locale). So I don't know whether sed or grep (or both!) should be patched.

In the meantime, you can always work around the issue by setting LC_ALL=C, which forces a single-byte locale where the behavior is both well-defined by POSIX and consistent between the two tools.
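
Applied to a pipeline like the original one, the workaround could look like the sketch below. The sample name is a shortened stand-in containing an assumed stray byte (\202), not the real file name.

```shell
# Feed a name containing a stray byte (assumed stand-in) to sed in the C locale.
# In the C locale every byte is a character, so '^.*$' matches the whole line.
result=$(printf 'tmp/L\202ktro.zip\n' | LC_ALL=C sed 's/^.*$/foo/')
echo "$result"
```

Setting LC_ALL only on the sed command changes the locale for that one process and leaves the rest of the shell session as it was.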

--
Eric Blake   address@hidden    +1-801-349-2682
Libvirt virtualization library http://libvirt.org


