bug-gawk

Re: [bug-gawk] Unexpected results with RS="."


From: Ed Morton
Subject: Re: [bug-gawk] Unexpected results with RS="."
Date: Mon, 11 Jun 2018 13:13:37 -0500
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.8.0

Thanks Arnold!

On 6/11/2018 12:44 PM, address@hidden wrote:
I have added a paragraph about this point and pushed it out to Git.

Thanks,

Arnold

Ed Morton <address@hidden> wrote:

Arnold - thanks for responding. I don't agree that it is clear, as that section
doesn't state that the 3 possibilities are considered in that order. It sounds
like they would be mutually exclusive, but of course they aren't when it
comes to RS=".", so what happens in gawk when the single char is a regexp is
ambiguous if that's the only statement about the behavior. In any case, I
didn't even look at the Summary section, as I expected to find everything I
needed related to this in the main section, 4.1 How Input Is Split into Records
(https://www.gnu.org/software/gawk/manual/gawk.html#Records).

Since a Summary should be just that, I'd have expected this particular
information in section 4.14 to be summarized from section 4.1, not additional
to it. What's stated in 4.14 is fine as a summary, but not adequate if it's the
ONLY source of info on this. It also doesn't explain how to get an RS that means
"any single character", and IMHO that is non-obvious (embarrassingly, I had to
ask at comp.lang.awk, where Janis helped me wrap my head around it as I was
drawing a blank!).

I see now there's a clear statement of the related behavior for FS in section
4.5 Specifying How Fields Are Separated
(https://www.gnu.org/software/gawk/manual/gawk.html#Field-Separators):

     If FS is any other single character, such as ",", then each
     occurrence of that character separates two fields. Two consecutive
     occurrences delimit an empty field. If the character occurs at the
     beginning or the end of the line, that too delimits an empty field.
     The space character is the only single character that does not
     follow these rules.

I think RS deserves the equivalent explanation in section 4.1, plus the example
of using an RS that's any char (FS doesn't need it since there's no equivalent
to RT that'd be useful in this case, and FPAT="." works as you'd expect, so
there's no use case for FS="." as a regexp).

    Ed.

On 6/11/2018 1:07 AM, address@hidden wrote:
Hi Ed.

The behavior is stated clearly, if tersely, in the summary section in the
chapter on reading input
(https://www.gnu.org/software/gawk/manual/html_node/Input-Summary.html#Input-Summary):


        Input is split into records based on the value of RS. The
        possibilities are as follows:

        Value of RS             Records are split on ...        awk / gawk
        Any single character    That character                  awk
        The empty string ("")   Runs of two or more newlines    awk
        A regexp                Text that matches the regexp    gawk

Thanks,

Arnold


Ed Morton <address@hidden> wrote:

I was recently surprised by this behavior from gawk 4.2.0:

    $ echo "foo" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
    1 <foo
    :>

I came across this because I was trying to process data 1 char at a time and
thought having RT set to 1 char at a time might be a valid approach rather than
writing a loop. I'm not looking for alternatives, just wondering about this
specific functionality.

A little investigation shows that it behaves as if I'd used RS='[.]':

    $ echo "foo.bar" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
    1 <foo:.>
    2 <bar
    :>

I expected that RT would take the values f, o, o, \n and every $0 would be the
null string, analogous to what happens when you use 2 "."s:

    $ echo "foo" | awk -v RS='..' '{print NR, "<" $0 ":" RT ">"}'
    1 <:fo>
    2 <:o
    >

I assume it does this for compatibility with other awks, where a single char RS
is always just that literal character, but that seems counter-intuitive given
the way gawk uses RS as a regexp otherwise. I also don't know how we're supposed
to set the RS to "any single character" given this implementation, whereas if
RS="." was interpreted as a normal regexp then we could use `RS="[.]"` to get a
literal "." just like we do in any other regexp context.

I've since discovered that I can get the behavior I want with `RS=".{1}"` or
`RS="[[:space:]]|[^[:space:]]"` etc., but it's all pretty kludgy and
non-intuitive.

I can't find anything in the gawk documentation that states that the above is
expected behavior. Assuming we can't update the code to treat RS="." as if "."
is a regexp metacharacter for backward compatibility, can we get a statement
saying something clear like "If RS is a single character it will be treated as a
literal character and not a regexp metacharacter" added to the documentation and
also the example of RS=".{1}" shown as a workaround for the case where the
desired regexp is "a single occurrence of any character"? I can't think of any
other regexp metacharacter that this issue would apply to.

    Ed.



