m4-patches

Re: branch-1_4 off-by-one in line reporting


From: Eric Blake
Subject: Re: branch-1_4 off-by-one in line reporting
Date: Tue, 17 Oct 2006 22:47:16 +0000 (UTC)
User-agent: Loom/3.14 (http://gmane.org/)

Hi, Gary,

Gary V. Vaughan <gary <at> gnu.org> writes:

> 
> On 12 Oct 2006, at 16:13, Eric Blake wrote:
> > I still want CVS head to follow Solaris' parsing precedence
> > rules (macros, then quotes, then comments), rather than the current
> > behavior (comments, macros, quotes).
> 
> Can you remind me why that is?  The first thing that happens in any
> parser I'm familiar with is to discard the comments, why is it a good
> thing for M4 to behave differently?  (I think I know an answer, but
> I'm curious to understand your reasoning here)

Most languages have the (rather nice) property that you cannot confuse comments 
with other tokens.  M4, on the other hand, thanks to changequote and changecom, 
can be placed into a position where it is ambiguous whether the parser should 
recognize the current character as the start of a macro or the start of a 
comment.  (Fortunately for changesyntax, we document that syntax designations 
are mutually exclusive - you cannot use changesyntax to simultaneously make a 
character both a letter and a comment start.)  The dilemma is not so much that 
comments are consumed without expanding the macros inside them, as it is 
recognizing what constitutes a comment in the first place.

I guess an analog to this dilemma is the C89 vs. C99 parse question:
int i = 1 //*
//*/
-1; /* Is i 0 or -1? */

In C89, there are no // comments, so it parses as 'int i = 1 / <comment> -1;', 
giving -1.  In C99, the parser sees 'int i = 1 <comment> <comment> -1;', giving 
an answer of 0.  Because C99 changed the comment syntax to allow an additional 
form, it is possible to encounter (admittedly unusual) test cases that can 
expose the difference.
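
To make the two readings concrete, here is a minimal Python sketch that strips 
comments under each rule and prints what is left.  The name strip_comments and 
its line_comments flag are made up for this illustration, and the sketch 
ignores string and character literals, which a real C lexer has to handle.

```python
def strip_comments(src, line_comments):
    """Replace each comment with one space.

    line_comments=False models C89 (only /* */ comments);
    line_comments=True models C99 (// comments as well).
    String and character literals are deliberately ignored.
    """
    out = []
    i = 0
    while i < len(src):
        if src.startswith("/*", i):
            end = src.find("*/", i + 2)
            i = len(src) if end < 0 else end + 2
            out.append(" ")
        elif line_comments and src.startswith("//", i):
            end = src.find("\n", i)
            i = len(src) if end < 0 else end  # the newline itself is kept
            out.append(" ")
        else:
            out.append(src[i])
            i += 1
    return "".join(out)


SNIPPET = "int i = 1 //*\n//*/\n-1;"

# Collapse whitespace so the remaining tokens are easy to compare.
c89 = " ".join(strip_comments(SNIPPET, line_comments=False).split())
c99 = " ".join(strip_comments(SNIPPET, line_comments=True).split())
print(c89)  # int i = 1 / -1;
print(c99)  # int i = 1 -1;
```

The same characters yield a division under the C89 rule and a subtraction 
under the C99 rule, purely because of which comment form is recognized first.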

Now, for a concrete example in m4.
$ /usr/xpg4/bin/m4
define(a,A)define(a1a2a,b)changecom(1,2)a1a2a
b
a 1 a 2 a
A 1 a 2 A
$ 

Here, both Solaris and GNU agree - once you start parsing a macro name, you 
greedily consume as many additional characters as fit in a name, even if you 
could otherwise recognize a comment or quote were you to not be greedy.

$ /usr/xpg4/bin/m4
define(a,A)define(b,B)changequote(`a',c) a b c
 A B c
$

Again, both implementations agree - the a is recognized as a macro name and 
expanded to A, and not recognized as a quote start, so b gets expanded and all 
three letters are printed.

$ /usr/xpg4/bin/m4
define(a,A)define(b,B)changecom(`a',]) a b ]
 A B ]
$ m4
define(a,A)define(b,B)changecom(`a',]) a b ]
 a b ]
$

Hmm, now we have a difference.  Solaris said that 'a' matches a macro name, so 
expand it to A, at which point there is no comment recognized and b gets 
expanded.  GNU 1.4.x said that 'a' matches the comment start string, so look 
for ], and everything in between, including 'b', is output untouched.
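
The difference can be modeled with a toy scanner that takes the precedence as 
a parameter.  This is only a sketch under stated assumptions, not code from 
either implementation: the function name scan_macros_vs_comments and the 
comments_first flag are hypothetical, quotes are left out entirely, macro 
expansions are not rescanned, and the comment start is assumed to be non-empty.

```python
import re

# A macro name: letter or underscore, then letters, digits, underscores.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def scan_macros_vs_comments(text, macros, com_start, com_end, comments_first):
    """Toy scanner; comments_first selects which token type wins a tie."""
    out = []
    i = 0
    while i < len(text):
        word = NAME.match(text, i)  # greedy: takes the longest name
        at_comment = text.startswith(com_start, i)
        if at_comment and (comments_first or not word):
            # Copy the comment through untouched, delimiters included.
            end = text.find(com_end, i + len(com_start))
            stop = len(text) if end < 0 else end + len(com_end)
            out.append(text[i:stop])
            i = stop
        elif word:
            name = word.group()
            out.append(macros.get(name, name))  # expand if defined
            i = word.end()
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


macros = {"a": "A", "b": "B"}
# Macros before comments, as Solaris behaves above:
print(repr(scan_macros_vs_comments(" a b ]", macros, "a", "]", False)))
# ' A B ]'
# Comments before macros, as GNU 1.4.x behaves above:
print(repr(scan_macros_vs_comments(" a b ]", macros, "a", "]", True)))
# ' a b ]'
```

With changecom(1,2) the flag makes no difference, because a digit cannot start 
a name and the greedy name match swallows a1a2a whole in either mode.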

$ /usr/xpg4/bin/m4
changecom(`[[[',`]]]')changequote(`[[',`]]')define(a,A)

[[a]]
a
[[[a]]]
[a]
changequote changecom changecom(`[[',`]]')changequote(`[[[',`]]]')
  
[[a]]
[[a]]
[[[a]]]
a
$ m4
changecom(`[[[',`]]]')changequote(`[[',`]]')define(a,A)

[[a]]
a
[[[a]]]
[[[a]]]
changequote changecom changecom(`[[',`]]')changequote(`[[[',`]]]')
  
[[a]]
[[a]]
[[[a]]]
[[[a]]]
$

Hmm, in Solaris, when the prefix was ambiguous between quote and comment, it 
always chose quote when given a chance, even when quote was the shorter 
prefix.  In GNU, on the other hand, the comment was always recognized first.  
If either implementation were a strictly greedy parser, then you would expect 
the longer start token to be recognized in preference to the shorter one.
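
Here is a small Python sketch of the three candidate policies for a shared 
prefix, checked against the transcripts above.  Everything in it is 
illustrative: scan_quotes_vs_comments and the policy names are hypothetical, 
and the sketch handles neither nested quotes nor macro expansion (none of the 
inputs above need either).

```python
def scan_quotes_vs_comments(text, quote, comment, policy):
    """Toy scanner for quote/comment delimiters that share a prefix.

    quote and comment are (start, end) pairs.  policy is 'quote-first',
    'comment-first', or 'longest' (a strictly greedy parser).
    """
    q_start, q_end = quote
    c_start, c_end = comment
    out = []
    i = 0
    while i < len(text):
        is_q = text.startswith(q_start, i)
        is_c = text.startswith(c_start, i)
        if is_q and is_c:  # ambiguous: let the policy decide
            if policy == "longest":
                is_q = len(q_start) >= len(c_start)
            else:
                is_q = policy == "quote-first"
            is_c = not is_q
        if is_q:
            # Strip the quote delimiters, keep the contents verbatim.
            end = text.find(q_end, i + len(q_start))
            if end < 0:
                end = len(text)
            out.append(text[i + len(q_start):end])
            i = end + len(q_end)
        elif is_c:
            # Copy the comment through, delimiters included.
            end = text.find(c_end, i + len(c_start))
            stop = len(text) if end < 0 else end + len(c_end)
            out.append(text[i:stop])
            i = stop
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


first = dict(quote=("[[", "]]"), comment=("[[[", "]]]"))
second = dict(quote=("[[[", "]]]"), comment=("[[", "]]"))
for policy in ("quote-first", "comment-first", "longest"):
    print(policy,
          scan_quotes_vs_comments("[[[a]]]", policy=policy, **first),
          scan_quotes_vs_comments("[[[a]]]", policy=policy, **second))
# quote-first [a] a
# comment-first [[[a]]] [[[a]]]
# longest [[[a]]] a
```

The 'longest' row matches neither the Solaris column pair nor the GNU one, 
which is the sense in which neither implementation is strictly greedy.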

POSIX does not explicitly document precedence in m4 between the three types of 
tokens.  However, it does describe them in the order macros, then quotes, then 
comments, which is the same precedence that Solaris uses.  The only time it 
should matter is if 
comments and quotes share a common prefix; or if comments and/or quotes start 
with a letter or underscore.  If anything, the reason I am proposing delaying 
the recognition of comments until after macro names and quote starts have been 
recognized is to match historical behavior, and so that GNU M4 parsing at least 
follows the order that the three token types are mentioned in POSIX.

-- 
Eric Blake





