branch-1_4 tokens vs. argument collection

From:	Eric Blake
Subject:	branch-1_4 tokens vs. argument collection
Date:	Thu, 3 Aug 2006 03:25:59 +0000 (UTC)
User-agent:	Loom/3.14 (http://gmane.org/)
This patch cleans up token processing so that macro.c never looks at raw
characters and therefore never consumes an entire quote or comment when it
only meant to consume `(' as part of argument collection.
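
The ordering that makes this work can be sketched in a few lines of C.  This
is only an illustration, not the code in the patch: the names kind, has_prefix
and classify are hypothetical, the input is a plain string rather than m4's
input stack, and changeword and macro-definition tokens are left out.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical token kinds, loosely mirroring the patch's token_type.  */
enum kind { K_EOF, K_STRING, K_WORD, K_OPEN, K_COMMA, K_CLOSE, K_SIMPLE };

/* Return nonzero if S is a non-empty prefix of INPUT.  */
static int
has_prefix (const char *input, const char *s)
{
  return *s != '\0' && strncmp (input, s, strlen (s)) == 0;
}

/* Classify the token at the start of INPUT without consuming anything.
   Comment and quote delimiters are tested before the bare characters,
   so a delimiter beginning with '(' wins over argument collection.  */
static enum kind
classify (const char *input, const char *bcomm, const char *lquote)
{
  unsigned char ch = (unsigned char) *input;

  if (ch == '\0')
    return K_EOF;
  if (has_prefix (input, bcomm))
    return K_STRING;            /* a comment */
  if (isalpha (ch) || ch == '_')
    return K_WORD;
  if (has_prefix (input, lquote))
    return K_STRING;            /* a quoted string */
  switch (ch)
    {
    case '(': return K_OPEN;
    case ',': return K_COMMA;
    case ')': return K_CLOSE;
    default:  return K_SIMPLE;
    }
}
```

With the default delimiters, classify ("(hi)", "#", "`") gives K_OPEN; with
the start-quote set to "((", the input "((hi))" gives K_STRING, so a caller
that only starts argument collection on K_OPEN leaves the quote alone.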

Note one consequence of this patch.  If the comment delimiter currently
starts with `(', peek_token reports a comment after the word changecom, so
changecom is processed without arguments.  Because that call changes the
tokenization rules midstream, next_token then sees the same input character
as just a plain `('.  The same applies to a quote delimiter starting with
`(' and a call to changequote.
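
A toy model of that midstream change, assuming a hypothetical lexer struct
that holds only the remaining input and the current comment-start string;
disable_comments stands in for whatever changecom installs, as long as the
new start no longer begins with `(':

```c
#include <string.h>

/* Hypothetical lexer state: remaining input plus the comment-start
   string (empty when comments are disabled).  */
struct lexer
{
  const char *input;
  const char *bcomm;
};

enum peeked { P_COMMENT, P_OPEN, P_OTHER };

/* Classify the next token without consuming input.  */
static enum peeked
peek_kind (const struct lexer *lx)
{
  size_t len = strlen (lx->bcomm);

  if (len > 0 && strncmp (lx->input, lx->bcomm, len) == 0)
    return P_COMMENT;
  return *lx->input == '(' ? P_OPEN : P_OTHER;
}

/* Stand-in for the effect of a changecom call after which the
   comment start no longer matches '('.  */
static void
disable_comments (struct lexer *lx)
{
  lx->bcomm = "";
}
```

Peeking at the input "(x)" with bcomm set to "(" reports P_COMMENT; after
disable_comments, peeking at the very same input reports P_OPEN.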

2006-08-02  Eric Blake  <address@hidden>

        Don't confuse leading `(' in comment or quote with start of
        argument collection.
        * src/m4.h (enum token_type): Add TOKEN_OPEN, TOKEN_COMMA,
        TOKEN_CLOSE.
        (peek_input): Make private to input.c.
        (peek_token): New prototype.
        * src/input.c (default_word_regexp): Reduce ifdefs.
        (peek_input): Make static.
        (next_token): Return new token types.
        (match_input, MATCH): Add argument consume, which controls
        whether match should be pushed back.
        (peek_token): New function.
        (token_type_string) [DEBUG_INPUT]: New function.
        * src/macro.c (expand_token, expand_argument, collect_arguments):
        Handle new token types.
        * doc/m4.texinfo (Changequote, Changecom): Document this.
        * NEWS: Document this.
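
The consume argument mentioned above can be pictured with a simplified
matcher.  This is a sketch under assumptions, not the patched match_input:
the real function reads characters one at a time and pushes them back through
an obstack, whereas the hypothetical struct stream below just keeps an index
into a string.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical input stream: a string and a read position.  */
struct stream
{
  const char *text;
  size_t pos;
};

/* Return true if S is a prefix of the unread input.  The input is
   advanced past S only when the match succeeds and CONSUME is true;
   in every other case the read position is left untouched.  */
static bool
match_prefix (struct stream *in, const char *s, bool consume)
{
  size_t len = strlen (s);

  if (len == 0 || strncmp (in->text + in->pos, s, len) != 0)
    return false;
  if (consume)
    in->pos += len;
  return true;
}
```

A peek-style caller passes false and can ask the same question again later;
a next-style caller passes true and continues reading after the delimiter.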


Index: NEWS
===================================================================
RCS file: /sources/m4/m4/NEWS,v
retrieving revision 1.1.1.1.2.47
diff -u -r1.1.1.1.2.47 NEWS
--- NEWS        1 Aug 2006 13:05:45 -0000       1.1.1.1.2.47
+++ NEWS        2 Aug 2006 23:16:47 -0000
@@ -17,12 +17,14 @@
   collection.
 * The dnl macro now warns if end of file is encountered instead of a
   newline.
-* The error message when end of file is encountered now uses the file where
-  the dangling construct started, rather than "NONE:0".
+* The error message when end of file is encountered now uses the file and
+  line where the dangling construct started, rather than `NONE:0'.
 * The __file__ macro, and the -s/--synclines option, now show what
   directory a file was found in when the -I/--include option or M4PATH
   variable had an effect.
-* The changequote and changecom macros now work with 8-bit characters.
+* The changequote and changecom macros now work with 8-bit characters, and
+  quotes and comments that begin with `(' are properly recognized following
+  a word.
 
 Version 1.4.5 - 15 July 2006, by Eric Blake  (CVS version 1.4.4c)
 
Index: doc/m4.texinfo
===================================================================
RCS file: /sources/m4/m4/doc/m4.texinfo,v
retrieving revision 1.1.1.1.2.59
diff -u -r1.1.1.1.2.59 m4.texinfo
--- doc/m4.texinfo      1 Aug 2006 13:05:45 -0000       1.1.1.1.2.59
+++ doc/m4.texinfo      2 Aug 2006 23:16:47 -0000
@@ -2420,6 +2420,37 @@
 @result{} hi  HI
 @end example
 
+Quotes are recognized in preference to argument collection.  In
+particular, if @var{start} is a single @samp{(}, then argument
+collection is effectively disabled.  For portability with other
+implementations, it is a good idea to avoid @samp{(}, @samp{,}, and
+@samp{)} as the first character in @var{start}.
+
+@example
+define(`echo', `$#:$@:')
+@result{}
+define(`hi', `HI')
+@result{}
+changequote(`(',`)')
+@result{}
+echo(hi)
+@result{}0::hi
+changequote
+@result{}
+changequote(`((', `))')
+@result{}
+echo(hi)
+@result{}1:HI:
+echo((hi))
+@result{}0::hi
+changequote
+@result{}
+changequote(`,', `)')
+@result{}
+echo(hi,hi)bye)
+@result{}1:HIhibye:
+@end example
+
 If @var{end} is a prefix of @var{start}, the end-quote will be
 recognized in preference to a nested begin-quote.  In particular,
 changing the quotes to have the same string for @var{start} and
@@ -2529,10 +2560,11 @@
 @end ignore
 
 Comments are recognized in preference to macros.  However, this is not
-compatible with other implementations, where macros take precedence over
-comments, so it may change in a future release.  For portability, this
-means that @var{start} should not begin with a letter or @samp{_}
-(underscore).
+compatible with other implementations, where macros and even quoting
+take precedence over comments, so it may change in a future release.
+For portability, this means that @var{start} should not begin with a
+letter or @samp{_} (underscore), and that neither the start-quote nor
+the start-comment string should be a prefix of the other.
 
 @example
 define(`hi', `HI')
@@ -2543,6 +2575,35 @@
 @result{}q hi Q HI
 @end example
 
+Comments are recognized in preference to argument collection.  In
+particular, if @var{start} is a single @samp{(}, then argument
+collection is effectively disabled.  For portability with other
+implementations, it is a good idea to avoid @samp{(}, @samp{,}, and
+@samp{)} as the first character in @var{start}.
+
+@example
+define(`echo', `$#:$@:')
+@result{}
+define(`hi', `HI')
+@result{}
+changecom(`(',`)')
+@result{}
+echo(hi)
+@result{}0::(hi)
+changecom
+@result{}
+changecom(`((', `))')
+@result{}
+echo(hi)
+@result{}1:HI:
+echo((hi))
+@result{}0::((hi))
+changecom(`,', `)')
+@result{}
+echo(hi,hi)bye)
+@result{}1:HI,hi)bye:
+@end example
+
 It is an error if the end of file occurs within a comment.
 
 @example
Index: src/input.c
===================================================================
RCS file: /sources/m4/m4/src/Attic/input.c,v
retrieving revision 1.1.1.1.2.16
diff -u -r1.1.1.1.2.16 input.c
--- src/input.c 2 Aug 2006 15:11:58 -0000       1.1.1.1.2.16
+++ src/input.c 2 Aug 2006 23:16:47 -0000
@@ -140,14 +140,20 @@
 
 #ifdef ENABLE_CHANGEWORD
 
-#define DEFAULT_WORD_REGEXP "[_a-zA-Z][_a-zA-Z0-9]*"
+# define DEFAULT_WORD_REGEXP "[_a-zA-Z][_a-zA-Z0-9]*"
 
 static char *word_start;
 static struct re_pattern_buffer word_regexp;
 static int default_word_regexp;
 static struct re_registers regs;
 
-#endif /* ENABLE_CHANGEWORD */
+#else /* ! ENABLE_CHANGEWORD */
+# define default_word_regexp 1
+#endif /* ! ENABLE_CHANGEWORD */
+
+#ifdef DEBUG_INPUT
+static const char *token_type_string (token_type);
+#endif
 
 
 /*-------------------------------------------------------------------------.
@@ -229,7 +235,7 @@
     }
 
   next = (input_block *) obstack_alloc (current_input,
-                                       sizeof (struct input_block));
+                                       sizeof (struct input_block));
   next->type = INPUT_STRING;
   return current_input;
 }
@@ -278,7 +284,7 @@
 {
   input_block *i;
   i = (input_block *) obstack_alloc (wrapup_stack,
-                                     sizeof (struct input_block));
+                                    sizeof (struct input_block));
   i->prev = wsp;
   i->type = INPUT_STRING;
   i->u.u_s.string = obstack_copy0 (wrapup_stack, s, strlen (s));
@@ -309,16 +315,16 @@
                        isp->u.u_f.name, isp->u.u_f.lineno);
 
       if (ferror (isp->u.u_f.file))
-        {
-          M4ERROR ((warning_status, 0, "read error"));
-          fclose (isp->u.u_f.file);
-          retcode = EXIT_FAILURE;
-        }
+       {
+         M4ERROR ((warning_status, 0, "read error"));
+         fclose (isp->u.u_f.file);
+         retcode = EXIT_FAILURE;
+       }
       else if (fclose (isp->u.u_f.file) == EOF)
-        {
-          M4ERROR ((warning_status, errno, "error reading file"));
-          retcode = EXIT_FAILURE;
-        }
+       {
+         M4ERROR ((warning_status, errno, "error reading file"));
+         retcode = EXIT_FAILURE;
+       }
       current_file = isp->u.u_f.name;
       current_line = isp->u.u_f.lineno;
       output_current_line = isp->u.u_f.out_lineno;
@@ -409,7 +415,7 @@
 | input stack.                                                           |
 `------------------------------------------------------------------------*/
 
-int
+static int
 peek_input (void)
 {
   int ch;
@@ -536,36 +542,48 @@
 }
 
 
-/*----------------------------------------------------------------------.
-| This function is for matching a string against a prefix of the input  |
-| stream.  If the string matches the input, the input is discarded,     |
-| otherwise the characters read are pushed back again.  The function is |
-| used only when multicharacter quotes or comment delimiters are used.  |
-`----------------------------------------------------------------------*/
+/*------------------------------------------------------------------.
+| This function is for matching a string against a prefix of the    |
+| input stream.  If the string matches the input and consume is     |
+| TRUE, the input is discarded; otherwise any characters read are   |
+| pushed back again.  The function is used only when multicharacter |
+| quotes or comment delimiters are used.                            |
+`------------------------------------------------------------------*/
 
-static int
-match_input (const char *s)
+static boolean
+match_input (const char *s, boolean consume)
 {
   int n;                       /* number of characters matched */
   int ch;                      /* input character */
   const char *t;
+  boolean result = FALSE;
 
   ch = peek_input ();
   if (ch != to_uchar (*s))
-    return 0;                  /* fail */
-  (void) next_char ();
+    return FALSE;                      /* fail */
 
   if (s[1] == '\0')
-    return 1;                  /* short match */
+    {
+      if (consume)
+       (void) next_char ();
+      return TRUE;                     /* short match */
+    }
 
-  for (n = 1, t = s++; (ch = peek_input ()) == to_uchar (*s++); n++)
+  (void) next_char ();
+  for (n = 1, t = s++; (ch = peek_input ()) == to_uchar (*s++); )
     {
       (void) next_char ();
+      n++;
       if (*s == '\0')          /* long match */
-       return 1;
+       {
+         if (consume)
+           return TRUE;
+         result = TRUE;
+         break;
+       }
     }
 
-  /* Failed, push back input.  */
+  /* Failed or shouldn't consume, push back input.  */
   {
     struct obstack *h = push_string_init ();
 
@@ -573,20 +591,23 @@
     obstack_grow (h, t, n);
   }
   push_string_finish ();
-  return 0;
+  return result;
 }
 
-/*------------------------------------------------------------------------.
-| The macro MATCH() is used to match a string against the input.  The    |
-| first character is handled inline, for speed.  Hopefully, this will not |
-| hurt efficiency too much when single character quotes and comment      |
-| delimiters are used.                                                   |
-`------------------------------------------------------------------------*/
+/*--------------------------------------------------------------------.
+| The macro MATCH() is used to match a string S against the input.    |
+| The first character is handled inline, for speed.  Hopefully, this  |
+| will not hurt efficiency too much when single character quotes and  |
+| comment delimiters are used.  If CONSUME, then CH is the result of  |
+| next_char, and a successful match will discard the matched string.  |
+| Otherwise, CH is the result of peek_input, and the input stream is  |
+| effectively unchanged.                                              |
+`--------------------------------------------------------------------*/
 
-#define MATCH(ch, s) \
+#define MATCH(ch, s, consume)                                           \
   (to_uchar ((s)[0]) == (ch)                                            \
    && (ch) != '\0'                                                      \
-   && ((s)[1] == '\0' || (match_input ((s) + 1))))
+   && ((s)[1] == '\0' || (match_input ((s) + (consume), consume))))
 
 
 /*----------------------------------------------------------.
@@ -770,16 +791,17 @@
       (void) next_char ();
 #ifdef DEBUG_INPUT
       fprintf (stderr, "next_token -> MACDEF (%s)\n",
-               find_builtin_by_addr (TOKEN_DATA_FUNC (td))->name);
+              find_builtin_by_addr (TOKEN_DATA_FUNC (td))->name);
 #endif
       return TOKEN_MACDEF;
     }
 
   (void) next_char ();
-  if (MATCH (ch, bcomm.string))
+  if (MATCH (ch, bcomm.string, TRUE))
     {
       obstack_grow (&token_stack, bcomm.string, bcomm.length);
-      while ((ch = next_char ()) != CHAR_EOF && !MATCH (ch, ecomm.string))
+      while ((ch = next_char ()) != CHAR_EOF
+            && !MATCH (ch, ecomm.string, TRUE))
        obstack_1grow (&token_stack, ch);
       if (ch != CHAR_EOF)
        obstack_grow (&token_stack, ecomm.string, ecomm.length);
@@ -791,11 +813,7 @@
 
       type = TOKEN_STRING;
     }
-#ifdef ENABLE_CHANGEWORD
   else if (default_word_regexp && (isalpha (ch) || ch == '_'))
-#else
-  else if (isalpha (ch) || ch == '_')
-#endif
     {
       obstack_1grow (&token_stack, ch);
       while ((ch = peek_input ()) != CHAR_EOF && (isalnum (ch) || ch == '_'))
@@ -812,7 +830,7 @@
     {
       obstack_1grow (&token_stack, ch);
       while (1)
-        {
+       {
          ch = peek_input ();
          if (ch == CHAR_EOF)
            break;
@@ -844,9 +862,23 @@
 
 #endif /* ENABLE_CHANGEWORD */
 
-  else if (!MATCH (ch, lquote.string))
+  else if (!MATCH (ch, lquote.string, TRUE))
     {
-      type = TOKEN_SIMPLE;
+      switch (ch)
+       {
+       case '(':
+         type = TOKEN_OPEN;
+         break;
+       case ',':
+         type = TOKEN_COMMA;
+         break;
+       case ')':
+         type = TOKEN_CLOSE;
+         break;
+       default:
+         type = TOKEN_SIMPLE;
+         break;
+       }
       obstack_1grow (&token_stack, ch);
     }
   else
@@ -861,13 +893,13 @@
            error_at_line (EXIT_FAILURE, 0, file, line,
                           "ERROR: end of file in string");
 
-         if (MATCH (ch, rquote.string))
+         if (MATCH (ch, rquote.string, TRUE))
            {
              if (--quote_level == 0)
                break;
              obstack_grow (&token_stack, rquote.string, rquote.length);
            }
-         else if (MATCH (ch, lquote.string))
+         else if (MATCH (ch, lquote.string, TRUE))
            {
              quote_level++;
              obstack_grow (&token_stack, lquote.string, lquote.length);
@@ -888,20 +920,127 @@
   TOKEN_DATA_ORIG_TEXT (td) = orig_text;
 #endif
 #ifdef DEBUG_INPUT
-  fprintf (stderr, "next_token -> %d (%s)\n", type, TOKEN_DATA_TEXT (td));
+  fprintf (stderr, "next_token -> %s (%s)\n",
+          token_type_string (type), TOKEN_DATA_TEXT (td));
 #endif
   return type;
 }
+
+/*-----------------------------------------------.
+| Peek at the next token from the input stream.  |
+`-----------------------------------------------*/
+
+token_type
+peek_token (void)
+{
+  int ch = peek_input ();
+
+  if (ch == CHAR_EOF)
+    {
+#ifdef DEBUG_INPUT
+      fprintf (stderr, "peek_token -> EOF\n");
+#endif
+      return TOKEN_EOF;
+    }
+  if (ch == CHAR_MACRO)
+    {
+#ifdef DEBUG_INPUT
+      fprintf (stderr, "peek_token -> MACDEF\n");
+#endif
+      return TOKEN_MACDEF;
+    }
+
+  if (MATCH (ch, bcomm.string, FALSE))
+    {
+#ifdef DEBUG_INPUT
+      fprintf (stderr, "peek_token -> COMMENT\n");
+#endif
+      return TOKEN_STRING;
+    }
+
+  if ((default_word_regexp && (isalpha (ch) || ch == '_'))
+#ifdef ENABLE_CHANGEWORD
+      || (! default_word_regexp && strchr (word_start, ch))
+#endif /* ENABLE_CHANGEWORD */
+      )
+    {
+#ifdef DEBUG_INPUT
+      fprintf (stderr, "peek_token -> WORD\n");
+#endif
+      return TOKEN_WORD;
+    }
+
+  if (MATCH (ch, lquote.string, FALSE))
+    {
+#ifdef DEBUG_INPUT
+      fprintf (stderr, "peek_token -> QUOTE\n");
+#endif
+      return TOKEN_STRING;
+    }
+
+  switch (ch)
+    {
+    case '(':
+#ifdef DEBUG_INPUT
+      fprintf (stderr, "peek_token -> OPEN\n");
+#endif
+      return TOKEN_OPEN;
+    case ',':
+#ifdef DEBUG_INPUT
+      fprintf (stderr, "peek_token -> COMMA\n");
+#endif
+      return TOKEN_COMMA;
+    case ')':
+#ifdef DEBUG_INPUT
+      fprintf (stderr, "peek_token -> CLOSE\n");
+#endif
+      return TOKEN_CLOSE;
+    default:
+#ifdef DEBUG_INPUT
+      fprintf (stderr, "peek_token -> SIMPLE\n");
+#endif
+      return TOKEN_SIMPLE;
+    }
+}
 
 
 #ifdef DEBUG_INPUT
 
+static const char *
+token_type_string (token_type t)
+{
+ switch (t)
+    {                          /* TOKSW */
+    case TOKEN_EOF:
+      return "EOF";
+    case TOKEN_STRING:
+      return "STRING";
+    case TOKEN_WORD:
+      return "WORD";
+    case TOKEN_OPEN:
+      return "OPEN";
+    case TOKEN_COMMA:
+      return "COMMA";
+    case TOKEN_CLOSE:
+      return "CLOSE";
+    case TOKEN_SIMPLE:
+      return "SIMPLE";
+    case TOKEN_MACDEF:
+      return "MACDEF";
+    default:
+      abort ();
+    }
+ }
+
 static void
 print_token (const char *s, token_type t, token_data *td)
 {
   fprintf (stderr, "%s: ", s);
   switch (t)
     {                          /* TOKSW */
+    case TOKEN_OPEN:
+    case TOKEN_COMMA:
+    case TOKEN_CLOSE:
     case TOKEN_SIMPLE:
       fprintf (stderr, "char:");
       break;
Index: src/m4.h
===================================================================
RCS file: /sources/m4/m4/src/m4.h,v
retrieving revision 1.1.1.1.2.23
diff -u -r1.1.1.1.2.23 m4.h
--- src/m4.h    30 Jul 2006 23:46:51 -0000      1.1.1.1.2.23
+++ src/m4.h    2 Aug 2006 23:16:47 -0000
@@ -219,10 +219,13 @@
 enum token_type
 {
   TOKEN_EOF,                   /* end of file */
-  TOKEN_STRING,                        /* a quoted string */
+  TOKEN_STRING,                        /* a quoted string or comment */
   TOKEN_WORD,                  /* an identifier */
-  TOKEN_SIMPLE,                        /* a single character */
-  TOKEN_MACDEF                 /* a macros definition (see "defn") */
+  TOKEN_OPEN,                  /* ( */
+  TOKEN_COMMA,                 /* , */
+  TOKEN_CLOSE,                 /* ) */
+  TOKEN_SIMPLE,                        /* any other single character */
+  TOKEN_MACDEF                 /* a macro's definition (see "defn") */
 };
 
 /* The data for a token, a macro argument, and a macro definition.  */
@@ -262,7 +265,7 @@
 typedef enum token_data_type token_data_type;
 
 void input_init (void);
-int peek_input (void);
+token_type peek_token (void);
 token_type next_token (token_data *);
 void skip_line (void);
 
Index: src/macro.c
===================================================================
RCS file: /sources/m4/m4/src/Attic/macro.c,v
retrieving revision 1.1.1.1.2.8
diff -u -r1.1.1.1.2.8 macro.c
--- src/macro.c 1 Aug 2006 13:05:45 -0000       1.1.1.1.2.8
+++ src/macro.c 2 Aug 2006 23:16:47 -0000
@@ -66,6 +66,9 @@
     case TOKEN_MACDEF:
       break;
 
+    case TOKEN_OPEN:
+    case TOKEN_COMMA:
+    case TOKEN_CLOSE:
     case TOKEN_SIMPLE:
     case TOKEN_STRING:
       shipout_text (obs, TOKEN_DATA_TEXT (td), strlen (TOKEN_DATA_TEXT (td)));
@@ -76,7 +79,7 @@
       if (sym == NULL || SYMBOL_TYPE (sym) == TOKEN_VOID
          || (SYMBOL_TYPE (sym) == TOKEN_FUNC
              && SYMBOL_BLIND_NO_ARGS (sym)
-             && peek_input () != '('))
+             && peek_token () != TOKEN_OPEN))
        {
 #ifdef ENABLE_CHANGEWORD
          shipout_text (obs, TOKEN_DATA_ORIG_TEXT (td),
@@ -134,11 +137,10 @@
 
       switch (t)
        {                       /* TOKSW */
-       case TOKEN_SIMPLE:
-         text = TOKEN_DATA_TEXT (&td);
-         if ((*text == ',' || *text == ')') && paren_level == 0)
+       case TOKEN_COMMA:
+       case TOKEN_CLOSE:
+         if (paren_level == 0)
            {
-
              /* The argument MUST be finished, whether we want it or not.  */
              obstack_1grow (obs, '\0');
              text = obstack_finish (obs);
@@ -148,8 +150,12 @@
                  TOKEN_DATA_TYPE (argp) = TOKEN_TEXT;
                  TOKEN_DATA_TEXT (argp) = text;
                }
-             return (boolean) (*TOKEN_DATA_TEXT (&td) == ',');
+             return (boolean) (t == TOKEN_COMMA);
            }
+         /* fallthru */
+       case TOKEN_OPEN:
+       case TOKEN_SIMPLE:
+         text = TOKEN_DATA_TEXT (&td);
 
          if (*text == '(')
            paren_level++;
@@ -198,7 +204,6 @@
 collect_arguments (symbol *sym, struct obstack *argptr,
                   struct obstack *arguments)
 {
-  int ch;                      /* lookahead for ( */
   token_data td;
   token_data *tdp;
   boolean more_args;
@@ -209,8 +214,7 @@
   tdp = (token_data *) obstack_copy (arguments, &td, sizeof (td));
   obstack_grow (argptr, &tdp, sizeof (tdp));
 
-  ch = peek_input ();
-  if (ch == '(')
+  if (peek_token () == TOKEN_OPEN)
     {
       next_token (&td);                /* gobble parenthesis */
       do