1.4.10b regression in comment parsing

m4-patches

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

1.4.10b regression in comment parsing

From:	Eric Blake
Subject:	1.4.10b regression in comment parsing
Date:	Sun, 03 Aug 2008 18:25:06 -0600
User-agent:	Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.16) Gecko/20080708 Thunderbird/2.0.0.16 Mnenhy/0.7.5.666

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

While trying to track down why an m4sugar expression was quadratic if I
passed the argument -1-, but linear when I passed [-]1[-], I discovered a
regression introduced in argv_ref stage 15, due to a latent bug in stage
5.  This means that the beta release 1.4.10b suffers from the bug.  In
addition to fixing the regression, this patch also solves the quadratic
performance.  In short, since `-' is not a special character to m4, it
should not trip the heuristics of arguments that require a O(n) rescan of
quoted $@ instead of an O(1) back-reference (repeat that over n iterations
of a $@ recursion, and you have quadratic vs. linear).  But comments were
never tripping the heuristic, even when they contained unbalanced quotes.
~ For an example of the bug, where unbalanced quotes inside comments indeed
affect rescanning:

define(`e', `$@')dnl
changecom(`<', `>')dnl
define(`n', `$#')dnl
n(e(<`>, <'>))

should output 1, not 2, even though e was given two arguments, because the
unbalanced quotes within those two arguments result in a nested quote
around the comma.

- --
Don't work too hard, make some time for fun as well!

Eric Blake             address@hidden
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkiWTGIACgkQ84KuGfSFAYBmLgCfR2JqgP5Y9B8Pygrypb4sJgvp
KkMAn0/3MKTKqnBrKwHU0jkOEWm9+8pR
=GQjM
-----END PGP SIGNATURE-----

>From 0d6fb01e76bc35550a00cbf7710d1471db9e7b00 Mon Sep 17 00:00:00 2001
From: Eric Blake <address@hidden>
Date: Sun, 3 Aug 2008 14:23:19 -0600
Subject: [PATCH] Fix regression in commenting unbalanced quotes, from 
2008-02-16.

* src/m4.h (enum token_type): Add TOKEN_COMMENT.
* src/input.c (next_token, peek_token, token_type_string)
(print_token): Supply new token type for comments.
* src/macro.c (expand_token): Remove penalty for unquoted `-'
bytes.  Penalize comments, as they can contain unbalanced quotes;
latent bug since 2007-12-07, exposed by passing $@ references
built from comments.
(expand_argument): Adjust caller.
* doc/m4.texinfo (Comments): Test the fix.
* NEWS: Mention the fix.

Signed-off-by: Eric Blake <address@hidden>
---
 ChangeLog      |   14 ++++++++++++++
 NEWS           |   14 +++++++-------
 doc/m4.texinfo |   21 +++++++++++++++++++++
 src/input.c    |   36 ++++++++++++++++++++++--------------
 src/m4.h       |    3 ++-
 src/macro.c    |   49 +++++++++++++++++++++++++++++++------------------
 6 files changed, 97 insertions(+), 40 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index d4f182e..325bf7a 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,17 @@
+2008-08-03  Eric Blake  <address@hidden>
+
+       Fix regression in commenting unbalanced quotes, from 2008-02-16.
+       * src/m4.h (enum token_type): Add TOKEN_COMMENT.
+       * src/input.c (next_token, peek_token, token_type_string)
+       (print_token): Supply new token type for comments.
+       * src/macro.c (expand_token): Remove penalty for unquoted `-'
+       bytes.  Penalize comments, as they can contain unbalanced quotes;
+       latent bug since 2007-12-07, exposed by passing $@ references
+       built from comments.
+       (expand_argument): Adjust caller.
+       * doc/m4.texinfo (Comments): Test the fix.
+       * NEWS: Mention the fix.
+
 2008-07-30  Eric Blake  <address@hidden>
 
        Fix regression in trace output, introduced 2008-05-09.
diff --git a/NEWS b/NEWS
index bea5f07..fe6f9e8 100644
--- a/NEWS
+++ b/NEWS
@@ -9,13 +9,13 @@ Foundation, Inc.
    a macro.  This was most noticeable with `traceon(`traceon')', but
    would also happen in cases such as `foo(traceon(`foo'))'.
 
-** Fix regression introduced in 1.4.10b (but not present in 1.4.11) where
-   using `builtin' or `indir' to perform nested `shift' calls triggered an
-   assertion failure.
-
-** Fix regression introduced in 1.4.10b (but not present in 1.4.11) where
-   the command-line option -dV, as well as the builtin `debugmode(V)',
-   failed to enable `t' and `c' debug options.
+** Fix regressions introduced in 1.4.10b but not present in 1.4.11:
+*** Using `builtin' or `indir' to perform nested `shift' calls triggered
+    an assertion failure.
+*** The command-line option -dV, as well as the builtin `debugmode(V)',
+    failed to enable `t' and `c' debug options.
+*** Comments that contain unbalanced quotes were not rescanned correctly
+    when passed through address@hidden
 
 ** Fix the `m4wrap' builtin to accumulate wrapped text in FIFO order, as
    required by POSIX.  The manual mentions a way to restore the LIFO order
diff --git a/doc/m4.texinfo b/doc/m4.texinfo
index abacef9..d8e2625 100644
--- a/doc/m4.texinfo
+++ b/doc/m4.texinfo
@@ -1038,6 +1038,27 @@ The comment delimiters can be changed to any string at 
any time, using
 the builtin macro @code{changecom}.  @xref{Changecom}, for more
 information.
 
address@hidden
address@hidden Detect regression in 1.4.10b in regards to reparsing comments.
address@hidden Not worth including in the manual.
address@hidden
+define(`e', `$@@')define(`q', ``$@@'')define(`foo', `bar')
address@hidden
+q(e(`one
+',#two ' foo
+))
address@hidden
address@hidden',`#two  bar
address@hidden''
+changecom(`<', `>')define(`n', `$#')
address@hidden
+n(e(<`>, <'>))
address@hidden
+len(e(<`>, ,<'>))
address@hidden
address@hidden example
address@hidden ignore
+
 @node Other tokens
 @section Other kinds of input tokens
 
diff --git a/src/input.c b/src/input.c
index 0d08215..4f969b7 100644
--- a/src/input.c
+++ b/src/input.c
@@ -1590,17 +1590,18 @@ quote_cache (struct obstack *obs, unsigned int age, 
const string_pair *quotes)
 /*--------------------------------------------------------------------.
 | Parse a single token from the input stream, set TD to its          |
 | contents, and return its type.  A token is TOKEN_EOF if the        |
-| input_stack is empty; TOKEN_STRING for a quoted string or comment;  |
-| TOKEN_WORD for something that is a potential macro name; and       |
-| TOKEN_SIMPLE for any single character that is not a part of any of  |
-| the previous types.  If LINE is not NULL, set *LINE to the line     |
-| where the token starts.  If OBS is not NULL, expand TOKEN_STRING    |
-| directly into OBS rather than in token_stack temporary storage      |
-| area, and TD could be a TOKEN_COMP instead of the usual            |
-| TOKEN_TEXT.  If ALLOW_ARGV, OBS must be non-NULL, and an entire     |
-| series of arguments can be returned as TOKEN_ARGV when a $@        |
-| reference is encountered.  Report errors (unterminated comments or  |
-| strings) on behalf of CALLER, if non-NULL.                         |
+| input_stack is empty; TOKEN_STRING for a quoted string;            |
+| TOKEN_COMMENT for a comment; TOKEN_WORD for something that is a     |
+| potential macro name; and TOKEN_SIMPLE for any single character     |
+| that is not a part of any of the previous types.  If LINE is not    |
+| NULL, set *LINE to the line where the token starts.  If OBS is not  |
+| NULL, expand TOKEN_STRING and TOKEN_COMMENT directly into OBS              |
+| rather than in token_stack temporary storage area, and TD could be  |
+| a TOKEN_COMP instead of the usual TOKEN_TEXT.  If ALLOW_ARGV, OBS   |
+| must be non-NULL, and an entire series of arguments can be         |
+| returned as TOKEN_ARGV when a $@ reference is encountered.  Report  |
+| errors (unterminated comments or strings) on behalf of CALLER, if   |
+| non-NULL.                                                          |
 |                                                                    |
 | Next_token () returns the token type, and passes back a pointer to  |
 | the token data through TD.  Non-string token text is collected on   |
@@ -1695,7 +1696,7 @@ next_token (token_data *td, int *line, struct obstack 
*obs, bool allow_argv,
          assert (ch < CHAR_EOF);
          obstack_1grow (obs_td, ch);
        }
-      type = TOKEN_STRING;
+      type = TOKEN_COMMENT;
     }
   else if (default_word_regexp && (isalpha (ch) || ch == '_'))
     {
@@ -1837,7 +1838,8 @@ next_token (token_data *td, int *line, struct obstack 
*obs, bool allow_argv,
     }
   else
     {
-      assert (TOKEN_DATA_TYPE (td) == TOKEN_COMP && type == TOKEN_STRING);
+      assert (TOKEN_DATA_TYPE (td) == TOKEN_COMP
+             && (type == TOKEN_STRING || type == TOKEN_COMMENT));
 #ifdef DEBUG_INPUT
       {
        token_chain *chain;
@@ -1895,7 +1897,7 @@ peek_token (void)
     }
   else if (MATCH (ch, curr_comm.str1, curr_comm.len1, false))
     {
-      result = TOKEN_STRING;
+      result = TOKEN_COMMENT;
     }
   else if ((default_word_regexp && (isalpha (ch) || ch == '_'))
 #ifdef ENABLE_CHANGEWORD
@@ -1943,6 +1945,8 @@ token_type_string (token_type t)
       return "EOF";
     case TOKEN_STRING:
       return "STRING";
+    case TOKEN_COMMENT:
+      return "COMMENT";
     case TOKEN_WORD:
       return "WORD";
     case TOKEN_OPEN:
@@ -1981,6 +1985,10 @@ print_token (const char *s, token_type t, token_data *td)
       xfprintf (stderr, "string:");
       break;
 
+    case TOKEN_COMMENT:
+      xfprintf (stderr, "comment:");
+      break;
+
     case TOKEN_MACDEF:
       xfprintf (stderr, "macro: %p\n", TOKEN_DATA_FUNC (td));
       break;
diff --git a/src/m4.h b/src/m4.h
index ff0377a..40aa5ec 100644
--- a/src/m4.h
+++ b/src/m4.h
@@ -218,7 +218,8 @@ typedef struct token_chain token_chain;
 enum token_type
 {
   TOKEN_EOF = 4,/* End of file, TOKEN_VOID.  */
-  TOKEN_STRING,        /* Quoted string or comment, TOKEN_TEXT or TOKEN_COMP.  
*/
+  TOKEN_STRING,        /* Quoted string, TOKEN_TEXT or TOKEN_COMP.  */
+  TOKEN_COMMENT,/* Comment, TOKEN_TEXT or TOKEN_COMP.  */
   TOKEN_WORD,  /* An identifier, TOKEN_TEXT.  */
   TOKEN_OPEN,  /* Active character `(', TOKEN_TEXT.  */
   TOKEN_COMMA, /* Active character `,', TOKEN_TEXT.  */
diff --git a/src/macro.c b/src/macro.c
index 0b57436..9d8ffbb 100644
--- a/src/macro.c
+++ b/src/macro.c
@@ -260,8 +260,7 @@ expand_token (struct obstack *obs, token_type t, token_data 
*td, int line,
              bool first)
 {
   symbol *sym;
-  bool result;
-  int ch;
+  bool result = false;
 
   switch (t)
     {                          /* TOKSW */
@@ -271,13 +270,19 @@ expand_token (struct obstack *obs, token_type t, 
token_data *td, int line,
       return true;
 
     case TOKEN_STRING:
-      /* Tokens and comments are safe in isolation (since quote_age()
-        detects any change in delimiters).  But if other text is
-        already present, multi-character delimiters could be an
-        issue, so use a conservative heuristic.  If obstack is
-        provided, the string was already expanded into it during
-        next_token.  */
+      /* Strings are safe in isolation (since quote_age() detects any
+        change in delimiters), or when safe_quotes is true.  When
+        safe_quotes is false, we could technically return true if we
+        can prove that the concatenation of this string to prior text
+        does not form a multi-byte quote delimiter, but that is a lot
+        of overhead, so we give the conservative answer of false.  */
       result = first || safe_quotes ();
+      /* fallthru */
+    case TOKEN_COMMENT:
+      /* Comments can contain unbalanced quote delimiters.  Rather
+        than search for one, we return the conservative answer of
+        false.  If obstack is provided, the string or comment was
+        already expanded into it during next_token.  */
       if (obs)
        return result;
       break;
@@ -285,18 +290,23 @@ expand_token (struct obstack *obs, token_type t, 
token_data *td, int line,
     case TOKEN_OPEN:
     case TOKEN_COMMA:
     case TOKEN_CLOSE:
-      /* Conservative heuristic; thanks to multi-character delimiter
-        concatenation.  */
+      /* If safe_quotes is true, then these do not form a quote
+        delimiter.  If it is false, we give the conservative answer
+        of false rather than taking time to prove that no multi-byte
+        quote delimiter is formed.  */
       result = safe_quotes ();
       break;
 
     case TOKEN_SIMPLE:
-      /* Conservative heuristic; if these characters are whitespace or
-        numeric, then behavior of safe_quotes is applicable.
-        Otherwise, assume these characters have a high likelihood of
-        use in quote delimiters.  */
-      ch = to_uchar (*TOKEN_DATA_TEXT (td));
-      result = (isspace (ch) || isdigit (ch)) && safe_quotes ();
+      /* If safe_quotes is true, then all but the single-byte end
+        quote delimiter is safe in a quoted context; a single-byte
+        start delimiter will trigger a TOKEN_STRING instead.  If
+        safe_quotes is false, we give the conservative answer of
+        false rather than taking time to prove that no multi-byte
+        quote delimiter is formed.  */
+      result = *TOKEN_DATA_TEXT (td) != *curr_quote.str2 && safe_quotes ();
+      if (result)
+       assert (*TOKEN_DATA_TEXT (td) != *curr_quote.str1);
       break;
 
     case TOKEN_WORD:
@@ -313,8 +323,10 @@ expand_token (struct obstack *obs, token_type t, 
token_data *td, int line,
 #else
          divert_text (obs, TOKEN_DATA_TEXT (td), TOKEN_DATA_LEN (td), line);
 #endif /* !ENABLE_CHANGEWORD */
-         /* The word just appended is unquoted, but the heuristics of
-            safe_quote are applicable.  */
+         /* If safe_quotes is true, then words do not overlap with
+            quote delimiters.  If it is false, we give the
+            conservative answer of false rather than prove that no
+            multi-byte delimiters are formed.  */
          return safe_quotes();
        }
       expand_macro (sym);
@@ -420,6 +432,7 @@ expand_argument (struct obstack *obs, token_data *argp,
 
        case TOKEN_WORD:
        case TOKEN_STRING:
+       case TOKEN_COMMENT:
        case TOKEN_MACDEF:
          if (!expand_token (obs, t, &td, line, first))
            age = 0;
-- 
1.5.6.4

[Prev in Thread]

Current Thread

[Next in Thread]

1.4.10b regression in comment parsing, Eric Blake <=

Prev by Date: increase limit when testing stack overflow
Next by Date: argv_ref patch 26: allow NUL in macro definitions
Previous by thread: increase limit when testing stack overflow
Next by thread: argv_ref patch 26: allow NUL in macro definitions
Index(es):
- Date
- Thread