m4-patches
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

translit speedup


From: Eric Blake
Subject: translit speedup
Date: Tue, 31 Oct 2006 07:05:20 -0700
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Thunderbird/1.5.0.7 Mnenhy/0.7.4.666

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I noticed that translit was quadratic - consider translit(a<5000 b's>,
<5000 a's>b, c<5000 d's>) - it was doing more than 25 million comparisons
to result in c<5000 d's>.  With this patch, that is reduced to several
thousand comparisons.  It also fixes index to use strstr - partly because
our implementation was not as efficient as gnulib's implementation for
single-byte locales (but the improvement is not as drastic as translit's),
and partly because it is one step closer to proper locale support (but
right now, supporting locales is a post 2.0 goal for me).

branch:
2006-10-31  Eric Blake  <address@hidden>

        * m4/gnulib-cache.m4: Augment with 'gnulib-tool --import strstr'.
        * doc/m4.texinfo (Translit): Improve the documentation.
        * src/builtin.c (m4_translit): Optimize to O(n) instead of O(n^2)
        algorithm.
        (m4_index): Simplify, and speed up slightly.
        * NEWS: Document this fix.

head:
2006-10-31  Eric Blake  <address@hidden>

        * ltdl/m4/gnulib-cache.m4: Augment with 'gnulib-tool --import
        strstr'.
        * doc/m4.texinfo (Translit): Merge from branch.
        * modules/m4.c (divert, substr): Ignore excess arguments.
        (index, translit): Merge from branch.
        * tests/builtins.at (translit): Add a test.

- --
Life is short - so eat dessert first!

Eric Blake             address@hidden
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFR1gf84KuGfSFAYARAiJRAJ92C6bnQh7QPmn2FodmrXtionErGQCg0mp6
+0OLVh7CZT5l+vIESpYfnAQ=
=W8sy
-----END PGP SIGNATURE-----
Index: NEWS
===================================================================
RCS file: /sources/m4/m4/NEWS,v
retrieving revision 1.1.1.1.2.77
diff -u -p -r1.1.1.1.2.77 NEWS
--- NEWS        29 Oct 2006 15:22:42 -0000      1.1.1.1.2.77
+++ NEWS        31 Oct 2006 13:18:16 -0000
@@ -43,6 +43,7 @@ Version 1.4.8 - ?? ??? 2006, by ??  (CVS
 * The `changecom' and `changequote' macros now treat an empty second
   argument the same as if it were missing, rather than using the empty
   string and making it impossible to end a comment or quote.
+* The `translit' macro now operates in linear instead of quadratic time.
 
 Version 1.4.7 - 25 September 2006, by Eric Blake  (CVS version 1.4.6a)
 
Index: doc/m4.texinfo
===================================================================
RCS file: /sources/m4/m4/doc/m4.texinfo,v
retrieving revision 1.1.1.1.2.95
diff -u -p -r1.1.1.1.2.95 m4.texinfo
--- doc/m4.texinfo      29 Oct 2006 15:22:42 -0000      1.1.1.1.2.95
+++ doc/m4.texinfo      31 Oct 2006 13:18:19 -0000
@@ -4080,9 +4080,14 @@ Expands to @var{string}, with each chara
 the same index.
 
 If @var{replacement} is shorter than @var{chars}, the excess characters
-are deleted from the expansion.  If @var{replacement} is omitted, all
-characters in @var{string} that are present in @var{chars} are deleted
-from the expansion.
+of @var{chars} are deleted from the expansion; if @var{chars} is
+shorter, the excess characters in @var{replacement} are silently
+ignored.  If @var{replacement} is omitted, all characters in
address@hidden that are present in @var{chars} are deleted from the
+expansion.  If a character appears more than once in @var{chars}, only
+the first instance is used in making the translation.  Only a single
+translation pass is made, even if characters in @var{replacement} also
+appear in @var{chars}.
 
 As a @acronym{GNU} extension, both @var{chars} and @var{replacement} can
 contain character-ranges,
@@ -4104,12 +4109,17 @@ translit(`GNUs not Unix', `a-z', `A-Z')
 @result{}GNUS NOT UNIX
 translit(`GNUs not Unix', `A-Z', `z-a')
 @result{}tmfs not fnix
+translit(`abcdef', `aabdef', `bcged')
address@hidden
 @end example
 
 The first example deletes all uppercase letters, the second converts
 lowercase to uppercase, and the third `mirrors' all uppercase letters,
 while converting them to lowercase.  The two first cases are by far the
-most common.
+most common.  The final example shows that @samp{a} is mapped to
address@hidden, not @samp{c}; the resulting @samp{b} is not further remapped
+to @samp{g}; the @samp{d} and @samp{e} are swapped, and the @samp{f} is
+discarded.
 
 Omitting @var{chars} evokes a warning, but still produces output.
 
Index: m4/gnulib-cache.m4
===================================================================
RCS file: /sources/m4/m4/m4/Attic/gnulib-cache.m4,v
retrieving revision 1.1.2.17
diff -u -p -r1.1.2.17 gnulib-cache.m4
--- m4/gnulib-cache.m4  17 Oct 2006 16:13:03 -0000      1.1.2.17
+++ m4/gnulib-cache.m4  31 Oct 2006 13:18:19 -0000
@@ -15,11 +15,11 @@
 
 
 # Specification in the form of a command-line invocation:
-#   gnulib-tool --import --dir=. --lib=libm4 --source-base=lib --m4-base=m4 
--doc-base=doc --aux-dir=. --no-libtool --macro-prefix=M4 binary-io clean-temp 
cloexec close-stream closeout config-h error fdl fopen-safer free gendocs 
getopt gnupload mkstemp obstack regex stdlib-safer strtol unlocked-io verror 
xalloc xvasprintf
+#   gnulib-tool --import --dir=. --lib=libm4 --source-base=lib --m4-base=m4 
--doc-base=doc --aux-dir=. --no-libtool --macro-prefix=M4 binary-io clean-temp 
cloexec close-stream closeout config-h error fdl fopen-safer free gendocs 
getopt gnupload mkstemp obstack regex stdlib-safer strstr strtol unlocked-io 
verror xalloc xvasprintf
 
 # Specification in the form of a few gnulib-tool.m4 macro invocations:
 gl_LOCAL_DIR([])
-gl_MODULES([binary-io clean-temp cloexec close-stream closeout config-h error 
fdl fopen-safer free gendocs getopt gnupload mkstemp obstack regex stdlib-safer 
strtol unlocked-io verror xalloc xvasprintf])
+gl_MODULES([binary-io clean-temp cloexec close-stream closeout config-h error 
fdl fopen-safer free gendocs getopt gnupload mkstemp obstack regex stdlib-safer 
strstr strtol unlocked-io verror xalloc xvasprintf])
 gl_AVOID([])
 gl_SOURCE_BASE([lib])
 gl_M4_BASE([m4])
Index: src/builtin.c
===================================================================
RCS file: /sources/m4/m4/src/Attic/builtin.c,v
retrieving revision 1.1.1.1.2.47
diff -u -p -r1.1.1.1.2.47 builtin.c
--- src/builtin.c       29 Oct 2006 15:22:42 -0000      1.1.1.1.2.47
+++ src/builtin.c       31 Oct 2006 13:18:19 -0000
@@ -27,6 +27,7 @@
 extern FILE *popen ();
 
 #include "regex.h"
+#include "strstr.h"
 
 #if HAVE_SYS_WAIT_H
 # include <sys/wait.h>
@@ -1575,8 +1576,9 @@ m4_len (struct obstack *obs, int argc, t
 static void
 m4_index (struct obstack *obs, int argc, token_data **argv)
 {
-  const char *cp, *last;
-  int l1, l2, retval;
+  const char *haystack;
+  const char *result;
+  int retval;
 
   if (bad_argc (argv[0], argc, 3, 3))
     {
@@ -1586,17 +1588,9 @@ m4_index (struct obstack *obs, int argc,
       return;
     }
 
-  l1 = strlen (ARG (1));
-  l2 = strlen (ARG (2));
-
-  last = ARG (1) + l1 - l2;
-
-  for (cp = ARG (1); cp <= last; cp++)
-    {
-      if (strncmp (cp, ARG (2), l2) == 0)
-       break;
-    }
-  retval = (cp <= last) ? cp - ARG (1) : -1;
+  haystack = ARG (1);
+  result = strstr (haystack, ARG (2));
+  retval = result ? result - haystack : -1;
 
   shipout_int (obs, retval);
 }
@@ -1693,9 +1687,11 @@ expand_ranges (const char *s, struct obs
 static void
 m4_translit (struct obstack *obs, int argc, token_data **argv)
 {
-  register const char *data, *tmp;
-  const char *from, *to;
-  int tolen;
+  const unsigned char *data;
+  const unsigned char *from;
+  const unsigned char *to;
+  char map[256] = {0};
+  char found[256] = {0};
 
   if (bad_argc (argv[0], argc, 3, 4))
     {
@@ -1713,33 +1709,37 @@ m4_translit (struct obstack *obs, int ar
        return;
     }
 
-  if (argc >= 4)
+  to = ARG (3);
+  if (strchr (to, '-') != NULL)
+    {
+      to = expand_ranges (to, obs);
+      if (to == NULL)
+       return;
+    }
+
+  /* Calling strchr(from) for each character in data is quadratic,
+     since both strings can be arbitrarily long.  Instead, create a
+     from-to mapping in one pass of data, then use that map in one
+     pass of from, for linear behavior.  Traditional behavior is that
+     only the first instance of a character in from is consulted,
+     hence the found map.  */
+  for ( ; *from; from++)
     {
-      to = ARG (3);
-      if (strchr (to, '-') != NULL)
+      if (! found[*from])
        {
-         to = expand_ranges (to, obs);
-         if (to == NULL)
-           return;
+         found[*from] = 1;
+         map[*from] = *to;
        }
+      if (*to != '\0')
+       to++;
     }
-  else
-    to = "";
-
-  tolen = strlen (to);
 
   for (data = ARG (1); *data; data++)
     {
-      tmp = strchr (from, *data);
-      if (tmp == NULL)
-       {
-         obstack_1grow (obs, *data);
-       }
-      else
-       {
-         if (tmp - from < tolen)
-           obstack_1grow (obs, *(to + (tmp - from)));
-       }
+      if (! found[*data])
+       obstack_1grow (obs, *data);
+      else if (map[*data])
+       obstack_1grow (obs, map[*data]);
     }
 }
 
Index: ltdl/m4/gnulib-cache.m4
===================================================================
RCS file: /sources/m4/m4/ltdl/m4/gnulib-cache.m4,v
retrieving revision 1.18
diff -u -p -r1.18 gnulib-cache.m4
--- ltdl/m4/gnulib-cache.m4     28 Oct 2006 19:37:47 -0000      1.18
+++ ltdl/m4/gnulib-cache.m4     31 Oct 2006 14:04:45 -0000
@@ -15,11 +15,11 @@
 
 
 # Specification in the form of a command-line invocation:
-#   gnulib-tool --import --dir=. --lib=libgnu --source-base=gnu 
--m4-base=ltdl/m4 --doc-base=doc --aux-dir=ltdl/config --libtool 
--macro-prefix=M4 assert avltree-list binary-io clean-temp cloexec close-stream 
closeout config-h configmake dirname error exit fdl filenamecat fopen-safer 
free gendocs gettext gnupload mkstemp obstack progname regex regexprops-generic 
stdbool stdlib-safer strnlen strtol tempname unlocked-io verror xalloc 
xalloc-die xstrndup xstrtol xvasprintf
+#   gnulib-tool --import --dir=. --lib=libgnu --source-base=gnu 
--m4-base=ltdl/m4 --doc-base=doc --aux-dir=ltdl/config --libtool 
--macro-prefix=M4 assert avltree-list binary-io clean-temp cloexec close-stream 
closeout config-h configmake dirname error exit fdl filenamecat fopen-safer 
free gendocs gettext gnupload mkstemp obstack progname regex regexprops-generic 
stdbool stdlib-safer strnlen strstr strtol tempname unlocked-io verror xalloc 
xalloc-die xstrndup xstrtol xvasprintf
 
 # Specification in the form of a few gnulib-tool.m4 macro invocations:
 gl_LOCAL_DIR([])
-gl_MODULES([assert avltree-list binary-io clean-temp cloexec close-stream 
closeout config-h configmake dirname error exit fdl filenamecat fopen-safer 
free gendocs gettext gnupload mkstemp obstack progname regex regexprops-generic 
stdbool stdlib-safer strnlen strtol tempname unlocked-io verror xalloc 
xalloc-die xstrndup xstrtol xvasprintf])
+gl_MODULES([assert avltree-list binary-io clean-temp cloexec close-stream 
closeout config-h configmake dirname error exit fdl filenamecat fopen-safer 
free gendocs gettext gnupload mkstemp obstack progname regex regexprops-generic 
stdbool stdlib-safer strnlen strstr strtol tempname unlocked-io verror xalloc 
xalloc-die xstrndup xstrtol xvasprintf])
 gl_AVOID([])
 gl_SOURCE_BASE([gnu])
 gl_M4_BASE([ltdl/m4])
Index: doc/m4.texinfo
===================================================================
RCS file: /sources/m4/m4/doc/m4.texinfo,v
retrieving revision 1.74
diff -u -p -r1.74 m4.texinfo
--- doc/m4.texinfo      27 Oct 2006 17:03:51 -0000      1.74
+++ doc/m4.texinfo      31 Oct 2006 14:04:48 -0000
@@ -4853,29 +4853,36 @@ The builtin macro @code{substr} is recog
 
 @cindex translating characters
 @cindex characters, translating
+Character translation is done with @code{translit}:
+
 @deffn {Builtin (m4)} translit (@var{string}, @var{chars}, @var{replacement})
-Character translation is done with @code{translit}, which expands to
address@hidden, with each character that occurs in @var{chars} translated
-into the character from @var{replacement} with the same index.
+Expands to @var{string}, with each character that occurs in
address@hidden translated into the character from @var{replacement} with
+the same index.
 
 If @var{replacement} is shorter than @var{chars}, the excess characters
-are deleted from the expansion.  If @var{replacement} is omitted, all
-characters in @var{string}, that are present in @var{chars} are deleted
-from the expansion.
+of @var{chars} are deleted from the expansion; if @var{chars} is
+shorter, the excess characters in @var{replacement} are silently
+ignored.  If @var{replacement} is omitted, all characters in
address@hidden that are present in @var{chars} are deleted from the
+expansion.  If a character appears more than once in @var{chars}, only
+the first instance is used in making the translation.  Only a single
+translation pass is made, even if characters in @var{replacement} also
+appear in @var{chars}.
 
-Both @var{chars} and @var{replacement} can contain character-ranges,
+As a @acronym{GNU} extension, both @var{chars} and @var{replacement} can
+contain character-ranges,
 e.g., @samp{a-z} (meaning all lowercase letters) or @samp{0-9} (meaning
 all digits).  To include a dash @samp{-} in @var{chars} or
 @var{replacement}, place it first or last.
 
-The builtin macro @code{translit} is recognized only when given
-arguments.
address@hidden deffn
-
-It is not an error for the last character in the range to be `smaller'
+It is not an error for the last character in the range to be `larger'
 than the first.  In that case, the range runs backwards, i.e.,
 @samp{9-0} means the string @samp{9876543210}.
 
+The macro @code{translit} is recognized only with parameters.
address@hidden deffn
+
 @example
 translit(`GNUs not Unix', `A-Z')
 @result{}s not nix
@@ -4883,12 +4890,25 @@ translit(`GNUs not Unix', `a-z', `A-Z')
 @result{}GNUS NOT UNIX
 translit(`GNUs not Unix', `A-Z', `z-a')
 @result{}tmfs not fnix
+translit(`abcdef', `aabdef', `bcged')
address@hidden
 @end example
 
 The first example deletes all uppercase letters, the second converts
 lowercase to uppercase, and the third `mirrors' all uppercase letters,
 while converting them to lowercase.  The two first cases are by far the
-most common.
+most common.  The final example shows that @samp{a} is mapped to
address@hidden, not @samp{c}; the resulting @samp{b} is not further remapped
+to @samp{g}; the @samp{d} and @samp{e} are swapped, and the @samp{f} is
+discarded.
+
+Omitting @var{chars} evokes a warning, but still produces output.
+
address@hidden
+translit(`abc')
address@hidden:stdin:1: Warning: translit: too few arguments: 1 < 2
address@hidden
address@hidden example
 
 @node Patsubst
 @section Substituting text by regular expression
Index: modules/m4.c
===================================================================
RCS file: /sources/m4/m4/modules/m4.c,v
retrieving revision 1.89
diff -u -p -r1.89 m4.c
--- modules/m4.c        31 Oct 2006 02:05:37 -0000      1.89
+++ modules/m4.c        31 Oct 2006 14:04:48 -0000
@@ -23,6 +23,7 @@
 #include <errno.h>
 
 #include "stdlib--.h"
+#include "strstr.h"
 #include "tempname.h"
 #include "unistd--.h"
 
@@ -550,7 +551,7 @@ M4BUILTIN_HANDLER (divert)
 {
   int i = 0;
 
-  if (argc == 2 && !m4_numeric_arg (context, argc, argv, 1, &i))
+  if (argc >= 2 && !m4_numeric_arg (context, argc, argv, 1, &i))
     return;
 
   m4_make_diversion (context, i);
@@ -878,20 +879,9 @@ M4BUILTIN_HANDLER (len)
    argument.  */
 M4BUILTIN_HANDLER (index)
 {
-  const char *cp, *last;
-  int l1, l2, retval;
-
-  l1 = strlen (M4ARG (1));
-  l2 = strlen (M4ARG (2));
-
-  last = M4ARG (1) + l1 - l2;
-
-  for (cp = M4ARG (1); cp <= last; cp++)
-    {
-      if (strncmp (cp, M4ARG (2), l2) == 0)
-       break;
-    }
-  retval = (cp <= last) ? cp - M4ARG (1) : -1;
+  const char *haystack = M4ARG (1);
+  const char *result = strstr (haystack, M4ARG (2));
+  int retval = result ? result - haystack : -1;
 
   m4_shipout_int (obs, retval);
 }
@@ -908,7 +898,7 @@ M4BUILTIN_HANDLER (substr)
   if (!m4_numeric_arg (context, argc, argv, 2, &start))
     return;
 
-  if (argc == 4 && !m4_numeric_arg (context, argc, argv, 3, &length))
+  if (argc >= 4 && !m4_numeric_arg (context, argc, argv, 3, &length))
     return;
 
   if (start < 0 || length <= 0 || start >= avail)
@@ -969,45 +959,49 @@ m4_expand_ranges (const char *s, m4_obst
    deleted from the first (pueh)  */
 M4BUILTIN_HANDLER (translit)
 {
-  register const char *data, *tmp;
-  const char *from, *to;
-  int tolen;
+  const unsigned char *data;
+  const unsigned char *from;
+  const unsigned char *to;
+  char map[256] = {0};
+  char found[256] = {0};
 
   from = M4ARG (2);
   if (strchr (from, '-') != NULL)
     {
       from = m4_expand_ranges (from, obs);
-      if (from == NULL)
-       return;
+      assert (from);
+    }
+
+  to = M4ARG (3);
+  if (strchr (to, '-') != NULL)
+    {
+      to = m4_expand_ranges (to, obs);
+      assert (to);
     }
 
-  if (argc == 4)
+  /* Calling strchr(from) for each character in data is quadratic,
+     since both strings can be arbitrarily long.  Instead, create a
+     from-to mapping in one pass of data, then use that map in one
+     pass of from, for linear behavior.  Traditional behavior is that
+     only the first instance of a character in from is consulted,
+     hence the found map.  */
+  for ( ; *from; from++)
     {
-      to = M4ARG (3);
-      if (strchr (to, '-') != NULL)
+      if (! found[*from])
        {
-         to = m4_expand_ranges (to, obs);
-         if (to == NULL)
-           return;
+         found[*from] = 1;
+         map[*from] = *to;
        }
+      if (*to != '\0')
+       to++;
     }
-  else
-    to = "";
-
-  tolen = strlen (to);
 
   for (data = M4ARG (1); *data; data++)
     {
-      tmp = strchr (from, *data);
-      if (tmp == NULL)
-       {
-         obstack_1grow (obs, *data);
-       }
-      else
-       {
-         if (tmp - from < tolen)
-           obstack_1grow (obs, *(to + (tmp - from)));
-       }
+      if (! found[*data])
+       obstack_1grow (obs, *data);
+      else if (map[*data])
+       obstack_1grow (obs, map[*data]);
     }
 }
 
Index: tests/builtins.at
===================================================================
RCS file: /sources/m4/m4/tests/builtins.at,v
retrieving revision 1.29
diff -u -p -r1.29 builtins.at
--- tests/builtins.at   27 Oct 2006 15:16:31 -0000      1.29
+++ tests/builtins.at   31 Oct 2006 14:04:48 -0000
@@ -862,6 +862,16 @@ z
 tmfs not fnix
 ]])
 
+dnl This used to be quadratic, taking more than 25 million comparisons,
+dnl but should now operate in linear time with only several thousand checks.
+AT_DATA([[in]],
+[[translit(`a]m4_for([i],[1],[5000],[],[[b]])[',
+ `]m4_for([i],[1],[5000],[],[[a]])[b',
+ `c]m4_for([i],[1],[5000],[],[[d]])[')
+]])
+AT_CHECK_M4([in], [0], [[c]m4_for([i],[1],[5000],[],[[d]])
+])
+
 AT_CLEANUP
 
 

reply via email to

[Prev in Thread] Current Thread [Next in Thread]