bug-gnu-utils

Re: Match returns impossible character range


From: Aharon Robbins
Subject: Re: Match returns impossible character range
Date: Wed, 14 Jul 2010 22:36:22 +0300
User-agent: Heirloom mailx 12.4 7/29/08

Greetings. Re this:

> Date: Mon, 12 Jul 2010 12:49:15 +0100
> Subject: Match returns impossible character range
> From: Cefn Hoile <address@hidden>
> To: address@hidden
>
> Match seems to set RSTART and RLENGTH to impossible values, for
> example my error reporting code...
>
> match($0, nameregexp)
> if(RSTART){
>       print "Match found for: ", nameregexp, "  :at " RSTART, ",", RLENGTH
> }
>
> ...offers this result....
>
> Match found for:  <form[^>]*(name=[^[:space:]>]*)[^>]*>   :at 1 , 1
>
> I believe this is an impossibility, given that the fixed parts of this
> regular expression are more than 10 characters long!

Thanks for the report and the sample files (sent privately).  The
issue is locale-related: running with LC_ALL=C gives the right results.

What's really going on is that you are in a UTF-8 locale but your data
isn't UTF-8.  On the first invalid multibyte sequence, str2wstr() stopped
converting, so match() built its indices for RSTART and RLENGTH from a
truncated wide string.  gawk should still give a more reasonable answer
in such a case, and the following patch fixes the problem.

Thanks!

Arnold
--------------------------------------------------
Wed Jul 14 22:31:53 2010  Arnold D. Robbins  <address@hidden>

        * node.c (str2wstr): Keep going if we get a bad multibyte sequence.
        Allows match to give correct answers for RSTART, RLENGTH.
        Add a lint warning.

Index: node.c
===================================================================
RCS file: /d/mongo/cvsrep/gawk-stable/node.c,v
retrieving revision 1.24
diff -u -r1.24 node.c
--- node.c      13 Apr 2010 19:39:23 -0000      1.24
+++ node.c      14 Jul 2010 19:30:33 -0000
@@ -755,6 +755,7 @@
        char *sp;
        mbstate_t mbs;
        wchar_t wc, *wsp;
+       static short warned = FALSE;
 
        assert((n->flags & (STRING|STRCUR)) != 0);
 
@@ -803,7 +804,24 @@
                switch (count) {
                case (size_t) -2:
                case (size_t) -1:
-                       goto done;
+                       /*
+                        * Just skip the bad byte and keep going, so that
+                        * we get a more-or-less full string, instead of
+                        * stopping early. This is particularly important
+                        * for match() where we need to build the indices.
+                        */
+                       sp++;
+                       /*
+                        * mbrtowc(3) says the state of mbs becomes undefined
+                        * after a bad character, so reset it.
+                        */
+                       memset(& mbs, 0, sizeof(mbs));
+                       /* And warn the user something's wrong */
+                       if (do_lint && ! warned) {
+                               warned = TRUE;
+                               lintwarn(_("Invalid multibyte data detected. There may be a mismatch between your data and your locale"));
+                       }
+                       break;
 
                case 0:
                        count = 1;
@@ -820,7 +838,6 @@
                }
        }
 
-done:
        *wsp = L'\0';
        n->wstlen = i;
        n->flags |= WSTRCUR;
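
The skip-and-reset idea in the patch can be tried outside gawk. What
follows is only a minimal sketch, not gawk's str2wstr(): the helper name
wide_skip_bad() and its skipped counter are made up for illustration,
and the caller is assumed to supply a buffer with room for len + 1 wide
characters. It uses only the standard mbrtowc(3) interface.

```c
#include <assert.h>
#include <locale.h>	/* callers pick the locale with setlocale() */
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (an assumption for this sketch, not gawk code):
 * convert len bytes at src into wide characters in dst, which must have
 * room for len + 1 entries. Bytes that cannot be decoded in the current
 * locale are skipped one at a time and counted in *skipped.
 * Returns the number of wide characters stored, excluding the final L'\0'.
 */
static size_t
wide_skip_bad(const char *src, size_t len, wchar_t *dst, size_t *skipped)
{
	mbstate_t mbs;
	size_t i = 0;

	assert(src != NULL && dst != NULL && skipped != NULL);
	memset(&mbs, 0, sizeof(mbs));
	*skipped = 0;

	while (len > 0) {
		wchar_t wc;
		size_t count = mbrtowc(&wc, src, len, &mbs);

		if (count == (size_t) -1 || count == (size_t) -2) {
			/*
			 * Invalid or truncated sequence: drop one byte and
			 * reset the conversion state, then keep going.
			 */
			src++;
			len--;
			(*skipped)++;
			memset(&mbs, 0, sizeof(mbs));
			continue;
		}
		if (count == 0)
			count = 1;	/* an embedded NUL occupies one byte */

		dst[i++] = wc;
		src += count;
		len -= count;
	}

	dst[i] = L'\0';
	return i;
}
```

A caller would first select the locale, for example with
setlocale(LC_ALL, ""), and then pass the record and its length. In a
UTF-8 locale the five bytes 'a' 'b' 0xFF 'c' 'd' should come back as four
wide characters with one byte skipped, rather than as a string that stops
after "ab".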


