bug-gnu-utils

Re: [PATCH 1/3] sed: Fix infinite loop on some false multi-byte matches


From: Aharon Robbins
Subject: Re: [PATCH 1/3] sed: Fix infinite loop on some false multi-byte matches
Date: Sun, 12 Feb 2012 21:30:20 +0200
User-agent: Heirloom mailx 12.4 7/29/08

Hi.

I have been looking at this and trying to see if I can reproduce
it in gawk. I can't seem to. Would someone who understands the
issue supply me with a test awk program that either shows that
gawk has this bug, or doesn't?

Thanks,

Arnold

> Going deeper, re_search_internal() calls re_string_reconstruct() and
> that calls re_string_skip_chars().
>
> re_string_skip_chars() is an I18N-specific function that advances
> character by character up to the indexed character. It operates on
> whole multi-byte characters.
>
> In a correct run, it returns the correct index of the next character
> to inspect. When the bug occurs, __mbrtowc called from there
> returns -2 (incomplete multi-byte character). Why? It seems to be caused
> by remain_len being equal to 1, even though there are still 6 bytes to
> inspect ("\267\357a\277\267\275").
>
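
To illustrate the -2 return value described above, here is a minimal
sketch of how mbrtowc() reacts when it is given fewer bytes than the
character needs. It is not taken from sed or the regex code. It uses a
UTF-8 locale instead of EUC-JP purely because UTF-8 locales are more
widely installed; the locale names, the NO_LOCALE sentinel, and both
helper functions are assumptions made for this example.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical sentinel: returned when no UTF-8 locale is available. */
#define NO_LOCALE ((size_t) -100)

/* Try a few common UTF-8 locale names (assumed, host dependent).
   Returns 1 if one of them could be selected, 0 otherwise. */
int select_utf8_locale(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Convert the 2-byte UTF-8 sequence for U+00E1, but let mbrtowc()
   see only `avail` bytes of it, much like a too-small remain_len. */
size_t convert_with_limit(size_t avail)
{
    static const char a_acute[] = "\xC3\xA1";
    mbstate_t st;
    wchar_t wc;

    if (!select_utf8_locale())
        return NO_LOCALE;
    memset(&st, 0, sizeof st);   /* start from the initial shift state */
    return mbrtowc(&wc, a_acute, avail, &st);
}
```

When a UTF-8 locale is available, convert_with_limit(1) yields
(size_t) -2 because the second byte is withheld, while
convert_with_limit(2) yields 2. The same mechanism would explain the
-2 seen in the report if remain_len is shorter than the character at
that position.
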
> I believe that remain_len is computed incorrectly:
>
> sed-4.2.1/lib/regex_internal.c:502 re_string_skip_chars()
>
>       remain_len = pstr->len - rawbuf_idx;
>
> pstr->len seems to be length of the remaining part of the string,
> rawbuf_idx is the index of the remaining part of the string in the
> original (raw) string.
>
> I am not quite familiar with the code, but I believe that the expression
> should be:
> remain_len = pstr->raw_len - rawbuf_idx;
>
>
> Example:
>
> Stopped in the first iteration of re_string_skip_chars():
>
> Correct case (two leading "a" characters):
> rawbuf_idx = 5
> *pstr = {
>   raw_mbs = 0x6479b0 "aa\267\357a\277\267\275", <incomplete sequence \350>,
>   mbs = 0x6479b2 "\267\357a\277\267\275", <incomplete sequence \350>,
>   wcs = 0x648190, offsets = 0x0, cur_state = {__count = 0, __value = {
>       __wch = 0, __wchb = "\000\000\000"}}, raw_mbs_idx = 2,
>   valid_len = 0, valid_raw_len = 3, bufs_len = 4, cur_idx = 2,
>   raw_len = 9, len = 7, raw_stop = 9, stop = 7, tip_context = 0,
>   trans = 0x0, word_char = 0x647d88, icase = 0 '\000',
>   is_utf8 = 0 '\000', map_notascii = 0 '\000', mbs_allocated = 0 '\000',
>   offsets_needed = 0 '\000', newline_anchor = 0 '\000',
>   word_ops_used = 0 '\000', mb_cur_max = 3}
>
> Buggy case (three leading "a" characters):
> rawbuf_idx = 6
> *pstr = {
>   raw_mbs = 0x6479b0 "aaa\267\357a\277\267\275", <incomplete sequence \350>,
>   mbs = 0x6479b3 "\267\357a\277\267\275", <incomplete sequence \350>,
>   wcs = 0x648190, offsets = 0x0, cur_state = {__count = 0, __value = {
>       __wch = 0, __wchb = "\000\000\000"}}, raw_mbs_idx = 3,
>   valid_len = 0, valid_raw_len = 3, bufs_len = 4, cur_idx = 2,
>   raw_len = 10, len = 7, raw_stop = 10, stop = 7, tip_context = 0,
>   trans = 0x0, word_char = 0x647d88, icase = 0 '\000',
>   is_utf8 = 0 '\000', map_notascii = 0 '\000', mbs_allocated = 0 '\000',
>   offsets_needed = 0 '\000', newline_anchor = 0 '\000',
>   word_ops_used = 0 '\000', mb_cur_max = 3}
>
>
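
The arithmetic behind the two dumps can be checked with a small
sketch. It is only an illustration: the struct is a hypothetical,
cut-down stand-in for re_string_t that keeps just the two length
fields under discussion, ptrdiff_t stands in for the Idx type, and the
function names are invented for this example.

```c
#include <stddef.h>

/* Hypothetical stand-in for re_string_t (illustration only). */
struct mini_re_string {
    ptrdiff_t raw_len;  /* raw_len field from the dump */
    ptrdiff_t len;      /* len field from the dump */
};

/* Expression used in sed-4.2.1 before the proposed patch. */
ptrdiff_t remain_len_old(const struct mini_re_string *p, ptrdiff_t rawbuf_idx)
{
    return p->len - rawbuf_idx;
}

/* Expression proposed in the patch. */
ptrdiff_t remain_len_new(const struct mini_re_string *p, ptrdiff_t rawbuf_idx)
{
    return p->raw_len - rawbuf_idx;
}
```

With the dumped values, the old expression gives 7 - 5 = 2 in the
correct case and 7 - 6 = 1 in the buggy case, which matches the
reported remain_len of 1. The proposed expression gives 9 - 5 = 4 and
10 - 6 = 4 respectively, so both cases leave room for a complete
multi-byte character.
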
> If my observation is correct, the bug is not EUC-JP specific.
>
> Bug triggers:
> - The charset must allow a false match on the boundary of two
>   characters. EUC-JP fits this requirement, UTF-8 probably does not.
> - There is a true ASCII match that is a false match in the
>   locale-specific charset.
> - This false match must appear in an exact place near two thirds of the
>   string.
>
> Index: sed-4.2.1/sed/execute.c
> ===================================================================
> --- sed-4.2.1.orig/sed/execute.c
> +++ sed-4.2.1/sed/execute.c
> @@ -261,7 +261,7 @@ str_append(to, string, length)
>           n = 1;
>         }
>  
> -        if (n > 0)
> +        if ((n != (size_t) -2) && (n > 0))
>         {
>           string += n;
>           length -= n;
> Index: sed-4.2.1/lib/regex_internal.c
> ===================================================================
> --- sed-4.2.1.orig/lib/regex_internal.c
> +++ sed-4.2.1/lib/regex_internal.c
> @@ -499,7 +499,7 @@ re_string_skip_chars (re_string_t *pstr,
>      {
>        wchar_t wc2;
>        Idx remain_len;
> -      remain_len = pstr->len - rawbuf_idx;
> +      remain_len = pstr->raw_len - rawbuf_idx;
>        prev_st = pstr->cur_state;
>        mbclen = __mbrtowc (&wc2, (const char *) pstr->raw_mbs + rawbuf_idx,
>                         remain_len, &pstr->cur_state);
>
>
> -- 
> Best Regards / S pozdravem,
>
> Stanislav Brabec
> software developer
> ---------------------------------------------------------------------
> SUSE LINUX, s. r. o.                          e-mail: address@hidden
> Lihovarská 1060/12                            tel: +49 911 7405384547
> 190 00 Praha 9                                  fax: +420 284 028 951
> Czech Republic                                    http://www.suse.cz/
>
>


