bug-wget
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

probable bug(s) in do_conversion() in iri.c


From: Derek Martin
Subject: probable bug(s) in do_conversion() in iri.c
Date: Tue, 20 Apr 2021 15:21:13 -0400
User-agent: Mutt/1.9.4 (2018-02-28)

Hi,

I have been trying to sort out the fix for CVE-2019-5953, which based
on a diff of wget-1.20 vs. 1.21.1, appears to be simply this, in
do_conversion, iri.c:

@@ -189,9 +190,10 @@
         {
           tooshort++;
           done = len;
-          len = outlen = done + inlen * 2;
-          s = xrealloc (s, outlen + 1);
-          *out = s + done;
+          len = done + inlen * 2;
+          s = xrealloc (s, len + 1);
+          *out = s + done - outlen;
+          outlen += inlen * 2;
         }
       else /* Weird, we got an unspecified error */
         {

However, regardless of what the fix actually was, it appears to me
this function in 1.21.1 produces incorrect results.  I have no valid
IRIs to test this against, so what I did to test was pull the function
out of wget, with very minor modifications to get it to compile, and
provide a tiny wrapper.

Even that proved challenging, because I could not find useful detailed
info about the bug, or the case that caused the buffer overflow, or
even how to get this code path to execute with natural valid input.
So what I did instead was to make one additional small change, to
force the initial allocation of the output buffer to be too small to
hold the converted output data, for the vast majority of cases.  

For input, I used the EUC-KR encoding of the Korean string,
"안녕하세요." (Annyeong haseyo):

$ od -xa test_input.txt
0000000    c8be    e7b3    cfc7    bcbc    e4bf    0a2e
          >   H   3   g   G   O   <   <   ?   d   .  nl
0000014

Then attempted to convert it from EUC-KR to UTF-8 using this function.
Doing so produced garbage, because after the (successful) conversion
from iconv, the code inserts a null into the string somewhere in the
middle, truncating the output (and in this case, breaking one of the 
characters:

$ gcc -g -o dct do_conversion_test.c
$ ./dct `cat test_input.txt` euc-kr utf-8
output buffer too small case hit
converted '�ȳ��ϼ���.' (euc-kr) -> '�' (utf-8)

I rewrote the function, and my updated version produces correct
results:

$ gcc -g -o dct_fix  do_conversion_fix.c
$ ./dct_fix `cat test_input.txt` euc-kr utf-8
output buffer too small case hit
converted ''
converted '�ȳ��ϼ���.' (euc-kr) -> '안녕하세요.' (utf-8)

Naturally the input data encoded in EUC-KR prints as gibberish since
it doesn't match my terminal settings, but importantly the output in
UTF-8 (which does match my settings) is correct.

I believe the 1.21.1 version of the function also has a case where:

+          *out = s + done - outlen;

can set *out to an address before the allocated buffer, because done
is zero and outlen is not.

Another *potential* issue with the wget-1.21.1 version is that it only
leaves 1 byte for the terminating null, however if the output format
is, say, UTF-32, the terminating null will be 4 bytes.  My version of
the function leaves 4 bytes for the null and fills them all in.  I did
not attempt to address the middle case (where errno == EINVAL or
EILSEQ) because I wasn't quite sure what the intended behavior there
is, so a future fix would need to address that as well, since I
renamed/changed some of the variables and how they're used to make the
logic a bit clearer (to me at least). [I just had my version return
false in this case...]

I've attempted to attach three files to this message:

do_conversion_test.c - The extracted function with minimal modifications
do_conversion_fix.c - My fix to this function 
test_input.txt - The input text I described above.

Thanks,
Derek Martin


Attachment: do_conversion_test.c
Description: Text Data

Attachment: do_conversion_fix.c
Description: Text Data

Attachment: test_input.txt
Description: Text document


reply via email to

[Prev in Thread] Current Thread [Next in Thread]