bug-wget

Re: Fwd: probable bug(s) in do_conversion() in iri.c


From: Derek Martin
Subject: Re: Fwd: probable bug(s) in do_conversion() in iri.c
Date: Mon, 26 Apr 2021 18:12:11 -0400
User-agent: Mutt/1.9.4 (2018-02-28)

On Sun, Apr 25, 2021 at 04:37:45PM +0200, Tim Rühsen wrote:
> b) Did you take a look into Wget2 to compare the code ?
> There is a similar function to basically do the same that was written as
> library function and which is fuzzed continuously (code for the fuzzers is
> in /fuzz).
> https://gitlab.com/gnuwget/wget2/-/blob/master/libwget/encoding.c#L65
> 
> Do you think there is a similar issue, or can we possibly just take that
> code ? (Or maybe rework wget to use libwget in the long run).

Assuming I've not made an error reading this new function, it seems
better in general, and as far as I can tell it should work fine for
the typical case of converting US-ASCII, ISO-8859-1, or any of the
16-bit character sets to UTF-8.  It has better protection against
buffer overflows and against converting into a buffer that's too
small, because it allocates 6x the space used by the original string,
which AFAIK is always (more than) enough for the conversion to
succeed.  [NOTE: I'm not aware of any character sets that use more
than 4 bytes/character, although that doesn't mean one doesn't exist.
Hopefully you have access to an expert who knows the answer.]
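
To make the sizing argument concrete, here is a minimal sketch of the
worst-case calculation.  It is not the Wget2 code; the function name,
the EXPANSION_FACTOR macro, and the term_width parameter are all made
up for illustration.  It only shows the "6x plus terminator" arithmetic
with a guard so the multiplication itself cannot overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption taken from the discussion: 6 output bytes per input byte. */
#define EXPANSION_FACTOR 6

/*
 * Hypothetical helper: compute the number of bytes to allocate for a
 * conversion of src_len input bytes, plus term_width bytes reserved
 * for the terminator.  Returns false if the result would not fit in
 * a size_t, in which case *out_size is left untouched.
 */
bool worst_case_out_size(size_t src_len, size_t term_width, size_t *out_size)
{
    assert(out_size != NULL);

    if (term_width > SIZE_MAX)
        return false; /* cannot happen for size_t; kept for clarity */

    /* Check before multiplying so the arithmetic never wraps around. */
    if (src_len > (SIZE_MAX - term_width) / EXPANSION_FACTOR)
        return false;

    *out_size = EXPANSION_FACTOR * src_len + term_width;
    return true;
}
```

A caller would pass the result to malloc() and treat a false return
the same way as an allocation failure.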

However, as far as I can tell, it still has the problem that if the
target character set is one whose null character is wider than 8
bits, then the termination is wrong (not enough null bytes).  This
would include at least UTF-16 and UTF-32.  I believe it also includes
EUC-KR and EUC-CN, which I *think* always use 2-byte characters, and
it probably includes some eastern European character sets.  But I am
no expert in character sets, so that may or may not be true, and
there may or may not be more.
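
A small illustration of why one null byte is not enough for a 16-bit
encoding, written as a sketch rather than anything taken from wget or
Wget2.  The helper name is hypothetical.  It scans a byte buffer the
way a UTF-16LE consumer would, two bytes at a time, and stops only at
a full 16-bit zero.  The scan is bounded by cap so the example itself
never reads out of range.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: count 16-bit units in buf before the first
 * unit whose two bytes are both zero.  At most cap bytes are read.
 * If no terminator is found, the number of whole units in cap is
 * returned.
 */
size_t utf16le_units_before_nul(const unsigned char *buf, size_t cap)
{
    assert(buf != NULL);

    size_t units = 0;
    for (size_t i = 0; i + 1 < cap; i += 2) {
        if (buf[i] == 0 && buf[i + 1] == 0)
            return units;   /* found a complete 2-byte terminator */
        units++;
    }
    return units;           /* ran out of buffer without a terminator */
}
```

With "hi" encoded as { 'h', 0, 'i', 0 }, a single zero byte followed
by a stray 'X' is read as one more character, so the scan runs past
the intended end of the string.  With two zero bytes it stops where
expected.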

FWIW, my version of this function assumes the maximum character width
is 4 bytes, e.g. for UTF-32.  AFAICT the extra space allocated for the
null terminator needs to be at least as wide as the maximum possible
null character width, and should always be filled in completely,
unless the code wants to calculate the correct size based on the "to"
encoding.  So if 4 bytes is not the maximum width of a null, then that
would need to be adjusted.
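
For illustration only, the approach described above could look roughly
like the sketch below.  This is not my actual patch and not wget's
do_conversion(); convert_with_wide_nul and MAX_NUL_WIDTH are names
invented for the example.  It uses the standard iconv_open(), iconv()
and iconv_close() calls, allocates 6x the input size plus 4 bytes, and
always zero-fills all 4 terminator bytes regardless of the target
encoding.

```c
#include <assert.h>
#include <iconv.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: 4 bytes is the widest null character (e.g. UTF-32). */
#define MAX_NUL_WIDTH 4

/*
 * Hypothetical helper: convert src_len bytes of src from encoding
 * `from` to encoding `to`.  Returns a malloc'd buffer that ends with
 * MAX_NUL_WIDTH zero bytes, or NULL on any failure.  On success,
 * *out_len holds the converted length in bytes, terminator excluded.
 */
char *convert_with_wide_nul(const char *from, const char *to,
                            const char *src, size_t src_len,
                            size_t *out_len)
{
    assert(from && to && src && out_len);

    /* Guard the 6x multiplication against overflow. */
    if (src_len > (SIZE_MAX - MAX_NUL_WIDTH) / 6)
        return NULL;
    size_t cap = 6 * src_len;

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return NULL;

    char *dst = malloc(cap + MAX_NUL_WIDTH);
    if (dst == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in = (char *)src;     /* iconv() takes a non-const pointer */
    size_t in_left = src_len;
    char *out = dst;
    size_t out_left = cap;

    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &out, &out_left); /* flush shift state */
    iconv_close(cd);

    if (rc == (size_t)-1) {     /* invalid input, or output did not fit */
        free(dst);
        return NULL;
    }

    *out_len = cap - out_left;
    /* Fill the whole terminator, not just its first byte. */
    memset(dst + *out_len, 0, MAX_NUL_WIDTH);
    return dst;
}
```

The caller frees the result with free().  One caveat of this sketch:
if the iconv implementation prepends a byte-order mark for the target
encoding, a very short input may not fit in 6x its size, in which case
the function simply fails and returns NULL.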


-- 
Derek Martin
Senior System Software Engineer II Lead
Akamai Technologies
demartin@akamai.com


