
Re: [Nmh-workers] bug in decode_rfc2047()


From: Ken Hornstein
Subject: Re: [Nmh-workers] bug in decode_rfc2047()
Date: Thu, 03 Jan 2013 13:31:45 -0500

>The root of all this is iconv's behavior that requires us to
>skip past the invalid character.  Looking at it now, I wonder if
>we can do better than the current special handling for UTF-8?
>It's the "fromutf8" block below:
>[...]

Hm.  I played around with this a bit, and I'm not sure what to do.

iconv() doesn't distinguish between "We can't convert this character to
the target character set" and "This multibyte sequence is invalid"; they
both get EILSEQ.  Even worse, we can't (portably) tell where the end of
a multibyte sequence is.
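
To make the ambiguity concrete, here is a minimal sketch (not code from
nmh; try_convert() is a hypothetical helper) that runs a single iconv()
call and hands back the errno it reported. It assumes an iconv that
refuses unconvertible characters, as glibc does; implementations that
silently substitute a replacement character will behave differently.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: run one iconv() call over `in` and return 0 on
 * success, or the errno value reported by iconv_open()/iconv(). */
static int try_convert(const char *to, const char *from,
                       const char *in, size_t inlen)
{
    char outbuf[64];
    char *inp = (char *) in;    /* iconv() wants char **, input is not modified */
    char *outp = outbuf;
    size_t inleft = inlen;
    size_t outleft = sizeof outbuf;
    int err = 0;
    iconv_t cd;

    cd = iconv_open(to, from);
    if (cd == (iconv_t) -1)
        return errno;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
        err = errno;

    iconv_close(cd);
    return err;
}
```

Under that assumption, try_convert("ASCII", "UTF-8", "\xc3\xa9", 2) (a
valid character with no ASCII equivalent) and try_convert("ASCII",
"UTF-8", "\xff", 1) (a byte that can never appear in UTF-8) both return
EILSEQ, so the caller cannot tell the two cases apart.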

So, I see a couple of options.  We could go completely portable and put
in a "?" (or whatever) for every byte that's invalid.  That would have
us generate multiple "?" for multibyte character sets like UTF-8.  We could
suppress multiple invalid bytes in a row so there's just one "?", but
that seems kinda lousy to me.
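
A rough sketch of that portable approach, with the run-suppression as
an option, might look like the following. The function name
convert_lossy() and its `collapse` flag are made up for illustration;
this is not what nmh does today. It treats EINVAL (a truncated sequence
at the end of the input) the same way as EILSEQ, and it assumes a
stateless source encoding.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Convert `in` into `out`, writing '?' for every input byte that iconv()
 * rejects.  If `collapse` is nonzero, a run of consecutive rejected bytes
 * yields a single '?'.  The output is NUL-terminated.  Returns the number
 * of bytes written (excluding the NUL), or (size_t) -1 on failure. */
static size_t convert_lossy(const char *to, const char *from,
                            const char *in, size_t inlen,
                            char *out, size_t outsize, int collapse)
{
    char *inp = (char *) in;       /* iconv() wants char ** */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft;
    const char *last_bad = NULL;   /* position just past the last skipped byte */
    iconv_t cd;

    if (outsize == 0)
        return (size_t) -1;
    outleft = outsize - 1;         /* reserve room for the NUL */

    cd = iconv_open(to, from);
    if (cd == (iconv_t) -1)
        return (size_t) -1;

    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t) -1)
            break;                 /* remaining input converted cleanly */

        if (errno != EILSEQ && errno != EINVAL) {
            iconv_close(cd);       /* E2BIG or something unexpected */
            return (size_t) -1;
        }

        /* inp now points at the byte iconv() refused.  Emit '?' unless we
         * are collapsing and this byte directly follows another bad one. */
        if (!(collapse && inp == last_bad)) {
            if (outleft == 0) {
                iconv_close(cd);
                return (size_t) -1;
            }
            *outp++ = '?';
            outleft--;
        }

        inp++;                     /* skip exactly one byte and retry */
        inleft--;
        last_bad = inp;
    }

    iconv_close(cd);
    *outp = '\0';
    return (size_t) (outp - out);
}
```

With an iconv that rejects unconvertible characters (glibc, for one),
converting the six bytes "caf\xc3\xa9!" from UTF-8 to ASCII should give
"caf??!" without collapsing and "caf?!" with it, which shows both the
doubled "?" and why collapsing hides how many characters were lost.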

GNU libiconv (which it seems like a fair number of people use) has
an iconvctl() function, and it has an undocumented option that lets
you install your own substitution function for invalid bytes/codepoints.
That function isn't part of POSIX.  The fact that it's undocumented and
nonstandard makes me think we shouldn't use it.

Unless we have a LOT of multibyte character sets to deal with, perhaps
the special-case here for UTF-8 is the best alternative?  Any other thoughts
on this matter?
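
For reference, the idea behind a UTF-8 special case can be sketched as
below: since the lead byte of a UTF-8 sequence announces its length, the
caller can skip the whole character and emit one "?" instead of one per
byte. This is an illustrative sketch only, not the "fromutf8" block
quoted above; both function names are hypothetical.

```c
#include <stddef.h>

/* Length in bytes announced by a UTF-8 lead byte.  Returns 1 for ASCII
 * and for any byte that is not a valid lead byte, so that the caller
 * always makes progress. */
static size_t utf8_announced_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;
    if ((lead & 0xE0) == 0xC0)
        return 2;
    if ((lead & 0xF0) == 0xE0)
        return 3;
    if ((lead & 0xF8) == 0xF0)
        return 4;
    return 1;   /* stray continuation byte or 0xF8..0xFF */
}

/* Number of bytes to skip at `p`, where `left` bytes remain: the lead
 * byte plus the continuation bytes (10xxxxxx) that actually follow it,
 * never more than the lead byte announced and never past the buffer. */
static size_t utf8_skip_len(const char *p, size_t left)
{
    size_t want, n = 1;

    if (left == 0)
        return 0;

    want = utf8_announced_len((unsigned char) p[0]);
    while (n < want && n < left && ((unsigned char) p[n] & 0xC0) == 0x80)
        n++;
    return n;
}
```

On EILSEQ the caller would advance the input pointer by
utf8_skip_len(inp, inleft) rather than by one byte. Stopping at the
first non-continuation byte means a truncated or corrupt sequence does
not swallow the valid text that follows it.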

--Ken


