Re: [Nmh-workers] bug in decode_rfc2047()
From: Ken Hornstein
Subject: Re: [Nmh-workers] bug in decode_rfc2047()
Date: Thu, 03 Jan 2013 13:31:45 -0500
>The root of all this is iconv's behavior that requires us to
>skip past the invalid character. Looking at it now, I wonder if
>we can do better than the current special handling for UTF-8?
>It's the "fromutf8" block below:
>[...]
Hm. I played around with this a bit, and I'm not sure what to do.
iconv() doesn't distinguish between "We can't convert this character to
the target character set" and "This multibyte sequence is invalid"; they
both get EILSEQ. Even worse, we can't (portably) tell where the end of
a multibyte sequence is.
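To make the ambiguity concrete, here is a minimal sketch. convert_errno() is a hypothetical helper written for this message, not anything in nmh, and the "ASCII" and "UTF-8" charset names are assumed to be known to the local iconv.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: run a single iconv() call from UTF-8 to ASCII and
 * return 0 on success, or the errno that iconv()/iconv_open() reported. */
int
convert_errno(const char *in, size_t inlen)
{
    char outbuf[64];
    char *inp = (char *) in;    /* iconv() wants char **, it does not modify the data */
    char *outp = outbuf;
    size_t outleft = sizeof outbuf;
    int err = 0;
    iconv_t cd;

    assert(in != NULL);

    if ((cd = iconv_open("ASCII", "UTF-8")) == (iconv_t) -1)
        return errno;
    if (iconv(cd, &inp, &inlen, &outp, &outleft) == (size_t) -1)
        err = errno;
    iconv_close(cd);
    return err;
}
```

With glibc or GNU libiconv I'd expect convert_errno("\xC3\xA9", 2) (a valid e-acute that ASCII can't represent) and convert_errno("a\xFF" "b", 3) (a byte that is never valid UTF-8) to both return EILSEQ, so the caller can't tell the two cases apart from errno alone. Other implementations may substitute a character instead of failing in the first case.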
So, I see a couple of options. We could go completely portable and put
in a "?" (or whatever) for every byte that's invalid. That would have
us generate multiple "?" for multibyte character sets like UTF-8. We could
suppress multiple invalid bytes in a row so there's just one "?", but
that seems kinda lousy to me.
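A rough sketch of that portable approach, including the optional run suppression. convert_subst() and its "collapse" flag are hypothetical names made up for illustration; this is not the existing decode_rfc2047() code, and it assumes the target charset is ASCII-compatible so that a single '?' byte is valid output.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert inlen bytes of "in" from charset "from" to
 * charset "to".  On every EILSEQ, skip exactly one input byte and emit '?'.
 * If "collapse" is nonzero, a run of adjacent bad bytes yields one '?'.
 * Returns the number of bytes written (output is NUL-terminated), or
 * (size_t) -1 on any other failure. */
size_t
convert_subst(const char *to, const char *from, const char *in, size_t inlen,
              char *out, size_t outsize, int collapse)
{
    char *inp = (char *) in;
    char *outp = out;
    size_t outleft;
    int last_was_subst = 0;
    iconv_t cd;

    assert(in != NULL && out != NULL && outsize > 0);
    outleft = outsize - 1;              /* reserve room for the NUL */

    if ((cd = iconv_open(to, from)) == (iconv_t) -1)
        return (size_t) -1;

    while (inlen > 0) {
        char *before = inp;

        if (iconv(cd, &inp, &inlen, &outp, &outleft) != (size_t) -1)
            break;                      /* the rest converted cleanly */
        if (inp != before)
            last_was_subst = 0;         /* good bytes preceded this error */
        if (errno != EILSEQ) {
            iconv_close(cd);
            return (size_t) -1;         /* EINVAL, E2BIG, ...: give up */
        }
        if (!collapse || !last_was_subst) {
            if (outleft == 0) {
                iconv_close(cd);
                return (size_t) -1;     /* no room for the '?' */
            }
            *outp++ = '?';
            outleft--;
        }
        last_was_subst = 1;
        inp++;                          /* we can't tell how long the bad sequence is */
        inlen--;
    }

    iconv_close(cd);
    *outp = '\0';
    return (size_t) (outp - out);
}
```

Feeding it "a\xFF\xFE" "b" as UTF-8 with ASCII as the target should give "a??b" with collapse off and "a?b" with it on. The collapsed form can't distinguish one bad multibyte character from several bad characters in a row, which is part of why it seems lousy.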
GNU libiconv (which it seems like a fair number of people use) has
an iconvctl() function and it has an undocumented function that lets
you create your own substitution function for invalid bytes/codepoints.
That function isn't part of POSIX. The fact that it's undocumented and
nonstandard makes me think we shouldn't use it.
Unless we have a LOT of multibyte character sets to deal with, perhaps
the special-case here for UTF-8 is the best alternative? Any other thoughts
on this matter?
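For reference, the kind of UTF-8 special case I have in mind looks roughly like this. It is a sketch, not a copy of the existing "fromutf8" block, and utf8_skip_len() is a hypothetical name. The idea is that UTF-8 is the one multibyte encoding where the lead byte tells us how long the sequence is, so we can skip the whole character and emit a single "?".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given the bytes remaining at the point where the
 * conversion failed, return how many bytes to skip so that one character
 * maps to one '?'.  The expected length comes from the UTF-8 lead byte;
 * we stop early if a following byte is not a 10xxxxxx continuation byte. */
size_t
utf8_skip_len(const unsigned char *p, size_t avail)
{
    size_t want, n;

    assert(p != NULL && avail > 0);

    if (p[0] < 0x80)
        want = 1;                   /* plain ASCII byte */
    else if ((p[0] & 0xE0) == 0xC0)
        want = 2;                   /* 110xxxxx */
    else if ((p[0] & 0xF0) == 0xE0)
        want = 3;                   /* 1110xxxx */
    else if ((p[0] & 0xF8) == 0xF0)
        want = 4;                   /* 11110xxx */
    else
        return 1;                   /* stray continuation or invalid lead byte */

    /* Count the lead byte plus however many continuation bytes follow. */
    for (n = 1; n < want && n < avail; n++)
        if ((p[n] & 0xC0) != 0x80)
            break;
    return n;
}
```

After an EILSEQ, the caller would advance the input pointer by utf8_skip_len() bytes instead of one. A truncated or malformed sequence still advances at least one byte, so the loop always makes progress.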
--Ken