From: | Patrick Georgi |
Subject: | Re: [Monotone-devel] iconv diffs [Was: Why is utf8...] |
Date: | Sat, 17 Feb 2007 08:45:46 +0100 |
User-agent: | Thunderbird 1.5.0.8 (X11/20061204) |
Nathaniel Smith wrote:
> [...] have no idea what's going to happen on, say, OSX or *BSD or Solaris.

For Solaris: it will fail, because it can't find the table you refer to ("ASCII//whatever"), which is non-standard. The same goes for BSD, unless they reimplemented the GNU extension (in which case you'd better look out for implementation differences).
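
A minimal sketch of guarding against that failure, assuming the extension in question is a GNU-style suffix such as "//TRANSLIT" (the exact suffix is not stated here) and that the plain name "ASCII" is known to the platform, which may itself not hold everywhere. The helper name open_to_ascii is hypothetical.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: try the GNU-style "//TRANSLIT" target first and
 * fall back to the plain encoding name if the platform rejects it.
 * *used_extension reports which of the two names was attempted last. */
static iconv_t open_to_ascii(const char *from, int *used_extension)
{
    iconv_t cd = iconv_open("ASCII//TRANSLIT", from);
    if (cd != (iconv_t)-1) {
        *used_extension = 1;
        return cd;
    }
    /* The suffix is not portable; retry without it. */
    *used_extension = 0;
    return iconv_open("ASCII", from);
}
```

The caller still has to check the result against (iconv_t)-1, since the fallback name can be unknown as well.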

> One option is just to write our own "//IGNORE"-style iconv wrapper. iconv's normal API is that it does as much work as it can, then it tells you where it bombed out. It's perfectly possible at that point to skip ahead a byte or more on the input, stick a question mark in the output string, and then try again from there. Not the most efficient thing in the world, but probably a lot easier than trying to ship iconv conversion tables.

"Skip ahead a byte" is troublesome: if your illegal sequence is a multibyte character (or even some state-changing sequence in one of the more obscure encodings), your next character will be wrong or illegal, too.

But skipping a character should be possible:

- build another iconv state that translates the input encoding into the input encoding (unless that enables a fast path, which I'm not sure of; an alternative might be some encoding that is the ultimate superset, if such an encoding exists)
- push the first unknown byte into it. If that already creates a response, discard it (as it might be some header sequence) and restart with the same byte in the next step; otherwise start at the next byte
- until iconv emits a response, push byte after byte into it
- skip that many bytes in the input, replace them with one "?"

Not so simple anymore, but IMHO still easier than integrating GNU iconv.

Patrick Georgi
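
The steps above can be sketched roughly as follows. This is an illustration under assumptions, not a tested portable implementation:

- The names probe_char_len and convert_lossy are hypothetical.
- The probe looks at iconv's return value instead of watching for emitted output.
- The probe is reset before every attempt, so it ignores any shift state of stateful encodings.
- The raw byte '?' is written to the output, which is only right for ASCII-compatible target encodings.
- Whether a same-encoding conversion really validates its input (the fast-path worry) depends on the iconv implementation.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: guess how many bytes the character at `in` occupies
 * by converting the input encoding to itself, one more byte per attempt. */
static size_t probe_char_len(const char *enc, const char *in, size_t avail)
{
    size_t len = 1;                        /* fallback: skip a single byte */
    iconv_t probe = iconv_open(enc, enc);
    if (probe == (iconv_t)-1)
        return len;

    for (size_t n = 1; n <= avail; n++) {
        char scratch[64];
        char *src = (char *)in, *dst = scratch;
        size_t srcleft = n, dstleft = sizeof scratch;

        iconv(probe, NULL, NULL, NULL, NULL);   /* back to the initial state */
        if (iconv(probe, &src, &srcleft, &dst, &dstleft) != (size_t)-1) {
            len = n;                       /* n bytes form a complete unit */
            break;
        }
        if (errno != EINVAL)
            break;                         /* not merely incomplete: give up */
    }
    iconv_close(probe);
    return len;
}

/* Convert `in` from `from` to `to`, replacing each sequence iconv rejects
 * with one '?'. Returns the number of bytes written, or (size_t)-1 if the
 * converter cannot be opened or the output buffer is too small. */
static size_t convert_lossy(const char *from, const char *to,
                            const char *in, size_t inlen,
                            char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in, *dst = out;
    size_t srcleft = inlen, dstleft = outcap;
    int ok = 1;

    while (srcleft > 0) {
        if (iconv(cd, &src, &srcleft, &dst, &dstleft) != (size_t)-1)
            break;                         /* the rest converted cleanly */
        int err = errno;
        if (err == E2BIG || dstleft == 0) {
            ok = 0;                        /* no room left in `out` */
            break;
        }
        /* EILSEQ: skip one whole character. EINVAL: the input ends in a
         * truncated sequence, so drop whatever is left. */
        size_t skip = (err == EILSEQ)
                          ? probe_char_len(from, src, srcleft)
                          : srcleft;
        src += skip;
        srcleft -= skip;
        *dst++ = '?';
        dstleft--;
    }
    iconv_close(cd);
    return ok ? (size_t)(dst - out) : (size_t)-1;
}
```

With a UTF-8 source and an ASCII target, a call like convert_lossy("UTF-8", "ASCII", buf, len, out, sizeof out) would replace each rejected sequence in buf with a single "?" and carry on with the following character.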