[Groff] preconv
From: Bruno Haible
Subject: [Groff] preconv
Date: Sat, 31 Dec 2005 16:00:01 +0100
User-agent: KMail/1.5
Hello Werner,
It is a great joy to see groff accept input in most encodings!
Please consider the pieces of the appended patch:
- In the comments: libiconv is not just my libiconv, it is GNU libiconv.
- The current versions of libiconv and libc are 1.10 and 2.3.6, respectively,
and they support the same encodings as 1.9.1 and 2.3.3, respectively, plus
EUC-JISX0213 and Shift_JISX0213 (see below).
- In the emacs_to_mime conversion table:
- It is generally a bad idea to map ASCII to ISO-8859-1. Why?
1. Because when someday UTF-8 has replaced all 8-bit encodings, you
will not be able to map ASCII to UTF-8 without causing backward
compatibility problems.
2. Because many developers and users work in the US or in Western Europe.
If non-ASCII characters cause problems that are fatal for CJK and other
users, the majority of users will not notice them, and it will take
longer until they are fixed.
- "chinese-euc" is an alias for GB2312. I looked it up in the XEmacs sources.
- "cp437" and others: You might generally think that "CP437" and "IBM437"
are the same. When you look at the character set comparisons in
http://www.haible.de/bruno/charsets/conversion-tables/CP437.html
(and similar, see http://www.haible.de/bruno/charsets/conversion-tables/
for the complete list), you will see that on some systems, IBM437 is
not implemented in an ASCII-compatible way: It permutes the control
characters. This problem doesn't exist with the name "CP437". Therefore
I recommend using the "CPxxx" names rather than the "IBMxxx" names.
- You mapped
cp932 to SHIFT_JIS
cp936 to GB2312
cp949 to EUC-KR
cp950 to Big5
But these encodings are not the same! cp932 is a superset of SHIFT_JIS,
cp936 is a superset of GB2312, etc. Here are the character counts:
cp932   9795    SHIFT_JIS   8950
cp936  23334    GB2312      7573
cp949  17364    EUC-KR      8354
cp950  19440    Big5       13831
If you try to convert a piece of text which is in CP932 through a
SHIFT_JIS to Unicode converter, you will encounter many conversion errors,
and similarly for the other three pairs.
Therefore it is better to leave "cp932", "cp936", "cp949", "cp950" alone,
even though three of them are not MIME registered.
- EUC-JISX0213 and Shift_JISX0213 are supported by glibc and libiconv
nowadays. You can add them to the table.
- Likewise for ISO-2022-JP-3. Do you know the distinction Emacs makes
between "iso-2022-jp-3", "iso-2022-jp-3-compatible", and
"iso-2022-jp-3-strict"?
- "japanese-euc" is the same as EUC-JP. I looked in the XEmacs sources.
- It's worth a comment to indicate that Emacs's "koi8" encoding is not
the same as glibc's "KOI8" encoding.
- "korean-euc" is the same as EUC-KR. I looked in the XEmacs sources.
- In emacs2mime, one needs to test emacs_enc_len in order to avoid an invalid
memory access when emacs_enc_len<4.
- conversion_iconv: The handle allocated with iconv_open() needs to be closed
with iconv_close(). It contains memory and/or file descriptor resources.
- In BOM_table, I would not comment out the little-endian UTF-32 BOM. It is
the only way to prevent misinterpreting a file in little-endian UTF-32 as
little-endian UTF-16. You have to trust that the input file will not have
NUL characters.
- In do_file, I would provide an error message for the case that
emacs2mime failed. Otherwise you end up passing the empty string as an
encoding name to iconv_open(), which will lead to different results in
glibc and libiconv, and certainly not to a useful error message.
Bruno
--- src/preproc/preconv/preconv.cpp.bak 2005-12-30 10:31:50.000000000 +0100
+++ src/preproc/preconv/preconv.cpp 2005-12-31 00:58:11.000000000 +0100
@@ -61,7 +61,7 @@
// http://www.iana.org/assignments/character-sets
//
// For encodings which don't have a MIME tag we use GNU iconv's encoding
-// names (which also work with Bruno Haible's libinconv package). They
+// names (which also work with the portable GNU libiconv package). They
// are marked with `*'.
//
// Encodings marked with `--' are special to Emacs or other applications and
@@ -71,7 +71,7 @@
// nor by libiconv, or just one of them has support for it.
//
// A special case is VIQR encoding: Despite of having a MIME tag it is
-// missing in both libiconv 1.9.1 and iconv (coming with GNU libc 2.3.3).
+// missing in both libiconv 1.10 and iconv (coming with GNU libc 2.3.6).
//
// Finally, we add all aliases of GNU iconv for `ascii' (handled as
// latin-1), `latin1', and `utf8' to catch those encoding names before iconv
@@ -81,11 +81,11 @@
emacs_to_mime[] = {
{"alternativnyj", ""}, // ?
{"arabic-iso-8bit", "ISO-8859-6"},
- {"ascii", "ISO-8859-1"},
+ {"ascii", "US-ASCII"},
{"big5", "Big5"},
{"binary", ""}, // --
{"chinese-big5", "Big5"},
- {"chinese-euc", ""}, // XEmacs?
+ {"chinese-euc", "GB2312"},
{"chinese-hz", "HZ-GB-2312"},
{"chinese-iso-7bit", "ISO-2022-CN"},
{"chinese-iso-8bit", "GB2312"},
@@ -105,31 +105,31 @@
{"cp1256", "windows-1256"},
{"cp1257", "windows-1257"},
{"cp1258", "windows-1258"},
- {"cp437", "IBM437"},
+ {"cp437", "cp437"},
{"cp720", ""}, // not covered
{"cp737", "cp737"}, // *
- {"cp775", "IBM775"},
- {"cp850", "IBM850"},
- {"cp851", "IBM851"},
- {"cp852", "IBM852"},
- {"cp855", "IBM855"},
- {"cp857", "IBM857"},
- {"cp860", "IBM860"},
- {"cp861", "IBM861"},
- {"cp862", "IBM862"},
- {"cp863", "IBM863"},
- {"cp864", "IBM864"},
- {"cp865", "IBM865"},
- {"cp866", "IBM866"},
+ {"cp775", "cp775"},
+ {"cp850", "cp850"},
+ {"cp851", "cp851"},
+ {"cp852", "cp852"},
+ {"cp855", "cp855"},
+ {"cp857", "cp857"},
+ {"cp860", "cp860"},
+ {"cp861", "cp861"},
+ {"cp862", "cp862"},
+ {"cp863", "cp863"},
+ {"cp864", "cp864"},
+ {"cp865", "cp865"},
+ {"cp866", "cp866"},
{"cp866u", "cp1125"}, // *
- {"cp869", "IBM869"},
+ {"cp869", "cp869"},
{"cp874", "cp874"}, // *
{"cp878", "KOI8-R"},
- {"cp932", "SHIFT_JIS"},
- {"cp936", "GB2312"},
- {"cp949", "EUC-KR"},
- {"cp950", "Big5"},
- {"csascii", "ISO-8859-1"}, // alias
+ {"cp932", "cp932"}, // *
+ {"cp936", "cp936"},
+ {"cp949", "cp949"}, // *
+ {"cp950", "cp950"}, // *
+ {"csascii", "US-ASCII"}, // alias
{"csisolatin1", "ISO-8859-1"}, // alias
{"ctext", ""}, // --
{"ctext-no-compositions", ""}, // --
@@ -146,7 +146,7 @@
{"euc-cn", "GB2312"},
{"euc-japan", "EUC-JP"},
{"euc-japan-1990", "EUC-JP"},
- {"euc-jisx0213", ""}, // XEmacs?
+ {"euc-jisx0213", "EUC-JISX0213"}, // *
{"euc-jisx0213-with-esc", ""}, // XEmacs?
{"euc-jp", "EUC-JP"},
{"euc-korea", "EUC-KR"},
@@ -182,9 +182,9 @@
{"iso-2022-jp", "ISO-2022-JP"},
{"iso-2022-jp-1978-irv", "ISO-2022-JP"},
{"iso-2022-jp-2", "ISO-2022-JP-2"},
- {"iso-2022-jp-3", ""}, // XEmacs?
+ {"iso-2022-jp-3", "ISO-2022-JP-3"}, // *
{"iso-2022-jp-3-compatible", ""}, // XEmacs?
- {"iso-2022-jp-3-strict", ""}, // XEmacs?
+ {"iso-2022-jp-3-strict", "ISO-2022-JP-3"}, // *
{"iso-2022-kr", "ISO-2022-KR"},
{"iso-2022-lock", ""}, // XEmacs?
{"iso-8859-1", "ISO-8859-1"},
@@ -223,15 +223,15 @@
{"japanese-iso-7bit-1978-irv", "ISO-2022-JP"},
{"japanese-iso-8bit", "EUC-JP"},
{"japanese-iso-8bit-with-esc", ""}, // --
- {"japanese-euc", ""}, // XEmacs?
+ {"japanese-euc", "EUC-JP"}, // *
{"japanese-shift-jis", "Shift_JIS"},
{"japanese-shift-jisx0213", ""}, // XEmacs?
{"junet", "ISO-2022-JP"},
- {"koi8", "KOI8-R"},
+ {"koi8", "KOI8-R"}, // not KOI8 !
{"koi8-r", "KOI8-R"},
{"koi8-t", "KOI8-T"}, // *
{"koi8-u", "KOI8-U"},
- {"korean-euc", ""}, // XEmacs?
+ {"korean-euc", "EUC-KR"},
{"korean-iso-7bit-lock", "ISO-2022-KR"},
{"korean-iso-8bit", "EUC-KR"},
{"korean-iso-8bit-with-esc", ""}, // --
@@ -267,7 +267,7 @@
{"raw-text", ""}, // --
{"ruscii", "cp1125"}, // *
{"shift_jis", "Shift_JIS"},
- {"shift_jisx0213", ""}, // XEmacs?
+ {"shift_jisx0213", "Shift_JISX0213"}, // *
{"sjis", "Shift_JIS"},
{"tcvn", "TCVN"}, // *
{"tcvn-5712", "TCVN"}, // *
@@ -320,11 +320,11 @@
emacs2mime(char *emacs_enc)
{
int emacs_enc_len = strlen(emacs_enc);
- if (!strcasecmp(emacs_enc + emacs_enc_len - 4, "-dos"))
+ if (emacs_enc_len > 4 && !strcasecmp(emacs_enc + emacs_enc_len - 4, "-dos"))
emacs_enc[emacs_enc_len - 4] = 0;
- if (!strcasecmp(emacs_enc + emacs_enc_len - 4, "-mac"))
+ if (emacs_enc_len > 4 && !strcasecmp(emacs_enc + emacs_enc_len - 4, "-mac"))
emacs_enc[emacs_enc_len - 4] = 0;
- if (!strcasecmp(emacs_enc + emacs_enc_len - 5, "-unix"))
+ if (emacs_enc_len > 5 && !strcasecmp(emacs_enc + emacs_enc_len - 5, "-unix"))
emacs_enc[emacs_enc_len - 5] = 0;
for (const conversion *table = emacs_to_mime; table->from; table++)
if (!strcasecmp(emacs_enc, table->from))
@@ -662,6 +662,7 @@
}
read_start = inbuf + inbytes_left;
}
+ iconv_close(handle);
// XXX use ferror?
limit = (char *)outbuf + BUFSIZ * sizeof (int) - outbytes_left;
for (int *ptr = outbuf; (char *)ptr < limit; ptr++)
@@ -696,7 +697,7 @@
const char *str;
} BOM_table[] = {
{4, "\x00\x00\xFE\xFF"},
-// {4, "\xFF\xFE\x00\x00"},
+ {4, "\xFF\xFE\x00\x00"},
{3, "\xEF\xBB\xBF"},
{2, "\xFE\xFF"},
{2, "\xFF\xFE"},
@@ -961,7 +962,12 @@
encoding_string[MAX_VAR_LEN - 1] = 0;
encoding = encoding_string;
// Translate from MIME & Emacs encoding names to locale encoding names.
- encoding = emacs2mime(encoding);
+ encoding = emacs2mime(encoding_string);
+ if (encoding[0] == '\0') {
+ error("encoding `%1' not supported, not a portable encoding",
+ encoding_string);
+ return 0;
+ }
if (debug)
fprintf(stderr, " encoding used: `%s'\n", encoding);
data = BOM + data;