[Groff] preconv

From:	Bruno Haible
Subject:	[Groff] preconv
Date:	Sat, 31 Dec 2005 16:00:01 +0100
User-agent:	KMail/1.5
Hello Werner,

It is a great joy to see groff accept input in most encodings!

Please consider the pieces of the appended patch:

- In the comments: libiconv is not just my libiconv, it is GNU libiconv.

- The current versions of libiconv and libc are 1.10 and 2.3.6, respectively,
  and they support the same encodings as 1.9.1 and 2.3.3, respectively, plus
  EUC-JISX0213 and Shift_JISX0213 (see below).

- In the emacs_to_mime conversion table:
  - It is generally a bad idea to map ASCII to ISO-8859-1. Why?
    1. Because when someday UTF-8 has replaced all 8-bit encodings, you
       will not be able to map ASCII to UTF-8 without causing backward
       compatibility problems.
    2. Because many developers and users work in the US or in Western Europe.
       If a bug affects only non-ASCII characters, it can be fatal for CJK
       and other users, yet the majority of users will not notice it, and so
       it will take longer to get fixed.
  - "chinese-euc" is an alias for GB2312. I looked it up in the XEmacs sources.
  - "cp437" and others: You might generally think that "CP437" and "IBM437"
    are the same. When you look at the character set comparisons in
      http://www.haible.de/bruno/charsets/conversion-tables/CP437.html
    (and similar, see http://www.haible.de/bruno/charsets/conversion-tables/
    for the complete list), you will see that on some systems, IBM437 is
    not implemented in an ASCII-compatible way: It permutes the control
    characters. This problem doesn't exist with the name "CP437". Therefore
    I recommend using the "CPxxx" names rather than the "IBMxxx" names.
  - You mapped
      cp932 to SHIFT_JIS
      cp936 to GB2312
      cp949 to EUC-KR
      cp950 to Big5
    But these encodings are not the same! cp932 is a superset of SHIFT_JIS,
    cp936 is a superset of GB2312, etc. Here are the character counts:
       cp932  9795     SHIFT_JIS 8950
       cp936 23334     GB2312    7573
       cp949 17364     EUC-KR    8354
       cp950 19440     Big5     13831
    If you try to convert a piece of text which is in CP932 through a
    SHIFT_JIS to Unicode converter, you will encounter many conversion errors,
    and similarly for the other three pairs.
    Therefore it is better to leave "cp932", "cp936", "cp949", "cp950" alone,
    even though three of them are not MIME registered.
  - EUC-JISX0213 and Shift_JISX0213 are supported by glibc and libiconv
    nowadays. You can add them to the table.
  - Likewise for ISO-2022-JP-3. Do you know the distinction Emacs makes
    between "iso-2022-jp-3", "iso-2022-jp-3-compatible", and
    "iso-2022-jp-3-strict"?
  - "japanese-euc" is the same as EUC-JP. I looked in the XEmacs sources.
  - It's worth a comment to indicate that Emacs's "koi8" encoding is not
    the same as glibc's "KOI8" encoding.
  - "korean-euc" is the same as EUC-KR. I looked in the XEmacs sources.

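- In emacs2mime, one needs to test emacs_enc_len in order to avoid an invalid
  memory access when the name is shorter than the suffix being compared
  (emacs_enc_len<4 for "-dos" and "-mac", emacs_enc_len<5 for "-unix").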
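
To illustrate the length check, here is a minimal sketch. The helper
strip_eol_suffix is hypothetical and is not the code in preconv.cpp. It only
shows the idea of comparing lengths before forming the pointer, so that the
pointer never lands before the start of the buffer. It assumes a POSIX
strcasecmp() from <strings.h>.

```cpp
#include <cassert>
#include <cstring>
#include <strings.h>  // strcasecmp (POSIX)

// Hypothetical helper, not part of preconv: strips one end-of-line suffix
// ("-dos", "-mac", "-unix") from an Emacs coding-system name, in place.
// Returns true if a suffix was removed.
static bool strip_eol_suffix(char *name)
{
  assert(name != nullptr);
  static const char *const suffixes[] = {"-dos", "-mac", "-unix"};
  size_t len = strlen(name);
  for (const char *suffix : suffixes) {
    size_t slen = strlen(suffix);
    // Test the length first, so that name + len - slen never points
    // before the start of the buffer.
    if (len > slen && !strcasecmp(name + len - slen, suffix)) {
      name[len - slen] = '\0';
      return true;
    }
  }
  return false;
}
```

With this sketch a name like "utf-8-unix" is shortened to "utf-8", while a
short name like "sjis" is left untouched without reading outside the string.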

- conversion_iconv: The handle allocated with iconv_open() needs to be closed
  with iconv_close(). It contains memory and/or file descriptor resources.
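
One way to make sure the handle is always released is to tie it to a scope.
The following is only a sketch under assumptions: the IconvHandle class and
the convert() function are hypothetical and not taken from preconv, the
iconv() prototype is assumed to take char ** for the input pointer (as in
glibc and current GNU libiconv), and the encoding names "ISO-8859-1" and
"UTF-8" are assumed to be available.

```cpp
#include <cassert>
#include <cerrno>
#include <iconv.h>
#include <string>

// Hypothetical wrapper: owns an iconv handle and closes it in the
// destructor, so that no exit path can leak it.
class IconvHandle {
public:
  IconvHandle(const char *to, const char *from)
    : cd_(iconv_open(to, from)) {}
  ~IconvHandle() { if (valid()) iconv_close(cd_); }
  IconvHandle(const IconvHandle &) = delete;
  IconvHandle &operator=(const IconvHandle &) = delete;
  bool valid() const { return cd_ != (iconv_t)-1; }
  iconv_t get() const { return cd_; }
private:
  iconv_t cd_;
};

// Converts IN from encoding FROM to encoding TO and stores the result in
// OUT.  Returns false if the handle cannot be opened or the input cannot
// be converted completely.
static bool convert(const char *to, const char *from,
                    const std::string &in, std::string &out)
{
  assert(to != nullptr && from != nullptr);
  IconvHandle h(to, from);
  if (!h.valid())
    return false;
  std::string src(in);          // iconv() wants a non-const input pointer
  char *inp = &src[0];
  size_t inleft = src.size();
  char buf[256];
  out.clear();
  while (inleft > 0) {
    char *outp = buf;
    size_t outleft = sizeof buf;
    size_t r = iconv(h.get(), &inp, &inleft, &outp, &outleft);
    out.append(buf, sizeof buf - outleft);
    if (r == (size_t)-1 && errno != E2BIG)
      return false;             // the destructor still closes the handle
  }
  return true;
}
```

In a plain function like conversion_iconv, a single iconv_close(handle) after
the loop, as in the patch, achieves the same thing. The wrapper only matters
when there are several return paths.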

- In BOM_table, I would not comment out the little-endian UTF-32 BOM. It is
  the only way to prevent misinterpreting a file in little-endian UTF-32 as
  little-endian UTF-16. You have to trust that the input file will not have
  NUL characters.
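
The reason the order of the entries matters can be sketched as follows. This
is an illustration only: the Bom struct, the encoding-name column, and
detect_bom() are assumptions made for the example and are not the
definitions in preconv.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Illustrative table entry; the third field is an assumption.
struct Bom {
  size_t len;
  const char *bytes;
  const char *encoding;
};

// Longer signatures come first: "\xFF\xFE\x00\x00" (little-endian UTF-32)
// begins with the little-endian UTF-16 signature "\xFF\xFE" and therefore
// has to be tried before it.
static const Bom bom_table[] = {
  {4, "\x00\x00\xFE\xFF", "UTF-32BE"},
  {4, "\xFF\xFE\x00\x00", "UTF-32LE"},
  {3, "\xEF\xBB\xBF",     "UTF-8"},
  {2, "\xFE\xFF",         "UTF-16BE"},
  {2, "\xFF\xFE",         "UTF-16LE"},
};

// Returns the encoding implied by a BOM at the start of DATA (N bytes),
// or nullptr if no known BOM is present.
static const char *detect_bom(const char *data, size_t n)
{
  assert(data != nullptr);
  for (const Bom &b : bom_table)
    if (n >= b.len && !memcmp(data, b.bytes, b.len))
      return b.encoding;
  return nullptr;
}
```

If the UTF-32LE row is removed, the first two bytes of a little-endian UTF-32
file match the UTF-16LE row instead. The price of keeping the row is that a
little-endian UTF-16 file whose first character after the BOM is NUL would be
taken for UTF-32.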

- In do_file, I would provide an error message for the case that
  emacs2mime failed. Otherwise you end up passing the empty string as an
  encoding name to iconv_open(), which will lead to different results in
  glibc and libiconv, and certainly not to a useful error message.
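
The check can be sketched like this. The three table rows, lookup(), and
resolve_encoding() are hypothetical stand-ins for emacs_to_mime and
emacs2mime, and passing unknown names through unchanged is an assumption made
for the example.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <strings.h>  // strcasecmp (POSIX)

struct Conversion {
  const char *from;
  const char *to;
};

// A few rows in the style of emacs_to_mime; "" marks an unsupported name.
static const Conversion table[] = {
  {"binary",       ""},
  {"japanese-euc", "EUC-JP"},
  {"korean-euc",   "EUC-KR"},
  {nullptr,        nullptr},
};

// Maps an Emacs name to the table's encoding name.  Names that are not
// in the table are passed through unchanged (an assumption).
static const char *lookup(const char *name)
{
  for (const Conversion *t = table; t->from; t++)
    if (!strcasecmp(name, t->from))
      return t->to;
  return name;
}

// Returns the name to hand to iconv_open(), or nullptr after reporting
// an error for a name which the table marks as unsupported.
static const char *resolve_encoding(const char *name)
{
  assert(name != nullptr);
  const char *mime = lookup(name);
  if (mime[0] == '\0') {
    fprintf(stderr, "encoding `%s' not supported\n", name);
    return nullptr;
  }
  return mime;
}
```

The point is only that the empty string is caught and reported before any
iconv function sees it.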

Bruno


--- src/preproc/preconv/preconv.cpp.bak 2005-12-30 10:31:50.000000000 +0100
+++ src/preproc/preconv/preconv.cpp     2005-12-31 00:58:11.000000000 +0100
@@ -61,7 +61,7 @@
 //   http://www.iana.org/assignments/character-sets
 //
 // For encodings which don't have a MIME tag we use GNU iconv's encoding
-// names (which also work with Bruno Haible's libinconv package).  They
+// names (which also work with the portable GNU libiconv package).  They
 // are marked with `*'.
 //
 // Encodings marked with `--' are special to Emacs or other applications and
@@ -71,7 +71,7 @@
 // nor by libiconv, or just one of them has support for it.
 //
 // A special case is VIQR encoding: Despite of having a MIME tag it is
-// missing in both libiconv 1.9.1 and iconv (coming with GNU libc 2.3.3).
+// missing in both libiconv 1.10 and iconv (coming with GNU libc 2.3.6).
 //
 // Finally, we add all aliases of GNU iconv for `ascii' (handled as
 // latin-1), `latin1', and `utf8' to catch those encoding names before iconv
@@ -81,11 +81,11 @@
 emacs_to_mime[] = {
   {"alternativnyj",                    ""},            // ?
   {"arabic-iso-8bit",                  "ISO-8859-6"},
-  {"ascii",                            "ISO-8859-1"},
+  {"ascii",                            "US-ASCII"},
   {"big5",                             "Big5"},
   {"binary",                           ""},            // --
   {"chinese-big5",                     "Big5"},
-  {"chinese-euc",                      ""},            // XEmacs?
+  {"chinese-euc",                      "GB2312"},
   {"chinese-hz",                       "HZ-GB-2312"},
   {"chinese-iso-7bit",                 "ISO-2022-CN"},
   {"chinese-iso-8bit",                 "GB2312"},
@@ -105,31 +105,31 @@
   {"cp1256",                           "windows-1256"},
   {"cp1257",                           "windows-1257"},
   {"cp1258",                           "windows-1258"},
-  {"cp437",                            "IBM437"},
+  {"cp437",                            "cp437"},
   {"cp720",                            ""},            // not covered
   {"cp737",                            "cp737"},       // *
-  {"cp775",                            "IBM775"},
-  {"cp850",                            "IBM850"},
-  {"cp851",                            "IBM851"},
-  {"cp852",                            "IBM852"},
-  {"cp855",                            "IBM855"},
-  {"cp857",                            "IBM857"},
-  {"cp860",                            "IBM860"},
-  {"cp861",                            "IBM861"},
-  {"cp862",                            "IBM862"},
-  {"cp863",                            "IBM863"},
-  {"cp864",                            "IBM864"},
-  {"cp865",                            "IBM865"},
-  {"cp866",                            "IBM866"},
+  {"cp775",                            "cp775"},
+  {"cp850",                            "cp850"},
+  {"cp851",                            "cp851"},
+  {"cp852",                            "cp852"},
+  {"cp855",                            "cp855"},
+  {"cp857",                            "cp857"},
+  {"cp860",                            "cp860"},
+  {"cp861",                            "cp861"},
+  {"cp862",                            "cp862"},
+  {"cp863",                            "cp863"},
+  {"cp864",                            "cp864"},
+  {"cp865",                            "cp865"},
+  {"cp866",                            "cp866"},
   {"cp866u",                           "cp1125"},      // *
-  {"cp869",                            "IBM869"},
+  {"cp869",                            "cp869"},
   {"cp874",                            "cp874"},       // *
   {"cp878",                            "KOI8-R"},
-  {"cp932",                            "SHIFT_JIS"},
-  {"cp936",                            "GB2312"},
-  {"cp949",                            "EUC-KR"},
-  {"cp950",                            "Big5"},
-  {"csascii",                          "ISO-8859-1"},  // alias
+  {"cp932",                            "cp932"},       // *
+  {"cp936",                            "cp936"},
+  {"cp949",                            "cp949"},       // *
+  {"cp950",                            "cp950"},       // *
+  {"csascii",                          "US-ASCII"},    // alias
   {"csisolatin1",                      "ISO-8859-1"},  // alias
   {"ctext",                            ""},            // --
   {"ctext-no-compositions",            ""},            // --
@@ -146,7 +146,7 @@
   {"euc-cn",                           "GB2312"},
   {"euc-japan",                                "EUC-JP"},
   {"euc-japan-1990",                   "EUC-JP"},
-  {"euc-jisx0213",                     ""},            // XEmacs?
+  {"euc-jisx0213",                     "EUC-JISX0213"}, // *
   {"euc-jisx0213-with-esc",            ""},            // XEmacs?
   {"euc-jp",                           "EUC-JP"},
   {"euc-korea",                                "EUC-KR"},
@@ -182,9 +182,9 @@
   {"iso-2022-jp",                      "ISO-2022-JP"},
   {"iso-2022-jp-1978-irv",             "ISO-2022-JP"},
   {"iso-2022-jp-2",                    "ISO-2022-JP-2"},
-  {"iso-2022-jp-3",                    ""},            // XEmacs?
+  {"iso-2022-jp-3",                    "ISO-2022-JP-3"}, // *
   {"iso-2022-jp-3-compatible",         ""},            // XEmacs?
-  {"iso-2022-jp-3-strict",             ""},            // XEmacs?
+  {"iso-2022-jp-3-strict",             "ISO-2022-JP-3"}, // *
   {"iso-2022-kr",                      "ISO-2022-KR"},
   {"iso-2022-lock",                    ""},            // XEmacs?
   {"iso-8859-1",                       "ISO-8859-1"},
@@ -223,15 +223,15 @@
   {"japanese-iso-7bit-1978-irv",       "ISO-2022-JP"},
   {"japanese-iso-8bit",                        "EUC-JP"},
   {"japanese-iso-8bit-with-esc",       ""},            // --
-  {"japanese-euc",                     ""},            // XEmacs?
+  {"japanese-euc",                     "EUC-JP"},      // *
   {"japanese-shift-jis",               "Shift_JIS"},
   {"japanese-shift-jisx0213",          ""},            // XEmacs?
   {"junet",                            "ISO-2022-JP"},
-  {"koi8",                             "KOI8-R"},
+  {"koi8",                             "KOI8-R"},      // not KOI8 !
   {"koi8-r",                           "KOI8-R"},
   {"koi8-t",                           "KOI8-T"},      // *
   {"koi8-u",                           "KOI8-U"},
-  {"korean-euc",                       ""},            // XEmacs?
+  {"korean-euc",                       "EUC-KR"},
   {"korean-iso-7bit-lock",             "ISO-2022-KR"},
   {"korean-iso-8bit",                  "EUC-KR"},
   {"korean-iso-8bit-with-esc",         ""},            // --
@@ -267,7 +267,7 @@
   {"raw-text",                         ""},            // --
   {"ruscii",                           "cp1125"},      // *
   {"shift_jis",                                "Shift_JIS"},
-  {"shift_jisx0213",                   ""},            // XEmacs?
+  {"shift_jisx0213",                   "Shift_JISX0213"}, // *
   {"sjis",                             "Shift_JIS"},
   {"tcvn",                             "TCVN"},        // *
   {"tcvn-5712",                                "TCVN"},        // *
@@ -320,11 +320,11 @@
 emacs2mime(char *emacs_enc)
 {
   int emacs_enc_len = strlen(emacs_enc);
-  if (!strcasecmp(emacs_enc + emacs_enc_len - 4, "-dos"))
+  if (emacs_enc_len > 4 && !strcasecmp(emacs_enc + emacs_enc_len - 4, "-dos"))
     emacs_enc[emacs_enc_len - 4] = 0;
-  if (!strcasecmp(emacs_enc + emacs_enc_len - 4, "-mac"))
+  if (emacs_enc_len > 4 && !strcasecmp(emacs_enc + emacs_enc_len - 4, "-mac"))
     emacs_enc[emacs_enc_len - 4] = 0;
-  if (!strcasecmp(emacs_enc + emacs_enc_len - 5, "-unix"))
+  if (emacs_enc_len > 5 && !strcasecmp(emacs_enc + emacs_enc_len - 5, "-unix"))
     emacs_enc[emacs_enc_len - 5] = 0;
   for (const conversion *table = emacs_to_mime; table->from; table++)
     if (!strcasecmp(emacs_enc, table->from))
@@ -662,6 +662,7 @@
     }
     read_start = inbuf + inbytes_left;
   }
+  iconv_close(handle);
   // XXX use ferror?
   limit = (char *)outbuf + BUFSIZ * sizeof (int) - outbytes_left;
   for (int *ptr = outbuf; (char *)ptr < limit; ptr++)
@@ -696,7 +697,7 @@
     const char *str;
   } BOM_table[] = {
     {4, "\x00\x00\xFE\xFF"},
-//  {4, "\xFF\xFE\x00\x00"},
+    {4, "\xFF\xFE\x00\x00"},
     {3, "\xEF\xBB\xBF"},
     {2, "\xFE\xFF"},
     {2, "\xFF\xFE"},
@@ -961,7 +962,12 @@
   encoding_string[MAX_VAR_LEN - 1] = 0;
   encoding = encoding_string;
   // Translate from MIME & Emacs encoding names to locale encoding names.
-  encoding = emacs2mime(encoding);
+  encoding = emacs2mime(encoding_string);
+  if (encoding[0] == '\0') {
+    error("encoding `%1' not supported, not a portable encoding",
+          encoding_string);
+    return 0;
+  }
   if (debug)
     fprintf(stderr, "  encoding used: `%s'\n", encoding);
   data = BOM + data;