Re: [PATCH] encoding-guesser: Fall back to windows-1252 when UTF-8 can't be right.

pspp-dev

From:	John Darrington
Subject:	Re: [PATCH] encoding-guesser: Fall back to windows-1252 when UTF-8 can't be right.
Date:	Thu, 1 Mar 2012 21:08:10 +0000
User-agent:	Mutt/1.5.18 (2008-05-17)
It looks fine to me.

Would it be possible to generalize it?  That is, when the fallback
encoding is X, where X is a multi-byte encoding, and we know that the
input is not X, could it also fall back to windows-1252?
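
To make the question concrete, here is a rough standalone sketch of what
such a generalization might look like.  It is not PSPP code: the table
mb_encodings, the struct mb_encoding, and the functions valid_utf8() and
generalized_fallback() are all hypothetical names invented for this
illustration, and only UTF-8 has a validity checker here.  Names are
compared case-sensitively for brevity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns true if the N bytes at S are well-formed
   UTF-8 (rejects overlong forms, surrogates, and values above U+10FFFF). */
static bool
valid_utf8 (const unsigned char *s, size_t n)
{
  size_t i = 0;

  while (i < n)
    {
      unsigned char c = s[i];
      unsigned long cp, min;
      size_t len;

      if (c < 0x80)
        {
          i++;                  /* Plain ASCII byte. */
          continue;
        }
      else if ((c & 0xe0) == 0xc0) { len = 2; cp = c & 0x1f; min = 0x80; }
      else if ((c & 0xf0) == 0xe0) { len = 3; cp = c & 0x0f; min = 0x800; }
      else if ((c & 0xf8) == 0xf0) { len = 4; cp = c & 0x07; min = 0x10000; }
      else
        return false;           /* Stray continuation or invalid lead byte. */

      if (n - i < len)
        return false;           /* Sequence runs past the end of the input. */
      for (size_t j = 1; j < len; j++)
        {
          if ((s[i + j] & 0xc0) != 0x80)
            return false;
          cp = (cp << 6) | (s[i + j] & 0x3f);
        }
      if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
        return false;
      i += len;
    }
  return true;
}

/* Hypothetical table: multi-byte encodings for which we can tell that
   some input is definitely not in that encoding. */
struct mb_encoding
  {
    const char *name;
    bool (*is_valid) (const unsigned char *, size_t);
  };

static const struct mb_encoding mb_encodings[] = {
  { "UTF-8", valid_utf8 },
  /* Other multi-byte encodings would each need their own checker. */
};

/* Returns FALLBACK, unless FALLBACK is a known multi-byte encoding and
   the N bytes at DATA cannot be in it, in which case returns
   "windows-1252". */
static const char *
generalized_fallback (const char *fallback, const void *data, size_t n)
{
  for (size_t i = 0; i < sizeof mb_encodings / sizeof *mb_encodings; i++)
    if (!strcmp (fallback, mb_encodings[i].name)
        && !mb_encodings[i].is_valid (data, n))
      return "windows-1252";
  return fallback;
}
```

With this sketch, generalized_fallback ("UTF-8", "entr\351e", 6) yields
"windows-1252", while a single-byte fallback such as "ISO-8859-1" is
returned unchanged.  The hard part of a real generalization would be
supplying reliable checkers for the other multi-byte encodings.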

On Wed, Feb 29, 2012 at 10:44:30PM -0800, Ben Pfaff wrote:
     Until now the encoding-guesser code has used UTF-8 as a fallback in
     situations where we can tell that the file is not valid UTF-8.  In
     this kind of situation having a single-byte character set as a
     fallback makes more sense.  This commit hard-codes windows-1252 as
     that fallback, since it is a widely encountered encoding (and
     compatible with ISO-8859-1 as well).
     
     John Darrington originally suggested this, if I recall correctly.
     
     The bug report that spurred this work was from Harry Thijssen.  With
     this commit, PSPP properly reads his windows-1252 file when the
     system locale uses UTF-8 encoding.
     ---
     I'm looking for a review of this patch before I push it to master.
     Thanks!
     
      doc/utilities.texi                |   23 +++++++++++-----
      src/libpspp/encoding-guesser.c    |   55 ++++++++++++++++++++++++-------------
      src/libpspp/encoding-guesser.h    |    8 +++--
      src/libpspp/i18n.c                |   14 +++++++++
      src/libpspp/i18n.h                |    2 +
      src/libpspp/u8-istream.c          |    5 ++-
      tests/libpspp/encoding-guesser.at |    8 +++++
      7 files changed, 84 insertions(+), 31 deletions(-)
     
     diff --git a/doc/utilities.texi b/doc/utilities.texi
     index 40648d4..35dd393 100644
     --- a/doc/utilities.texi
     +++ b/doc/utilities.texi
     @@ -313,14 +313,23 @@ are @code{ASCII} (United States), @code{ISO-8859-1} (western Europe),
      @code{EUC-JP} (Japan), and @code{windows-1252} (Windows).  Not all
      systems support all character sets.
      
     -@item @code{Auto}
      @item @code{Auto,@var{encoding}}
     -Automatically detects whether a syntax file is encoded in
     -@var{encoding} or in a Unicode encoding such as UTF-8, UTF-16, or
     -UTF-32.  The @var{encoding} may be an IANA character set name or
     -@code{Locale} (the default).  Only ASCII compatible encodings can
     -automatically be distinguished from UTF-8 (the most common locale
     -encodings are all ASCII-compatible).
     +Automatically detects whether a syntax file is encoded in an Unicode
     +encoding such as UTF-8, UTF-16, or UTF-32.  If it is not, then PSPP
     +generally assumes that the file is encoded in @var{encoding} (an IANA
     +character set name).  However, if @var{encoding} is UTF-8, and the
     +syntax file is not valid UTF-8, PSPP instead assumes that the file
     +is encoded in @code{windows-1252}.
     +
     +For best results, @var{encoding} should be an ASCII-compatible
     +encoding (the most common locale encodings are all ASCII-compatible),
     +because encodings that are not ASCII compatible cannot be
     +automatically distinguished from UTF-8.
     +
     address@hidden @code{Auto}
     address@hidden @code{Auto,Locale}
     +Automatic detection, as above, with the default encoding taken from
     +the system locale or the setting on SET LOCALE.
      @end table
      
      When ENCODING is not specified, the default is taken from the
     diff --git a/src/libpspp/encoding-guesser.c b/src/libpspp/encoding-guesser.c
     index 27e2cda..bee2978 100644
     --- a/src/libpspp/encoding-guesser.c
     +++ b/src/libpspp/encoding-guesser.c
     @@ -36,22 +36,26 @@
         of information about encoding detection.
      */
      
     -/* Parses and returns the fallback encoding from ENCODING, which must be in one
     -   of the forms described at the top of encoding-guesser.h.  The returned
     -   string might be ENCODING itself or a suffix of it, or it might be a
     -   statically allocated string. */
     +/* Returns the encoding specified by ENCODING, which must be in one of the
     +   forms described at the top of encoding-guesser.h.  The returned string might
     +   be ENCODING itself or a suffix of it, or it might be a statically allocated
     +   string. */
      const char *
      encoding_guess_parse_encoding (const char *encoding)
      {
     +  const char *fallback;
     +
        if (encoding == NULL
            || !c_strcasecmp (encoding, "auto")
            || !c_strcasecmp (encoding, "auto,locale")
            || !c_strcasecmp (encoding, "locale"))
     -    return locale_charset ();
     +    fallback = locale_charset ();
        else if (!c_strncasecmp (encoding, "auto,", 5))
     -    return encoding + 5;
     +    fallback = encoding + 5;
        else
          return encoding;
     +
     +  return is_encoding_utf8 (fallback) ? "windows-1252" : fallback;
      }
      
      /* Returns true if ENCODING, which must be in one of the forms described at the
     @@ -267,16 +271,37 @@ const char *
      encoding_guess_tail_encoding (const char *encoding,
                                    const void *data, size_t n)
      {
     -  return (encoding_guess_tail_is_utf8 (data, n)
     +  return (encoding_guess_tail_is_utf8 (data, n) != 0
                ? "UTF-8"
                : encoding_guess_parse_encoding (encoding));
      }
      
     -/* Same as encoding_guess_tail_encoding() but returns true for UTF-8 or false
     -   for the fallback encoding. */
     -bool
     +/* Returns an encoding guess based on ENCODING and the N bytes of text starting
     +   at DATA.  DATA should start with the first non-ASCII text character (as
     +   determined by encoding_guess_is_ascii_text()) found in the input.
     +
     +   The return value is:
     +
     +       0, if the encoding is definitely not UTF-8 (because the input contains
     +       byte sequences that are not valid in UTF-8).
     +
     +       1, if the encoding appears to be UTF-8 (because the input contains valid
     +       UTF-8 multibyte sequences).
     +
     +       -1, if the input contains only ASCII characters.  (This means that the
     +       input may be treated as UTF-8, since ASCII is a subset of UTF-8.)
     +
     +   See encoding-guesser.h for intended use of this function.
     +
     +   N must be at least ENCODING_GUESS_MIN, unless the file has fewer bytes than
     +   that starting with the first non-ASCII text character. */
     +int
      encoding_guess_tail_is_utf8 (const void *data, size_t n)
      {
     +  /* If all the bytes are in the ASCII range, it's just ASCII. */
     +  if (encoding_guess_count_ascii (data, n) == n)
     +    return -1;
     +
        return (n < ENCODING_GUESS_MIN
                ? u8_check (data, n) == NULL
                : is_all_utf8_text (data, n));
     @@ -297,15 +322,7 @@ encoding_guess_whole_file (const char *encoding, const void *text, size_t size)
      
        guess = encoding_guess_head_encoding (encoding, text, size);
        if (!strcmp (guess, "ASCII") && encoding_guess_encoding_is_auto (encoding))
     -    {
     -      size_t ofs = encoding_guess_count_ascii (text, size);
     -      if (ofs < size)
     -        return encoding_guess_tail_encoding (encoding,
     -                                             (const char *) text + ofs,
     -                                             size - ofs);
     -      else
     -        return encoding_guess_parse_encoding (encoding);
     -    }
     +    return encoding_guess_tail_encoding (encoding, text, size);
        else
          return guess;
      }
     diff --git a/src/libpspp/encoding-guesser.h b/src/libpspp/encoding-guesser.h
     index 0a7d1f9..2e8cb9a 100644
     --- a/src/libpspp/encoding-guesser.h
     +++ b/src/libpspp/encoding-guesser.h
     @@ -1,5 +1,5 @@
      /* PSPP - a program for statistical analysis.
     -   Copyright (C) 2011 Free Software Foundation, Inc.
     +   Copyright (C) 2011, 2012 Free Software Foundation, Inc.
      
         This program is free software: you can redistribute it and/or modify
         it under the terms of the GNU General Public License as published by
     @@ -42,7 +42,9 @@
             encoding"): Requests detection whether the input is encoded in UTF-8,
             UTF-16, UTF-32, or a few other easily identifiable charsets.  When a
             particular character set cannot be recognized, the guesser falls back to
     -       the encoding following the comma.  UTF-8 detection works only for
     +       the encoding following the comma.  When the fallback encoding is UTF-8,
     +       but the input is invalid UTF-8, then the windows-1252 encoding (closely
     +       related to ISO 8859-1) is used instead.  UTF-8 detection works only for
             ASCII-compatible character sets.
      
           - NULL or "Auto": As above, with the encoding used by the system locale as
     @@ -111,7 +113,7 @@ const char *encoding_guess_head_encoding (const char *encoding,
      /* Refining an initial ASCII coding guess using later non-ASCII bytes. */
      static inline bool encoding_guess_is_ascii_text (uint8_t c);
      size_t encoding_guess_count_ascii (const void *, size_t);
     -bool encoding_guess_tail_is_utf8 (const void *, size_t);
     +int encoding_guess_tail_is_utf8 (const void *, size_t);
      const char *encoding_guess_tail_encoding (const char *encoding,
                                                const void *, size_t);
      
     diff --git a/src/libpspp/i18n.c b/src/libpspp/i18n.c
     index 9658866..c04dd5a 100644
     --- a/src/libpspp/i18n.c
     +++ b/src/libpspp/i18n.c
     @@ -769,3 +769,17 @@ is_encoding_supported (const char *encoding)
        return (create_iconv__ ("UTF-8", encoding)->conv != (iconv_t) -1
                && create_iconv__ (encoding, "UTF-8")->conv != (iconv_t) -1);
      }
     +
     +/* Returns true if E is the name of a UTF-8 encoding.
     +
     +   XXX Possibly we should test not E as a string but its properties via
     +   iconv. */
     +bool
     +is_encoding_utf8 (const char *e)
     +{
     +  return ((e[0] == 'u' || e[0] == 'U')
     +          && (e[1] == 't' || e[1] == 'T')
     +          && (e[2] == 'f' || e[2] == 'F')
     +          && ((e[3] == '8' && e[4] == '\0')
     +              || (e[3] == '-' && e[4] == '8' && e[5] == '\0')));
     +}
     diff --git a/src/libpspp/i18n.h b/src/libpspp/i18n.h
     index 383ff12..d973a81 100644
     --- a/src/libpspp/i18n.h
     +++ b/src/libpspp/i18n.h
     @@ -142,4 +142,6 @@ bool is_encoding_ascii_compatible (const char *encoding);
      bool is_encoding_ebcdic_compatible (const char *encoding);
      bool is_encoding_supported (const char *encoding);
      
     +bool is_encoding_utf8 (const char *encoding);
     +
      #endif /* i18n.h */
     diff --git a/src/libpspp/u8-istream.c b/src/libpspp/u8-istream.c
     index c111634..77c1413 100644
     --- a/src/libpspp/u8-istream.c
     +++ b/src/libpspp/u8-istream.c
     @@ -1,5 +1,5 @@
      /* PSPP - a program for statistical analysis.
     -   Copyright (C) 2010, 2011 Free Software Foundation, Inc.
     +   Copyright (C) 2010, 2011, 2012 Free Software Foundation, Inc.
      
         This program is free software: you can redistribute it and/or modify
         it under the terms of the GNU General Public License as published by
     @@ -34,6 +34,7 @@
      #include "libpspp/cast.h"
      #include "libpspp/compiler.h"
      #include "libpspp/encoding-guesser.h"
     +#include "libpspp/i18n.h"
      
      #include "gl/c-strcase.h"
      #include "gl/localcharset.h"
     @@ -120,7 +121,7 @@ u8_istream_for_fd (const char *fromcode, int fd)
          goto error;
      
        encoding = encoding_guess_head_encoding (fromcode, is->buffer, is->length);
     -  if (!strcmp (encoding, "UTF-8"))
     +  if (is_encoding_utf8 (encoding))
          is->state = S_UTF8;
        else
          {
     diff --git a/tests/libpspp/encoding-guesser.at b/tests/libpspp/encoding-guesser.at
     index a2b0aab..e969a48 100644
     --- a/tests/libpspp/encoding-guesser.at
     +++ b/tests/libpspp/encoding-guesser.at
     @@ -141,3 +141,11 @@ AT_CHECK([printf '\343\201\201\343\201\202\343\201\203\343\201\204\343\201\205\3
        [0], [UTF-8
      ])
      AT_CLEANUP
     +
     +AT_SETUP([windows-1252 as Auto,UTF-8])
     +AT_KEYWORDS([encoding guesser])
     +AT_CHECK([i18n-test supports_encodings windows-1252])
     +AT_CHECK([printf 'entr\351e' | encoding-guesser-test Auto,UTF-8 32], [0],
     +  [windows-1252
     +])
     +AT_CLEANUP
     -- 
     1.7.2.5
     
     
     _______________________________________________
     pspp-dev mailing list
     address@hidden
     https://lists.gnu.org/mailman/listinfo/pspp-dev
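
For reference, the three-way return value documented in the patch for
encoding_guess_tail_is_utf8() can be sketched in isolation as below.
This is a simplified illustration and not the patched function itself:
classify_tail() and guess_tail() are hypothetical names, the sketch
ignores the ENCODING_GUESS_MIN distinction, treats every byte below
0x80 as ASCII, and treats a sequence truncated at the end of the buffer
as invalid.

```c
#include <stddef.h>

/* Hypothetical stand-in for the tri-state check:
     -1 if the N bytes at DATA are all ASCII,
      1 if they are well-formed UTF-8 with at least one multibyte sequence,
      0 if they contain a byte sequence that is not valid UTF-8. */
static int
classify_tail (const void *data, size_t n)
{
  const unsigned char *s = data;
  int result = -1;              /* Only ASCII seen so far. */

  for (size_t i = 0; i < n; )
    {
      unsigned char c = s[i];
      unsigned long cp, min;
      size_t len;

      if (c < 0x80)
        {
          i++;
          continue;
        }
      else if ((c & 0xe0) == 0xc0) { len = 2; cp = c & 0x1f; min = 0x80; }
      else if ((c & 0xf0) == 0xe0) { len = 3; cp = c & 0x0f; min = 0x800; }
      else if ((c & 0xf8) == 0xf0) { len = 4; cp = c & 0x07; min = 0x10000; }
      else
        return 0;               /* Invalid lead byte. */

      if (n - i < len)
        return 0;               /* Truncated sequence. */
      for (size_t j = 1; j < len; j++)
        {
          if ((s[i + j] & 0xc0) != 0x80)
            return 0;           /* Missing continuation byte. */
          cp = (cp << 6) | (s[i + j] & 0x3f);
        }
      if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
        return 0;               /* Overlong, out of range, or surrogate. */

      result = 1;               /* Saw a valid multibyte sequence. */
      i += len;
    }
  return result;
}

/* Mirrors the "!= 0" test in the patched encoding_guess_tail_encoding():
   both "all ASCII" (-1) and "valid UTF-8" (1) select UTF-8, and only a
   definite mismatch (0) selects FALLBACK. */
static const char *
guess_tail (const char *fallback, const void *data, size_t n)
{
  return classify_tail (data, n) != 0 ? "UTF-8" : fallback;
}
```

In this sketch the bytes of the new test case, 'entr\351e', classify as
0, so guess_tail ("windows-1252", ...) returns "windows-1252", which
matches the expected output in the added Autotest case.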

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://keys.gnupg.net or any PGP keyserver for public key.