Re: Accent mystery

groff

From:	Robert Goulding
Subject:	Re: Accent mystery
Date:	Mon, 19 Feb 2024 12:46:13 -0500

Ahhhh, thank you so much (I needed to RTFM!) - R.

On Mon, Feb 19, 2024 at 12:44 PM G. Branden Robinson <g.branden.robinson@gmail.com> wrote:

> Hi Robert,
>
> At 2024-02-19T12:40:16-0500, Robert Goulding wrote:
> > To answer my own question: It seems that preconv is not guessing the
> > correct encoding from the file with a single word in it.  If I specify
> > -K utf-8 everything works OK.
> >
> > preconv -v reports: GNU preconv (groff) version 1.23.0 with iconv
> > support and with uchardet support
> >
> > Is this an expected shortcoming of preconv - that if a file contains
> > just a single accented character, it won't guess it correctly? The
> > original file it failed on was a 2-page pdf, which has the word
> > kataskeuê in the middle of it.
>
> Yes.  The man page says:
>
>    Coding tags
>      Text editors that support more than a single character encoding
>      need tags within the input files to mark the file’s encoding.
>      While it is possible to guess the right input encoding with the
>      help of heuristics that produce good results for a preponderance of
>      natural language texts, they are not absolutely reliable.
>      Heuristics can fail on inputs that are too short or don’t represent
>      a natural language.
> [...]
>      The use of iconv means that characters in the input that encode
>      invalid code points for that encoding may be dropped from the
>      output stream or mapped to the Unicode replacement character
>      (U+FFFD).  Compare the following examples using the input “café”
>      (note the “e” with an acute accent), which due to its short length
>      challenges inference of the encoding used.
>             printf 'caf\351\n' | LC_ALL=en_US.UTF-8 preconv
>             printf 'caf\351\n' | preconv -e us-ascii
>             printf 'caf\351\n' | preconv -e latin-1
>      The fate of the accented “e” differs in each case.  In the first,
>      uchardet fails to detect an encoding (though the library on your
>      system may behave differently) and preconv falls back to the locale
>      settings, where octal 351 starts an incomplete UTF‐8 sequence and
>      results in the Unicode replacement character.  In the second, it is
>      not a representable character in the declared input encoding of US‐
>      ASCII and is discarded by iconv.  In the last, it is correctly
>      detected and mapped.
>
> Regards,
> Branden
>
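
The three outcomes described in the quoted man page excerpt can be reproduced outside groff. The following is only a rough sketch using Python's built-in codecs, not preconv or iconv themselves. The error handlers ("replace", "ignore") are my assumption about how best to imitate the behavior the man page describes, and the variable names are made up for illustration.

```python
# Byte 0xE9 (octal 351) is "e" with an acute accent in Latin-1,
# but in UTF-8 it is the lead byte of a three-byte sequence.
raw = b"caf\xe9\n"

# Case 1: treat the input as UTF-8. The sequence is incomplete, so the
# decoder substitutes the Unicode replacement character U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

# Case 2: treat the input as US-ASCII. Byte 0xE9 is not representable;
# "ignore" drops it, imitating the discard described in the man page.
as_ascii = raw.decode("ascii", errors="ignore")

# Case 3: treat the input as Latin-1. Every byte maps to a code point,
# so the accented character survives.
as_latin1 = raw.decode("latin-1")

for label, text in (("utf-8", as_utf8), ("ascii", as_ascii), ("latin-1", as_latin1)):
    print(f"{label:8} {text!r}")
```

Naming the encoding explicitly, as with -K utf-8 above, sidesteps the guessing step altogether, which is why it helps on very short inputs.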


-- 
Robert Goulding
Director, John J. Reilly Center for Science, Technology, and Values;
Assoc. Professor, Program of Liberal Studies,
Fellow, Medieval Institute,
University of Notre Dame.
