Re: Accent mystery
From: Robert Goulding
Subject: Re: Accent mystery
Date: Mon, 19 Feb 2024 12:46:13 -0500
Ahhhh, thank you so much (I needed to RTFM!) - R.
On Mon, Feb 19, 2024 at 12:44 PM G. Branden Robinson <g.branden.robinson@gmail.com> wrote:
> Hi Robert,
>
> At 2024-02-19T12:40:16-0500, Robert Goulding wrote:
> > To answer my own question: It seems that preconv is not guessing the
> > correct encoding from the file with a single word in it. If I specify
> > -K utf-8 everything works OK.
> >
> > preconv -v reports: GNU preconv (groff) version 1.23.0 with iconv
> > support and with uchardet support
> >
> > Is this an expected shortcoming of preconv - that if a file contains
> > just a single accented character, it won't guess it correctly? The
> > original file it failed on was a 2-page pdf, which has the word
> > kataskeuê in the middle of it.
>
> Yes. The man page says:
>
> Coding tags
> Text editors that support more than a single character encoding
> need tags within the input files to mark the file’s encoding.
> While it is possible to guess the right input encoding with the
> help of heuristics that produce good results for a preponderance of
> natural language texts, they are not absolutely reliable.
> Heuristics can fail on inputs that are too short or don’t represent
> a natural language.
> [...]
> The use of iconv means that characters in the input that encode
> invalid code points for that encoding may be dropped from the
> output stream or mapped to the Unicode replacement character
> (U+FFFD). Compare the following examples using the input “café”
> (note the “e” with an acute accent), which due to its short length
> challenges inference of the encoding used.
> printf 'caf\351\n' | LC_ALL=en_US.UTF-8 preconv
> printf 'caf\351\n' | preconv -e us-ascii
> printf 'caf\351\n' | preconv -e latin-1
> The fate of the accented “e” differs in each case. In the first,
> uchardet fails to detect an encoding (though the library on your
> system may behave differently) and preconv falls back to the locale
> settings, where octal 351 starts an incomplete UTF‐8 sequence and
> results in the Unicode replacement character. In the second, it is
> not a representable character in the declared input encoding of
> US‐ASCII and is discarded by iconv. In the last, it is correctly
> detected and mapped.
>
> Regards,
> Branden
>
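The three outcomes described in the quoted man page text can be reproduced without groff installed. The following is a minimal sketch in Python, not preconv itself: it assumes that Python's built-in decoders with the "replace" and "ignore" error handlers are a fair stand-in for what iconv does in each case, and the variable names are purely illustrative.

```python
# The byte string below is "café" as encoded in Latin-1: the accented "e"
# is the single byte 0xE9 (octal 351), exactly what the printf commands
# in the man page emit.
data = b"caf\xe9\n"

# Case 1: the input is assumed to be UTF-8 (the locale fallback).
# 0xE9 opens a three-byte UTF-8 sequence that is never completed, so the
# decoder substitutes the Unicode replacement character U+FFFD.
as_utf8 = data.decode("utf-8", errors="replace")

# Case 2: the input is declared to be US-ASCII.
# 0xE9 is outside the 7-bit range; "ignore" drops it, which mirrors the
# byte being discarded (an assumption about how closely this matches iconv).
as_ascii = data.decode("ascii", errors="ignore")

# Case 3: the input is declared to be Latin-1.
# 0xE9 maps directly to U+00E9, "e" with an acute accent.
as_latin1 = data.decode("latin-1")

for label, text in (
    ("utf-8", as_utf8),
    ("us-ascii", as_ascii),
    ("latin-1", as_latin1),
):
    print(f"{label:9} {text!r}")
```

Only the last case preserves the accented character, which is why declaring the encoding explicitly is more dependable than detection for very short inputs.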
--
Robert Goulding
Director, John J. Reilly Center for Science, Technology, and Values;
Assoc. Professor, Program of Liberal Studies,
Fellow, Medieval Institute,
University of Notre Dame.