Re: 1.23 prints some strange error
From: Walter Alejandro Iglesias
Subject: Re: 1.23 prints some strange error
Date: Wed, 25 Oct 2023 14:25:42 +0200
On Wed, Oct 25, 2023 at 05:03:36AM -0500, G. Branden Robinson wrote:
> Hi Walter & Dave,
>
> At 2023-09-11T19:45:30+0200, Walter Alejandro Iglesias wrote:
> > If instead of sourcing hyphen.tr from my macros with .mso I source it
> > directly from the roff document with .so those error messages
> > disappear.
>
> As Dave mentioned, this is explained by soelim(1) not being run on the
> "macro sourced" file. As a rule, I think files to be read with the
> `mso` request should be in plain ASCII only. The whole point of a macro
> file suitable for general use is that it...gets used generally, which
> means that documents employing a variety of input encodings might employ
> it. You therefore should use the lowest common denominator character
> encoding for it: ASCII. (Strictly, ISO 646:1991-IRV.)
>
> That doesn't mean you have to do much more work or spend a lot of time
> staring at groff_char(7) and learning the special character identifiers
> for the upper half of ISO 8859-1. You can still have your macro sourced
> file in Latin-1; just run preconv over it stand-alone as a converter.
>
> $ printf '.ds aunt la t\\355a\n' > family.mso.in
> $ preconv -e latin1 family.mso.in > family.mso
>
> Part of the preconv(1) man page is likely worth reviewing.
>
> iconv support
> [...]
> The use of iconv means that characters in the input that encode
> invalid code points for that encoding may be dropped from the
> output stream or mapped to the Unicode replacement character
> (U+FFFD). Compare the following examples using the input “café”
> (note the “e” with an acute accent), which due to its short
> length challenges inference of the encoding used.
> printf 'caf\351\n' | LC_ALL=en_US.UTF-8 preconv
> printf 'caf\351\n' | preconv -e us-ascii
> printf 'caf\351\n' | preconv -e latin-1
> The fate of the accented “e” differs in each case. In the first,
> uchardet fails to detect an encoding (though the library on your
> system may behave differently) and preconv falls back to the
> locale settings, where octal 351 starts an incomplete UTF‐8
> sequence and results in the Unicode replacement character. In
> the second, it is not a representable character in the declared
> input encoding of US‐ASCII and is discarded by iconv. In the
> last, it is correctly detected and mapped.
> [...]
> Limitations
> preconv cannot perform any transformation on input that it cannot
> see. Examples include files that are interpolated by
> preprocessors that run subsequently, including soelim(1); files
> included by troff itself through “so” and similar requests; and
> string definitions passed to troff through its -d command‐line
> option.
>
> Maybe I should add my admonition above about macro-sourced files to this
> man page.
>
> At 2023-09-12T11:16:58+0200, Walter Alejandro Iglesias wrote:
> > I cleaned up the quoted text a bit to make room for the following. Here
> > we go:
> >
> > $ uname -a
> > Linux bell 6.4.0-4-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.4.13-1
> > (2023-08-31) x86_64 GNU/Linux
> > $ groff --version | head -1
> > GNU groff version 1.23.0
> > $ mkdir test
> > $ cd test
> > $ cat << EOF > doc.tr
> > .mso list.tr
> > EOF
> > $ cat << EOF > list.tr
> > .hw a-hí
> > .hw a-ño
> > .hw ár-bol
> > .hw cu-brí-a
> > .hw e-té-re-o
> > .hw ca-mión
> > .hw ú-te-ro
> > .hw pin-güi-no
> > EOF
> > $ GROFF_TMAC_PATH=. nroff doc.tr
> > troff:./list.tr:1: error: expected ordinary or special character, got an
> > escaped '%'
> > troff:./list.tr:4: error: expected ordinary or special character, got an
> > escaped '%'
>
> This transcript isn't as useful as it could be, because it didn't
> disclose to me what character encoding was used for list.tr on the file
> system. Running the file(1) command on it and sharing that would help.
I believe I've said several times that list.tr is a UTF-8 file. And I
wouldn't trust file(1) on that.
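For what it's worth, a check that relies less on guessing than file(1)
is a round trip through iconv(1): converting from UTF-8 to UTF-8 fails
on any byte sequence that is not valid UTF-8. This is only a sketch.
The helper name is_utf8 and the two file names are made up for the
example, and note that a pure ASCII file also passes, since ASCII is a
subset of UTF-8.

```shell
# is_utf8 FILE: succeed only if every byte sequence in FILE is valid UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Two hypothetical test files holding the same .hw line.
printf '.hw a-h\303\255\n' > list_utf8.tr    # i-acute as UTF-8 (C3 AD)
printf '.hw a-h\355\n' > list_latin1.tr      # i-acute as Latin-1 (ED)

is_utf8 list_utf8.tr && echo "list_utf8.tr: valid UTF-8"
is_utf8 list_latin1.tr || echo "list_latin1.tr: not valid UTF-8"
```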
>
> > As you see, from the UTF-8 chars used in Spanish (á, é, í, ó, ú, ü,
> > ñ), groff seems to only have problems with the 'í' in particular.
> > Let's try another test using preconv(1).
>
> preconv is probably using iconv(3) on your system ("preconv --version"
> will tell you). iconv's heuristics for guessing the encoding are opaque
> to groff (and to me).
On OpenBSD, preconv (1.22.4) is compiled without iconv.
I had to downgrade Devuan to stable, which comes with groff 1.22.4 and
a preconv compiled *with* iconv. I cannot reproduce the bug there. So
this has all the signs of a regression. In your place, I'd try to
figure out which change between 1.22.4 and the current version
introduced it.
I know that my bug report isn't as helpful as it could be, but right now
I'm busy with other things, sorry.
>
> > The errors remain. Finally, I told you that changing .mso request to
> > .so made the error messages disappear, that's because in my Makefile I
> > run soelim(1) before. Last test:
> >
> > $ cat << EOF > doc.tr
> > .hla es
> > .so list.tr \" notice here I changed the request
> > Ahí, el árbol nos cubría con su sombra.
> > Un pingüino pasaba caminando por la playa.
> > EOF
> > $ preconv -e UTF-8 doc.tr | nroff | cat -s
> > troff:./list.tr:1: error: expected ordinary or special character, got an
> > escaped '%'
> > troff:./list.tr:3: error: expected ordinary or special character, got an
> > escaped '%'
> > Ahí, el árbol nos cubría con su sombra. Un pingüino pasaba cami‐
> > nando por la playa.
> > $ soelim doc.tr | preconv -e UTF-8 | nroff | cat -s
> > Ahí, el árbol nos cubría con su sombra. Un pingüino pasaba cami‐
> > nando por la playa.
> >
> > This last command throws no error, that's because soelim(1) allows
> > preconv(1) to process the list.tr file.
>
> Right, I think that's the right strategy precisely. You can maintain
> the file you want to `mso` in version control in whatever character
> encoding is comfortable for you--I'd store it as an ".in" file and have
> make(1) run preconv(1) over it when constructing documents that use it.
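A minimal sketch of that build step, written as a shell function that a
Makefile rule could call. The ".in" suffix is the convention suggested
above; the function name build_mso and the assumption that the sources
are UTF-8 are mine, so adjust the -e argument to whatever the files
really are.

```shell
# build_mso: convert every *.in macro source in the current directory
# into an ASCII-only file that is safe to read with the mso request.
# Assumes the sources are UTF-8.
build_mso() {
    for src in *.in; do
        [ -e "$src" ] || continue    # no match: the glob stays literal
        preconv -e utf-8 "$src" > "${src%.in}"
    done
}
```

Running build_mso in a directory containing family.mso.in would write
family.mso next to it, with non-ASCII characters rewritten as \[uXXXX]
escapes.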
>
> > Anyways. My doubt comes from the fact that so far (with groff 1.22.4
> > under OpenBSD) I haven't needed to preprocess that .hw list with
> > preconv,
>
> OpenBSD is notoriously minimalistic. You might see if `preconv
> --version` there reports use of iconv...except...uh, I think revealing
> that information is something I added _after_ the groff 1.22.4 release.
Answered above.
>
> So here's another paragraph from preconv(1) that might explain the
> behavior on OpenBSD.
>
> iconv support
> While preconv recognizes all of the coding tags listed above, it
> is capable on its own of interpreting only three encodings:
> Latin‐1, code page 1047, and UTF‐8. If iconv support is
> configured at compile time and available at run time, all others
> are passed to iconv library functions, which may recognize many
> additional encoding strings. The command “preconv -v” discloses
> whether iconv support is configured.
>
> Unfortunately I don't know of an example of an encoding name that is a
> reliable test for iconv support being absent.
>
> > and that only the 'í' (iacute) triggers the error.
>
> I think this might be explained by iconv(3)'s heuristic approach.
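One byte-level observation that may bear on why only iacute is
affected: of the letters in that list, "í" is the only one whose second
UTF-8 byte is 0xAD, which is the soft hyphen in Latin-1. My
understanding, which nobody in this thread has confirmed, is that GNU
troff treats an input 0xAD as the hyphenation escape \%. If list.tr
reaches troff without passing through preconv, that would fit the
"escaped '%'" in the message. The sketch below only prints the bytes;
the helper name hex_of is made up.

```shell
# hex_of ESCAPES: print the bytes that printf produces for ESCAPES as
# lowercase hex without separators.
hex_of() {
    printf "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

# Each entry is NAME:UTF-8 bytes written as printf octal escapes.
for pair in 'aacute:\303\241' 'eacute:\303\251' 'iacute:\303\255' \
            'oacute:\303\263' 'uacute:\303\272' 'udieresis:\303\274' \
            'ntilde:\303\261'; do
    name=${pair%%:*}
    bytes=${pair#*:}
    printf '%-10s %s\n' "$name" "$(hex_of "$bytes")"
done
```

Every letter starts with c3, and only the iacute line ends in ad.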
>
> On my system, I confirmed that nothing crazy was going on with the
> following experiments.
>
> $ printf 'caf\351\n' | preconv -e latin1
> .lf 1 -
> caf\[u00E9]
> $ printf 'la t\355a\n' | preconv -e latin1 | nroff | head -n 1
> la tía
> $ printf 'la t\355a\n' | nroff -K latin1 | head -n 1
> la tía
> $ printf 'la t\355a\n' | nroff | head -n 1
> la tía
>
> At 2023-10-05T10:45:32+0200, Walter Alejandro Iglesias wrote:
> > If I feed preconv with a file already in latin1 (using UTF-8 locales
> > here) ...
> >
> > $ preconv -e utf8 list_in_latin1.tr
> >
> > ... *all* non ASCII characters in the output are replaced by \[uFFFD].
>
> Yes, because the `-e` flag _describes the character encoding of the
> input_.
>
> Description
> preconv reads each file, converts its encoded characters to a
> form troff(1) can interpret, and sends the result to the standard
> output stream.
> [...]
> Options
> [...]
> -e encoding
> Skip detection and assume encoding; see groff’s -K option.
>
> Do not try to tell preconv the desired character encoding of the
> _output_; that's not its job. Its job is to normalize the input so that
> GNU troff(1) can read it.
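The same point can be seen without groff at all, using iconv(1), which
also needs to be told what the input is. This is a sketch with a
made-up file name: declared correctly as Latin-1, the byte 0xED comes
out as the UTF-8 pair C3 AD; declared as UTF-8, the same byte is an
invalid sequence and the conversion fails.

```shell
# A Latin-1 "la tia" with i-acute: the byte 0xED (octal 355).
printf 'la t\355a\n' > tia_latin1.txt

# Input encoding declared correctly: 0xED becomes C3 AD.
iconv -f ISO-8859-1 -t UTF-8 tia_latin1.txt | od -An -tx1

# Input encoding declared wrongly as UTF-8: 0xED followed by "a" is not
# a valid sequence, so iconv exits with a non-zero status.
if ! iconv -f UTF-8 -t UTF-8 tia_latin1.txt > /dev/null 2>&1; then
    echo "tia_latin1.txt is not UTF-8; do not describe it as such"
fi
```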
>
> The character encoding of the output is inapplicable to GNU troff(1)
> itself; it, like all device-independent troffs, writes an ASCII-encoded
> plain text file. An output driver like grotty(1) translates troff(1)
> output into whatever is appropriate for the device, which is why groff's
> terminal output devices are named things like "ascii", "latin1" and
> "utf8".
>
> At 2023-10-12T16:46:07-0500, Dave Kemper wrote:
> > On 10/5/23, Walter Alejandro Iglesias <wai@roquesor.com> wrote:
> > > If I feed preconv with a file already in latin1 (using UTF-8 locales
> > > here) ...
> > >
> > > $ preconv -e utf8 list_in_latin1.tr
> > >
> > > ... *all* non ASCII characters in the output are replaced by \[uFFFD].
> >
> > Yes, this would be expected to not work. preconv's "-e" option
> > specifies the *input* encoding. So if the input file is in Latin-1,
> > but you tell preconv that it's in UTF-8, you'd expect things to go
> > awry.
>
> Right.
>
> > But that's not the full explanation: *all* Latin-1 characters are
> > multiple bytes when encoded as UTF-8.
>
> Strictly, Latin-1 is an 8-bit character encoding. You might say here
> "all characters from the Unicode Latin-1 extension block" instead.
>
> Ya know, if you're a stickler.
>
> > So if iacute (Latin-1 0xED) is misread in the way Bjarni describes,
> > the same should happen to all the other Latin-1 characters as well.
> > The fact groff is treating one Latin-1 character differently from the
> > others carries the whiff of a bug.
>
> I'm prepared to chalk this up to iconv heuristic conversion in the
> absence of other information. See my attempted reproducers above.
>
> Regards,
> Branden
--
Walter