Re: [BUG] -T html: \- rendered as something different than ASCII 45


From: G. Branden Robinson
Subject: Re: [BUG] -T html: \- rendered as something different than ASCII 45
Date: Tue, 25 Jan 2022 11:58:27 +1100
User-agent: NeoMutt/20180716

Hi Alex,

At 2022-01-24T22:13:32+0100, Alejandro Colomar wrote:
> Hi Branden,
> 
> And another html bug; however, this one seems to be a browser bug, but
> please confirm.

Maybe not.

> For the following code:
> 
> [
> .TP
> .B \(aq\-\(aq
> Empty white cell.
> ]
> 
> groff(1) generates the following HTML code:
> 
> [
> <p><b>'&minus;'</b></p></td>
> <td width="5%"></td>
> <td width="22%">
> ]
> 
> However, both firefox and chrome show something that if copy&pasted to
> a terminal is different from ASCII 45, and is longer than the proper
> minus sign.

If your system works like mine, it _is_ a "proper minus sign".

$ lynx -dump EXPERIMENTS/chess-init.6.html | sed -n '21p' | xxd
00000000: 2020 2063 6865 7373 e288 9269 6e69 740a     chess...init.

And UTF-8 E2 88 92 is...

$ unicode −
U+2212 MINUS SIGN
UTF-8: e2 88 92 UTF-16BE: 2212 Decimal: &#8722; Octal: \021022
−
Category: Sm (Symbol, Math); East Asian width: N (neutral)
Unicode block: 2200..22FF; Mathematical Operators
Bidi: ES (European Number Separator)

> Should I report a bug to firefox?

No, you're getting correct output...almost.

Mapping \- to U+2212 is wholly legitimate: in troff typesetting, \- has
meant a minus sign since 1973.

But man(7) pages are an issue.  There, a "real" minus sign is almost
never wanted.  It makes sense for the man(7) package to have a bespoke
mapping for the minus sign glyph to the basic Latin hyphen-minus on
devices that distinguish them.

I see the following in /etc/groff/man.local on my Debian system with
groff 1.22.4:

.  \" Debian: "\-" is more commonly used for option dashes than for minus
.  \" signs in manual pages, so map it to plain "-" for HTML/XHTML output
.  \" rather than letting it be rendered as "&minus;".
.  ie '\*[.T]'html' \
.    char \- \N'45'
.  el \{\
.    if '\*[.T]'xhtml' \
.      char \- \N'45'
.  \}

Debian shouldn't have to do that; groff should, and moreover should move
this character definition into the an.tmac file and apply it to the utf8
groff output device as well, not just (x)html.

There is a related bug in that groff's html device maps regular '-' to
the basic Latin hyphen-minus when it should become the HTML &hyphen;
entity instead.

Here's partial output from a slightly modified version of your page.

$ lynx -dump EXPERIMENTS/chess-init.6.html | sed -n '17p' | xxd
00000000: 2020 2063 6865 7373 e288 9269 6e69 7420     chess...init 
00000010: e288 9220 696e 6974 6961 6c69 7a65 2061  ... initialize a
00000020: 2063 6865 7373 2067 616d 6520 666f 7220   chess game for 
00000030: 796f 7572 206d 6f74 6865 722d 696e 2d6c  your mother-in-l
00000040: 6177 0a                                  aw.

That's wrong, but I understand why it happened.

If a man(7) page author truly wants a Unicode minus sign--perhaps for an
expansion of the unicode(7) page--they can obtain it with a special
character escape sequence: \[u2212].

So this is a bug, too: the grohtml output device needs to map - to
&hyphen;, \- to &minus; and groff's an.tmac needs to override that
mapping of \- to point it at \N'45'.
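
To make the three cases concrete, here is a small test page one could
render.  It is a hypothetical page for illustration only; the name
"demo" and its contents are made up, and it asserts nothing about what
any given groff version emits.

```roff
.\" Hypothetical page, for illustration only.
.TH demo 1 2022-01-25 "example"
.SH Name
demo \- show the three dash-like inputs
.SH Options
.TP
.B \-v
Escaped dash; a man(7) page wants ASCII 45 here.
.TP
.B mother-in-law
Unescaped dash; a true hyphen on devices that have one.
.TP
.B \[u2212]
Explicit U+2212 MINUS SIGN, requested by code point.
```

Formatting it with "groff -man -T html" or "groff -man -T utf8" and
inspecting the result with xxd, as in the dumps above, shows which bytes
each input actually produces.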

In groff Git HEAD, we have this in an.tmac:

.\" === Define/remap characters. ===
.
.\" For UTF-8, map the minus sign to the hyphen-minus to facilitate
.\" copy and paste of code examples, file names, and URLs embedding it.
.if '\*[.T]'utf8' \{\
.  char \- \N'45'
.  char  - \N'45'
.\}

As a related matter I would kill the second 'char' request (remapping
the unescaped input dash).  The first should be done not just for
'utf8', but 'html' and 'xhtml' as well.
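
A minimal sketch of that change, assuming it lives in the same
"Define/remap characters" part of an.tmac; this is not the committed
code, only the shape I have in mind:

```roff
.\" Sketch only, not the actual an.tmac change.
.\" Map the minus sign to the hyphen-minus on each output device that
.\" distinguishes the two, so that option dashes, file names, and URLs
.\" copy and paste as ASCII 45.  The unescaped "-" is left alone.
.if '\*[.T]'utf8'  .char \- \N'45'
.if '\*[.T]'html'  .char \- \N'45'
.if '\*[.T]'xhtml' .char \- \N'45'
```

With something like this in place, the Debian-specific lines in
man.local would no longer be needed.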

Would you like to file this one as well?

Regards,
Branden


