Re: 1.23: UTF-8 device produces mysterious characters

groff

From:	G. Branden Robinson
Subject:	Re: 1.23: UTF-8 device produces mysterious characters
Date:	Tue, 13 Sep 2022 00:34:00 -0500

Hi Steffen,

At 2022-09-12T23:41:34+0200, Steffen Nurpmeso wrote:
> This is not a hyphenated word.
[rearranging this a bit]
> En dash would look nice, i could imagine.

Then use en dashes in your input.

  on\[en]loop\[en]main\[en] tick

The "en" special character identifier is not portable back to Ossanna
troff, but I will guess that that's not a major concern of yours.  Any
descendant of AT&T device-independent troff can define new special
characters in its font description files, even just to make them aliases
of existing glyphs.

>  |2.  This is not a "1.23"-specific issue as your subject line[]
>  |suggests.
>  |
>  |$ groff --version | head -n 1
>  |GNU groff version 1.22.4
> 
> Ok.. this i did not know.  Until last week i was solely using
> 1.22.3, even if the system has 1.22.4 (just not for me).

I don't _think_ this was a change in groff 1.22.4, either, but it's not
easy for me to run groff 1.22.3 or earlier to experiment on them.

>  |3.  If you're secretly in a man page context but didn't disclose
>  |    that, then, yes, this is a change from groff 1.22.4.  The
>  |    hyphen-minus, neutral apostrophe, and grave accent no longer map
>  |    differently for man(7) and mdoc(7) than for any other macro
>  |    package.  (\- still does and there is no prospect of that
>  |    changing, since there is no *roff special character defined for
>  |    the "ASCII hyphen-minus", and it is essential to express this
>  |    precise character in man pages.  These issues have been
>  |    discussed at some length on this mailing list over the past
>  |    three years.)
> 
> Really.  The above is just wrong, Branden.  Who said such?

The consensus of this list was that it is better typography overall.
Several people did question the wisdom of removing groff's override of
the normal rules of character to glyph mapping specifically for man
pages, since man page authors have gotten used to the temporary
workaround that was put in place several years ago while everyone got
accustomed to the novelty of rendering man pages on UTF-8-capable
terminals.

The groff 1.23 documentation addresses this issue, with regrettably
necessary redundancy, in several places: groff(7), groff_char(7),
groff_man_style(7), and the PROBLEMS file.  I will quote the last.

----------------------------------------------------------------------

* When viewing man pages, some characters on my UTF-8 terminal emulator
  look funny or copy-and-paste wrong.  Why?

Some Unicode Basic Latin ("ASCII") input characters are mapped to
non-Basic Latin code points in output for consistency with other output
devices, like PDF.  See groff_man_style(7) and groff_char(7) for correct
input conventions and background.  If you use the correct groff special
character escape sequences to input them, you will get correct output no
matter what device the input is formatted for.

However, many man pages are written in ignorance of the correct special
characters to obtain the desired glyphs.  You can conceal these errors
by adding the following to your site-local man(7) configuration.  The
file is called "man.local"; its installation directory depends on how
groff was configured when it was built.

--- start ---
.if '\*[.T]'utf8' \{\
.  char ' \[aq]
.  char - \-
.  char ^ \[ha]
.  char ` \[ga]
.  char ~ \[ti]
.\}
--- end ---

You may also wish to do the same for "mdoc.local".

In man pages (only), groff maps the minus sign special character '\-' to
the Basic Latin hyphen-minus (U+002D) because man pages require this
glyph and there is no historically established *roff input character,
ordinary or special, for obtaining it when a hyphen and minus sign are
both separately available.  To obtain a true minus sign, use the special
character escape sequences '\(mi' or '\[mi]'.

----------------------------------------------------------------------

However, it turns out the above might not be squarely relevant to your
situation after all; if what you need are en dashes, then no version of
groff _ever_ mapped '-' or '\-' to an en dash.  It may have mapped them
to the same glyph _as_ en dash, due to a limited glyph repertoire in the
output device, but that's not the same thing.  Claiming that it is would
be like claiming that \[lq] and \[rq] are "really" the same as \[dq].
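
To check which characters actually reach a UTF-8 terminal, it can help
to name the code points involved.  The following is a minimal sketch in
Python rather than roff.  The table is an illustrative summary, not
something groff ships: the U+2010 row (ordinary input '-') and the
U+2212 row (\[mi]) are assumptions about -Tutf8 output, and
groff_char(7) remains the authoritative reference.

```python
import unicodedata

# Hypothetical summary table: roff input -> code point expected from -Tutf8.
# The U+2010 and U+2212 rows are assumptions; groff_char(7) is authoritative.
DASH_LIKE = {
    "-": "\u2010",                # ordinary input hyphen
    r"\[en]": "\u2013",           # en dash special character
    r"\[mi]": "\u2212",           # true minus sign
    r"\- (man pages)": "\u002d",  # hyphen-minus, per the PROBLEMS excerpt
}


def describe(ch):
    """Return the code point, Unicode name, and UTF-8 length of one character."""
    return "U+%04X %s, UTF-8 length %d" % (
        ord(ch), unicodedata.name(ch), len(ch.encode("utf-8")))


for roff_input, ch in DASH_LIKE.items():
    print("%-16s %s" % (roff_input, describe(ch)))
```

Four distinct names come back, which is the point: glyphs that look
alike on a given device are still different characters.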

> You cannot use HYPHEN for the above.

...then ask for the special character you want.  An input '-' is a
hyphen and has been since typesetter roff in 1973.

> Hyphen-minus itself, less-than, greater-than, no-break space,
> LEFT-POINTING DOUBLE ANGLE QUOTATION MARK, only to go until 0xAB.

I do not comprehend this remark.

> Or standard names like IEEE Std 1003.1™-2017, IEEE Std 1003.1-2008,
> C-language, code-level, POSIX.1-2017, built-in, this is only the first
> page of that standard.  Or the ISO C17 standard, you search for "-" in
> the official PDF, and you find it for Storage-class, absolute-value,
> floating-point, type-generic, thread-specific, and more, and we are
> still in the TOC.  No no -- no HYPHEN here!

I have to assume you are talking about PDF documents, since as far as I
know there is no official plain text version of any of these.

As I understand it, PDF has a feature that lets you render one glyph
that matches another for text search purposes.  This is known as a CMap
resource, I think.

https://wiki.pdftalk.de/doku.php?id=cmap

Deri James can probably speak with more expertise here.

If your concern is with groff's PDF output, then let's update the
Subject line.

> These are _not_ hyphenated words.

Then don't separate their components with hyphens!

> If roff can make a difference in true hyphenation points (i had to
> take a loooong look), then it could change a hyphen-minus on the
> input side with a hyphen on the output side when it really breaks
> a line at that point.

You can indeed prepare groff input in this way.

Here is a demonstration.  Format it for -Tutf8.

.ll 10n
.na
.\"cflags 4 \[en]
on\[en]\:loop\[en]\:main\[en] antidisestablishmentarianism
.pl \n[nl]u

You can do without the \: escape sequences if you uncomment the `cflags`
request.  You will see distinct hyphens and en dashes in the output.

> Otherwise hyphen-minus is the only viable alternative.

That's a counterfactual hypothesis, so let's not worry about it.

> Or look at the Unicode standard, where real great minds with
> incredible multi-national professional life careers are involved,
> get the official PDF (hr-hrm, i have not updated since Unicode
> 13..), combined words are separated with hyphen-minus, _not_
> hyphen.

I am dubious of this claim.  I would like to see how you verified it.  I
hope it wasn't by using a search box in a PDF viewer; that is not valid
reasoning thanks to the CMap feature.

A document author wanting to "combine" words can do so using whatever
glyph they like.  groff may require additional configuration, as with
the `cflags` request, to support unorthodox choices.

I have seen some sources in English typography claim that the
conjunction of multiple words that might themselves be subject to
hyphenation should use an en dash instead of an explicit hyphen to join
them.  However, by default groff does _not_ break after en dashes.  I
don't know why this is the case; it has been true for a long time.  I
can research it, but my guess is 20+ years, and possibly all the way
back to James Clark.  So, again, you would be protesting features of
groff that are not recent changes, and, again, the "1.23" prefix in your
email's subject line would be misleading.

> This is really wrong.

I'm sorry, but I don't think you've done your homework here.

>  |4. "on-main-loop-tick" doesn't look a natural language word to
>  |   me--it looks like an identifier in a programming language (maybe
>  |   some dialect of Lisp).  If that is the case, those hyphens need
>  |   to be spelled "\-" in the source code.  This has always been true
>  |   in man
> 
> Well, yes and no.  Hyphen is just everywhere in 1.23.

This claim is so informal as to be dangerously vague, but insofar as I
can interpret it--NO, it isn't.  Read the groff_char(7) man page in a
UTF-8 terminal and look at the glyph tables.

> Yeees, well, i really had to look you know.  This is a language
> and there was development and it was a lot of woolding.
> 
>   -.th MAIL I 10/25/72
>   -.sh NAME
>   -mail  \*-  send mail to another user

I'm well aware of this.  The `-` string was a typesetter roff-era
(1972/1973) innovation to cope with the fact that the hyphen and the
minus were distinct glyphs on the Graphic Systems C/A/T.  The "History"
section of the 1.23 groff_char(7) man page discusses this.

The man(7) macro package, introduced in 1979, did _not_ define such a
string.  It defined strings `R` and `S` as part of its interface, and
several for internal purposes that started with '['.  The groff_man(7)
man page also discusses the strings that are available for document
composition; all are deprecated, and none render a hyphen or dash.

> Who says it is not an evolution of the above?

This question is too vague for me to interpret and, to the extent that
it suggests an evolution _from_ groff in 2022 or man(7) in 1979 _to_ man
pages in 1972, incoherent.

> Doug McIlroy is on this list, maybe he reads and knows.

Knows what?  I dare say you're more likely to elicit his input if you
ask a question that is specific, carefully worded, and in idiomatic
English.

> Though he said something about the NATO today, and that lying
> aggressive Endsieg beast is definetely on the other side of the
> road.

I enjoy a humorous aside as much as anyone, perhaps more than most, but
this, and the stuff I elided about feminism and the revolutions of 1848,
falls flat when your audience has no idea what you're talking about.[2]
Also, if Doug said something about NATO in some forum other than this
one, how are people on this list supposed to know about it?

> And by the way, you mention flags in the above.  Flags are
> different, because often you want this to be a U+2013 EN DASH.

Certainly not for copy and paste purposes!

> Ie, you want to make it _longer_ than a hyphen-minus.

A *roff document author has limited control over the fonts that will be
used to render the document.  I don't know that the length of a
"hyphen-minus"[3] relative to hyphens and en dashes is codified
anywhere, and even if it were, I don't know of a conformance suite that
encourages font designers to adhere to any such rule.

Common practice appears to be to give the hyphen-minus a length that
enables it to be interpreted as either a hyphen or a minus sign (big
surprise).

> Not super short like a hyphen.  Imho.

How is this not already the case?

>  |5.  Searching is not impossible.
>  |    5a. Searching for a word that is broken and hyphenated across
>  |        lines is no more impossible than it always was.  On
>  |        occasions when I have to do this, I break out sed(1) or
>  |        perl(1).
> 
> It is not hyphenated, Branden.

As I said above, if en dashes are what you want, you can get them.  You
have to ask for them.  `-` is not a magical character that means "give
me the dash-like character that I'm thinking of".  groff is not a DWIM
system.

>  |    5b. Literals that might be of interest in man pages should be
>  |        entered with hyphenation suppressed in the input.
> 
> Hey!  This is not rocket science or something.
> I am happy if people at least do _write_ manuals _at_all_.

Me too.  I am happier if they are receptive to suggestions for correct
practice once they have completed an initial draft.  That includes
knowing when to use \% and \:.  See groff 1.23's groff_man_style(7) man
page, section "Portability".

>  |    5e. For me, anyway, searching within less(1) using the pattern
>  |        with a dot where the hyphen goes works fine, even though
>  |        there are 3 bytes in the input stream instead of one.
>  |        Evidently less(1) is
> 
> Fuzzy-search code-wise? ;)

My hypothesis is that less(1) treats '.' as standing for any single code
point rather than any single byte in the input stream.  (I have no idea
what it does about code points with combining semantics and dread
finding out.)
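
That hypothesis can be illustrated outside less(1).  A minimal sketch in
Python, assuming the hyphen in the formatted page is U+2010: a '.'
applied to decoded text matches the whole code point, while the same
pattern applied to the raw UTF-8 bytes fails because the hyphen occupies
three bytes.  This only models the behavior; it says nothing about how
less(1) is actually implemented.

```python
import re

# Assumption: the formatted text contains U+2010 HYPHEN where the input had '-'.
text = "line\u2010ending"
raw = text.encode("utf-8")

# On decoded text, '.' stands for one code point, so the pattern matches.
matches_text = re.search(r"line.ending", text) is not None

# On raw bytes, '.' stands for one byte; the 3-byte hyphen defeats it.
matches_bytes = re.search(rb"line.ending", raw) is not None

# Allowing exactly three bytes restores the match at the byte level.
matches_three = re.search(rb"line.{3}ending", raw) is not None

print(matches_text, matches_bytes, matches_three)
```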

>  |        smart enough.  For instance, I can match "line-ending" in
>  |        the roff(7) page while paging it with "groff -Tutf8 -man |
>  |        less -R" by entering "/line.ending" within less(1).
>  |
>  |I hope this clears some things up.
> 
> Certainly not for me.  Hyphen is good at the end of line when
> a word is hyphenated, otherwise it is misplaced.

This is not universally true, and emphatically not the case in English
prose.

> And using hyphen to combine words is wrong.

This claim is overstated.  But it's also largely irrelevant.  The text
formatter groff doesn't "combine words" except as directed by the user.
If you don't want them combined with hyphens, _don't combine them using
the input character that is defined to produce a hyphen in output_.

> En dash would look nice, i could imagine.

I reiterate: that option is available to you; see above.

Regards,
Branden

[1] The V4 Unix Programmer's Manual contains the first troff(1) man
    page, dated 15 January 1973, but as Steffen's exhibit suggests, the
    program may have been in use months prior to that date.  Or the
    author of the mail(1) man page forgot to update the revision date
    when the `-` string was introduced.

[2] I try, not always successfully, to put my more esoteric quips into
    footnotes.

[3] The "hyphen-minus" was, I gather, an entity unknown to typographers
    in, say, 1970.  It exists because early computer character encoding
    standards, like ASCII, had limited glyph repertoires and overloaded
    many glyphs.  Once again, you can read more about this, with
    pointers to further reading, in groff 1.23's groff_char(7).
