Re: Do Latin-2-based hyphenation files work with Unicode?


From: G. Branden Robinson
Subject: Re: Do Latin-2-based hyphenation files work with Unicode?
Date: Wed, 13 Nov 2024 12:25:36 -0600

Hi onf,

At 2024-11-13T18:22:43+0100, onf wrote:
> On Wed Nov 13, 2024 at 4:36 PM CET, G. Branden Robinson wrote:
> > [...]
> > > Latin1 characters continue working even when loading Latin2 as
> > > long as they are specified as the respective UTF-8 codes.
> >
> > And they _should_ continue to hyphenate at appropriate locations
> > because a set of hyphenation codes is associated with the
> > hyphenation _language code_ ("en", "cs", "fr", etc.), which can
> > change from environment to environment.
> > [...]
> 
> Ah yeah, I forgot about the functionality of .hla. I wonder if this
> works even when each language uses a different encoding though, given
> that .tr and .trin are specified like so in groff(7):
>   .tr abcd...
>       Translate ordinary or special characters a to b, c to d, and
>       so on prior to output.
> 
>    .trin abcd...
>       As .tr, except that .asciify ignores the translation when a
>       diversion is interpolated.

These are good questions.  I have vague plans to make `.tr` character
translations environment-specific, because they have historically been
used in a way that presumes they are, even though they aren't.

There's a legacy document--I'll have to scare it up--that invokes `tr`
to temporarily apply character translations to some body text, and ends
up screwing up a page heading because a page break occurred in the
region of body text.  This was clearly not intended by the document
author.  And without jumping through hoops (defining a register that
means "I'VE SET UP TRANSLATIONS" and having the header trap check it and
undo the translations if necessary, which in turn is pretty challenging
when using an existing macro package like ms/me/man/mm/mdoc), confining
the translations to the body text is just not feasible, because
character translations are _global_ in troff.
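
For a document that uses no macro package, those hoops might look
roughly like the sketch below.  The register name `trans-on`, the macro
names, and the title text are all hypothetical, and this has not been
tried against any of the packages named above.

.\" Sketch only; every name here is made up for illustration.
.nr trans-on 0
.de TRON
.  tr -\-@\(rs
.  nr trans-on 1
..
.de TROFF
.  tr --@@
.  nr trans-on 0
..
.de HEADER
.  \" Undo the translations while the heading is set, then redo them.
.  if \\n[trans-on] .tr --@@
.  tl 'Title''%'
.  if \\n[trans-on] .tr -\-@\(rs
..
.wh 0 HEADER

The body text would call `TRON` and `TROFF` instead of `tr` directly, so
that the header trap can tell whether translations are in effect.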

groff's character definitions (.char and friends), by contrast, I
believe should remain global.

`.trin` I think has a somewhat different application.  It's _only_
used, it seems, to cope with the input character encoding not being ISO
8859-1.  Possibly the request can be retired when GNU troff uses a wider
character type internally and assumes UTF-8 input.
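
As an illustration of that use, a translation of this kind maps one
eight-bit input code to the special character it stands for in ISO
8859-2.  The line below is a sketch of the technique, not a quotation
from any groff macro file.

.\" Input byte 161 (0xA1) is A with ogonek in ISO 8859-2.
.trin \[char161]\[A ho]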

> i.e. translation should happen on output, not on input,

I'm not sure I agree with that, given the above.  When I see `tr` used,
it is typically to make input more convenient.

Here's an example from the NetHack Guidebook:

.ft CR
.tr -\-@\(rs
.TS
box center expand;
C C.
y  k  u 7  8  9
@ | /   @ | /
h- . -l 4- . -6
/ | @   / | @
b  j  n 1  2  3
\fR(\fBnumber_pad\fP off)       \fR(\fBnumber_pad\fP on)
.TE
.tr --@@
.ft R

You can see that it's easier to verify that the ASCII art lines up
correctly if you set up a translation to represent the backslash.

Another approach sometimes seen is to change or disable the escape
character, but that solution is not available here because tbl(1) is
used for this section of the input.
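
Where tbl(1) is not involved, that alternative can be sketched like
this: `eo` disables the escape mechanism, and an argumentless `ec`
restores the backslash as the escape character.  The art is a trimmed,
made-up variant of the example above.

.nf
.ft CR
.eo
y  k  u
 \ | /
h- . -l
 / | \
b  j  n
.ec
.ft R
.fi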

> meaning that using .hla might not be sufficient to switch between cs
> and fr, because that doesn't switch the encoding used.

I'll have to think about this.  It might not matter in the
wide-character-type/UTF-8-reading GNU troff future.

While I don't have an ETA for that, I don't want to complicate the
formatter itself with any features to make eight-bit encodings more
convenient to use.  That feels like throwing good money after bad.
UTF-8 is the future.  Heck, it's the present, most places.

> That's just my thoughts based on the documentation, though.  I don't
> have the time to verify this.

It does take time to research these issues.

> groff(7) does mention it, but it's among the last things mentioned in
> the Hyphenation section. The texinfo manual doesn't mention it at all
> in its section 5.1.3 about Hyphenation where I would expect it.  (At
> least the online version -- I haven't found any git source for it,
> just tarballs.)

You can review up-to-date documentation here:

https://www.dropbox.com/sh/17ftu3z31couf07/AAC_9kq0ZA-Ra2ZhmZFWlLuva?dl=0

The Git source for the bleeding edge of our documentation is at:

https://git.savannah.gnu.org/cgit/groff.git/tree/doc/
https://git.savannah.gnu.org/cgit/groff.git/tree/man/

> The reason I was suggesting this is the fact that once one disables
> hyphenation through .nh or .hy 0, the only way to re-enable it, as far
> as I am aware, is to issue .hy with the proper hyphenation mode, which
> depends on the language and might not be known by the user.
> 
> Separating the hyphenation portion into its own macro file would allow
> one to re-enable hyphenation by issuing .mso hyLANG.tmac instead of
> having to research the appropriate mode for the given language.
> 
> A simple macro could then be constructed which would offer a
> friendlier interface to hyphenation. It could work like this:
>   .HY
>       Return to previous hyphenation settings (if set).
>   .HY 0
>       Disable hyphenation.
>   .HY LANG
>       Set hyphenation parameters appropriate for language LANG.
> 
> This would allow usage like so:
>   .HY cs
>   Příliš žluťoučký kůň úpěl ďábelské ódy...
>   .br
>   .HY 0
>   .na
>   https://\:www.gnu.org/\:software/\:groff/\:manual/\:groff.html
>   .br
>   .ad
>   .HY

> Of course, this wouldn't be necessary if .hy worked like .ad,

That's actually a bad example, but a very popular misconception.  You
probably mean "if .hy worked like .ps".  Or .ft, .ev, .in, .ll,
.ls, .lt, .po, or .vs, or groff's .fam, .fcolor, .gcolor, or .pvs.

Without an argument, neither .hy nor .ad restores the "previous"
hyphenation mode or adjustment setting, respectively.

$ cat EXPERIMENTS/argumentless-hy.groff
.tm A: .hy=\n[.hy]
.hy 14
.tm B: .hy=\n[.hy]
.hy
.tm C: .hy=\n[.hy]
$ ~/groff-stable/bin/nroff EXPERIMENTS/argumentless-hy.groff
A: .hy=4
B: .hy=14
C: .hy=1

$ cat EXPERIMENTS/argumentless-ad.roff
.tm A: .j=\n(.j
.ad c
.tm B: .j=\n(.j
.ad r
.tm C: .j=\n(.j
.ad
.tm D: .j=\n(.j
$ ~/groff-stable/bin/nroff EXPERIMENTS/argumentless-ad.roff
A: .j=1
B: .j=3
C: .j=5
D: .j=5
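
Given that behavior, one way to get a real "previous setting" today is
to save the read-only registers yourself before changing anything.  A
minimal sketch; the register names `saved-hy` and `saved-ad` are
hypothetical.

.\" Save the current hyphenation mode and adjustment mode.
.nr saved-hy \n[.hy]
.nr saved-ad \n[.j]
.nh
.ad l
(text set ragged-right and unhyphenated)
.\" Restore both from the saved values.
.hy \n[saved-hy]
.ad \n[saved-ad]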

I think these are horrible warts in the *roff language that an
iconoclast should have smashed years ago.  But they work fine for the
most common cases (temporary disablement with `nh` and `na`,
respectively) and for people who don't spend much time composing,
revising, and formatting documents, but instead pose as subject-matter
experts on the Internet.

>  but (unless I am mistaken again :) it doesn't and cannot due to
>  desired compatibility with AT&T troff.

You might be interested in a feature in the forthcoming groff 1.24.0:

NEWS:
*  A new request, `hydefault`, and read-only register, `.hydefault`,
   manage the default automatic hyphenation mode of an environment.
   This resolves a long-standing problem of *roff formatting.

     When processing input like this,
     .nh
     and we temporarily shut off automatic hyphenation,
     .hy
     the foregoing request would not do exactly what we expect.

   AT&T and other troffs would set the hyphenation mode to 1 instead of
   the previous value; for GNU troff this was not an appropriate value
   for the English hyphenation patterns.  (For example, "alibi" would
   break as "ali-bi" instead of "al-ibi" after this argumentless `hy`
   invocation.)  With updates to groff's localization files, the
   foregoing input now works as desired.
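
A sketch of how that might look in a document follows.  It assumes that
`hydefault` takes a hyphenation mode argument the same way `hy` does,
which the NEWS item above does not spell out; check the 1.24.0
documentation before relying on it.

.\" Assumption: hydefault accepts a mode argument like hy.
.hydefault 4
.nh
This text is set with automatic hyphenation off.
.hy
.tm .hy=\n[.hy] .hydefault=\n[.hydefault]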

I have plans to fix the argumentless `ad` request, but just today I
decided to kick that out past 1.24.

https://savannah.gnu.org/bugs/?65954

Regards,
Branden
