
"transparent" output and throughput, demystified


From: G. Branden Robinson
Subject: "transparent" output and throughput, demystified
Date: Fri, 30 Aug 2024 18:07:57 -0500

Hi folks,

Those of us who worked with groff 1.22.4 may remember a couple of
diagnostic messages that gobsmacked one with their incomprehensibility.

Here's the source code that produced them.

  error("can't transparently output node at top level");
  error("can't translate %1 to special character '%2'"
        " in transparent throughput",
        input_char_description(cc),
        ci->nm.contents());

For groff 1.23.0, I silenced them with a blunt instrument, since I
concluded that they weren't entirely spurious, but nobody seemed to have
any idea what, exactly, anyone was supposed to do about them.  In
general usage scenarios, a diagnostic that the user cannot address is a
diagnostic that should not be issued.

In a recent thread, Peter noted the resurrection of these messages.[1]
At that time, I promised an explanation of these long-vexing problems.

In the course of documenting my fix for Savannah #63074[2]--a process
that is fairly involved and not yet complete--I ended up writing a
change to the groff_diff(7) man page that seems to cover the bases.

Here's some language that I have queued up for my next push.  First, a
comment from the man page source that summarizes where we (I) still have
work to do.

.\" TODO: When we get this giant headache generalized and adapted to the
.\" `\!` escape sequence and `device`, `output`, `cf`, and `trf`
.\" requests, move this discussion into a dedicated subsection above.

And now, the explanation.

groff_diff(7):

     \X'contents'  GNU troff transforms the argument to the device
                   control escape sequence to avoid leaking to device‐
                   independent output data that are unrepresentable in
                   that format, and to address the problem of expressing
                   character code points outside of the Unicode basic
                   Latin range in an output file format that restricts
                   itself to that range.  (See subsection “Basic Latin”
                   of groff_char(7).)  The typesetting of such
                   characters is a problem long‐solved in device‐
                   independent troff by the “C” command; see
                   groff_out(5).  The expression of such characters in
                   other contexts, such as device extension commands,
                   was not addressed by the same design.  Where
                   possible, GNU troff represents such characters in
                   device‐independent but non‐typesetting contexts using
                   its notation for Unicode special character escape
                   sequences; see subsection “Special character escape
                   forms” of groff_char(7).

                   GNU troff converts several ordinary characters that
                   typeset as non‐basic Latin code points to code points
                   outside that range to avoid confusion when these
                   characters are used in ways that are ultimately
                   visible, as in tag names for PDF bookmarks, which can
                   appear in a viewer’s navigation pane.  These ordinary
                   characters are “'”, “-”, “^”, “`”, and “~”; others
                   are written as‐is.

                   Special characters that typeset as Unicode basic
                   Latin characters are translated to basic Latin
                   characters accordingly.  For this transformation,
                   character translations and definitions are ignored.
                   So that any Unicode code point can be represented in
                   device extension commands, for example in an author’s
                   name in document metadata or as a usefully named
                   bookmark or hyperlink anchor, GNU troff translates
                   other special characters into their Unicode special
                   character notation.  Special characters without a
                   Unicode representation, and escape sequences that do
                   not interpolate a sequence of ordinary and/or special
                   characters, produce warnings in category “char”.

I hope to boil some fat off of that.
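
To make the \X transformation described above concrete, here's a rough
before-and-after.  Treat the second snippet as approximate: the exact
bytes depend on the groff version and output device, and the "ps: exec"
payload is only a placeholder.  Given input such as

  A line of text.\X'ps: exec (caf\['e]) pop'

and a run of "groff -Z -T ps", the corresponding device control command
in the intermediate output should come out roughly as

  x X ps: exec (caf\[u00E9]) pop

with the special character re-expressed in \[uXXXX] notation instead of
leaking a formatter-internal representation or tripping the old "can't
translate ... in transparent throughput" error.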

I want to emphasize that error and warning diagnostics will remain
possible.  But they should not occur when a document or macro package
attempts to do things that are "sane", like storing accented letters
appearing in an author's name or a section heading into a string and
then interpolating that string in a device extension/control escape
sequence.
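
In roff terms, the "sane" case looks something like this sketch.  The
string name and the "xyz: author" device extension command are
placeholders of my own; substitute whatever your postprocessor and
macro package actually expect.

  .\" An author name with accented letters, stored in a string.
  .ds doc-author Ren\['e]e Fran\[,c]ois\"
  .\" Interpolate it in a device extension escape sequence; "xyz:
  .\" author" is a made-up command name, not a real driver extension.
  \X'xyz: author \*[doc-author]'

With the behavior described above, the accented letters should travel
through as \[u00E9] and \[u00E7] (or as basic Latin where an equivalent
exists) rather than provoking a diagnostic.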

Peter's mom(7) package is more muscular even than that; I've noticed
that in his example documents he is not shy even of including vertical
motions in author names.

  contrib/mom/examples/mom-pdf.mom:
    .AUTHOR "Deri James" "\*[UP .5p]and" "Peter Schaffter"

Inside the formatter, a vertical motion becomes a "node" and has no
possible representation in a device extension/control escape sequence.

In the future, I want the formatter to complain about such
impossibilities, but not yet--it's not fair to document and macro
package authors to be so prescriptive without providing a handy
mechanism for cleaning such things out of strings that are destined for
device extension/control commands.  To date, solutions have included
creating a diversion, interpolating the string inside it, then using the
`asciify` or `unformat` requests on it to strip things like
vertical motions (and `chop` of course to rip out the undesired newline
at the end of the diversion).  This is painful because formatting things
into a diversion as a rule _creates_ more nodes than it eliminates.
Hence the unformatting.
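
For the record, here is a minimal sketch of that diversion-based
workaround.  The macro and string names are mine and purely
illustrative; a robust version would also save and restore the fill
mode and flush any partially collected line before diverting.

  .\" A string polluted with vertical motions, as in the mom example.
  .ds raw-author \v'-.5p'Deri James\v'.5p'\"
  .de scrub
  .  di scrubbed\"      divert the formatted string
  .  nf
  \\*[\\$1]
  .  di\"               end the diversion
  .  asciify scrubbed\" reduce nodes back to input characters where possible
  .  chop scrubbed\"    drop the newline the diversion picked up
  .  fi
  ..
  .scrub raw-author
  .\" \*[scrubbed] can now, one hopes, be interpolated safely in a
  .\" device extension/control escape sequence.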

It would be cleaner and simpler to provide a mechanism for processing a
string directly, discarding escape sequences (like vertical motions or
break points [with or without hyphenation]).  This point is even more
emphatic because of the heavy representation of special characters in
known use cases.  That is, to "sanitize" (or "pdfclean") such strings by
round-tripping them through a process that converts a sequence of easily
handled bytes like "\ [ 'a ]" or "\ [ u 0 4 1 1 ]" into a special
character node and then back again seems wasteful and fragile to me.

But, to get things where I'd like to see them, we need an in-language
string iterator for the groff language.  And because strings, macros,
and diversions can be punned with each other, in practice that means we
will need an iterator that can handle any of these.  That, in turn, means
that we will also require a new conditional expression operator to test
whether an element of a string is a "node".

I haven't been able to get all of that together in the year since we
released groff 1.23.  My understanding of the formatter still has
significant lacunae.  Ah well.  Maybe for 1.25.

So, in the meantime, my plan is to silently discard things from device
extension/control commands that an output device would not be able to do
anything useful with.

Thanks for your patience with this explan-a-thon.[3]

Regards,
Branden

[1] https://lists.gnu.org/archive/html/groff/2024-08/msg00045.html
[2] https://savannah.gnu.org/bugs/?63074

[3] In my opinion, the words "special" and "transparent" are the two
    most relentlessly and unhelpfully overused terms in troff
    literature.  They each mean several different things, and make me a
    cross and grumpy (semi-official) groff maintainer.  Expect
    documentary reforms addressing these ambiguities.
