[Groff] Status of the portability work, and plans for the future


From: Eric S. Raymond
Subject: [Groff] Status of the portability work, and plans for the future
Date: Sun, 7 Jan 2007 17:54:27 -0500
User-agent: Mutt/1.4.2.2i

I've been quiet for the last several days because I've been working hard on
some of the issues I've brought up on this list.  I've included the
current draft of the report on portable troff requests below.  After
it, some discussion of what I have planned when the report is finished.

------------------------------------------------------------------------------
Third draft of the report on defining a portable subset of troff requests.

PORTABLE FEATURES:

Portable requests:
    .de .ds .fi .ft .ie .if .ig .nf .nr .rm .rn .so .sp.

The .sp request is portable in the sense that it can be portably
used to generate a visual paragraph break without terminating list 
markup, but use of an argument to control vertical spacing is not
portable.

While .if/.ie is in the portable set, the expression set allowable in
conditionals has to be seriously restricted to be portable across
rendering programs including doclifter and Unix *roff.  The
parenthesized form of conditional, and the groff extended logical
operators, are not portable.

                TODO: DESCRIBE PORTABLE EXPRESSIONS BETTER

Note that .br/.nl, .ti, .ta, and .in are *not* in the portable set.
These cannot be translated structurally by doclifter, and man-to-HTML
translators tend to ignore them or give useless results as well.
Fortunately, these can almost always be replaced by uses of .nf/.fi,
.RS/.RE, and tbl markup (which doclifter handles).

Portable escapes:  
    \. \^ \' \` \- \$ \* \& \| \0 \<SP>
    \d \e \f \u \n

These are almost all the escapes actually needed to interpret the
entire 13,447-page Fedora Core 6 corpus into DocBook.  The corpus
makes sporadic and very rare use of \v, \w, \h, \o, and \k
(approximately once each), but these are not essential and can be
patched out.
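
A minimal sketch of how such a tally could be made, assuming a page is
available as a plain string.  The names tally_nonportable and
PORTABLE_ESCAPES are made up for illustration; this is not how doclifter
does it.  Glyph introducers \( and \[ are skipped because glyphs are
treated separately below.

```python
import re
from collections import Counter

# Escapes the draft lists as portable; "\<SP>" appears here as a literal space.
PORTABLE_ESCAPES = set(".^'`-$*&|0 defun")

# Assumption: \( and \[ introduce glyph names, which this tally ignores.
GLYPH_INTRODUCERS = set("([")

# A backslash followed by one character names the escape; arguments are ignored.
ESCAPE_RE = re.compile(r"\\(.)")


def tally_nonportable(text):
    """Count escape characters that fall outside PORTABLE_ESCAPES."""
    counts = Counter()
    for line in text.splitlines():
        # Drop troff comments (\" to end of line) before scanning.
        line = line.split('\\"', 1)[0]
        for match in ESCAPE_RE.finditer(line):
            name = match.group(1)
            if name in PORTABLE_ESCAPES or name in GLYPH_INTRODUCERS:
                continue
            counts[name] += 1
    return counts


sample = "Use \\fBfoo\\fP with \\w'abc'u and \\h'2n' here.\\\" \\k comment\n"
print(tally_nonportable(sample))
```

Run over every page in a corpus and summed, counts like these are what
would show which escapes are rare enough to patch out.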

I noted previously that \w is *not* portable.  In general, we can't count
on the viewer to be able to render horizontal or vertical motions with
precision, we can't count on it to know font sizes, and we can't even
count on it to know whether its output uses fixed- or variable-width
fonts.  As it turns out, interpreting \w is not necessary for
doclifter, either -- all man-page uses (at least, in my corpus) are
either inside macros which are interpreted by other means or part of
Synopsis syntax.

Werner Lemberg wanted to know the status of \~.  I found 17 uses within the
groff documentation and 4 outside it.  Of those 4, two were errors.  So
it's not much needed for manual pages, which is a good thing as it is 
not portable.  In particular, I was unable to discover any corresponding
ISO entity or Unicode character.

Portable glyphs:

The glyphs \*R, \*(Tm, \*(lq and \*(rq (registered, trademark, left quote, and
right quote) are described in groff_man(7).  Every man-page viewer I examined
except the crufty old Perl man2html supports these.

I think we can declare Latin-1 and the intersection of groff glyphs with
HTML entities portable as well, but verifying this will need more work and
it will require ignoring the limitations of some obsolete translators such as
the Perl man2html.

                TODO: NEEDS MORE INVESTIGATION

Portable registers:

After investigating the groff builtin registers, I have concluded that
the only portable built-in register is .$, the macro argument count
register.  Any other time troff markup references a built-in register,
it is about to do something that is dependent on knowing about the
physical rendering medium, such as sub-character motion or drawing.

FEATURE SUPPORT IN OTHER MANPAGE-RENDERING PROGRAMS:

More detailed notes on feature support in programs other than groff follow.
Programs are listed roughly in decreasing order of groff
compatibility.

Heirloom troff: 

Gunnar Ritter, the maintainer, says: "supports almost all groff
requests; a complete list is in
<http://heirloom.sourceforge.net/doctools/troff.pdf>.  The exceptions
are mainly in areas which are irrelevant in the context of manual
pages, like debugging or color support. The only unsupported request
which sometimes occurs in manual pages is .fam."

I checked on the last bit; .fam is used in exactly 5 pages in the corpus,
two of which are groff documentation.  We can safely not support it.

Whatever subset of groff glyphs this supports, it's bound to be larger
than that of non-troff-descended rendering programs and thus
will not constrain the portable subset.  Thus I have not enumerated
the supported glyphs here.

Unix troff classic:

Supports all the features described above.  Because all the other
programs described here were modeled on it, it is not going to be a
constraint on the portable set.

doclifter:

doclifter supports the following troff requests: .ab .am .as .bp .c2
.cc .cu .de .ds .em .fi .ft .ie .if .ig .nf .mso .nop .nr .pm .rm
.return .rn .rr .shift .so .sp .tm .tr .ul.

doclifter does not support .fam.  It treats .do as a no-op, a rather
dodgy procedure which (because of the restricted ways .do is used)
nevertheless gives good results.

doclifter handles *all* predefined groff glyphs, mapped to ISO escapes
and Unicode -- except the old-style Bell Labs bracket-pile characters.

doclifter handles the entire portable set of escapes as described above,
and also \c, \<CR>, some cases of \w, and some cases of \o.  (The remaining
cases are passed through with a warning.)

manServer:

Gunnar also reports: "the manServer script by Rolf Howarth lacks
support for .bp, .ul, .cu, .tm, .as, .em, .am, .rr, .pm, .cc,
.c2, .ab, and .do, so I think these also do not belong on the
list of safe requests. It lacks reasonable support for the \c
and \<CR> escape sequences."

Developer's docs are here: 
<http://www.squarebox.co.uk/users/rolf/download/manServer.shtml>.
I read through the sourcecode to determine its capabilities.

manServer handles these troff requests: .ds .nr .ti .rm .rn .de .ig
.so .ps .ft .nf .fi .br .sp .ta

manServer handles escapes \., \', \`, \&, \^, \0, \d, \e, \f, \n, \s, \u. 

manServer handles cases of \o that reduce to Latin-1 and Latin-2 accented
characters.

The KDE manpage viewer:  

Gunnar writes: "At its core, it seems to be a derivative of the
man2html program by Richard Verhoeven which is also part of Andries
Brouwer's man package:
<http://websvn.kde.org/trunk/KDE/kdebase/kioslave/man/man2html.cpp?rev=416894&view=auto>.
From [the doclifter list of] requests, it lacks support for .bp,
.cu, .do, .em, .pm, .rr, and .ul. It implements all escape sequences
you consider as safe, and has a large list of supported special
characters which I am too lazy to examine in detail."
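
As a rough cross-check, the request lists quoted so far can be
intersected mechanically.  This is only a sketch: the three sets are
transcribed from this draft, the KDE set is derived on the assumption
that Gunnar's remark refers to the doclifter list exactly as given
above, and the name common_requests is made up for illustration.

```python
# Request lists as quoted in this draft (leading dots dropped).
DOCLIFTER = set(
    "ab am as bp c2 cc cu de ds em fi ft ie if ig nf mso nop nr pm rm "
    "return rn rr shift so sp tm tr ul".split()
)
MANSERVER = set("ds nr ti rm rn de ig so ps ft nf fi br sp ta".split())

# Assumption: the KDE viewer supports the doclifter list minus the
# requests Gunnar says it lacks.
KDE = DOCLIFTER - set("bp cu do em pm rr ul".split())


def common_requests(*viewers):
    """Return the requests supported by every viewer given."""
    return set.intersection(*viewers)


common = common_requests(DOCLIFTER, MANSERVER, KDE)
print(" ".join("." + name for name in sorted(common)))
```

With the lists as transcribed here, the result is the portable request
list minus .ie and .if, which the manServer list in this draft does not
mention.  The sketch cannot tell whether that is a real gap or an
omission in the list.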

The KDE viewer supports built-in registers: n, t, o, e, l, .$, .A, .T, .V

Supported escapes are: \c, \e, \f, \n, \p, \s, \t, \w, \<SP>,
\$, \&, \', \`, \-, \., and others outside the set that man pages
actually use.

Escapes \0, \~, \|, and \^ are all mapped to an ordinary &nbsp; so the
latter two cannot really be said to be supported.  There is also faked
support for \z, \k, \!, \a, \d, \r, \u.
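
To illustrate why that mapping loses information: in troff, \| and \^
are narrow spaces (1/6 em and 1/12 em), so replacing them with a
full-width non-breaking space changes the spacing.  The sketch below
mimics the behavior described above; the dict and function names are
hypothetical and not taken from the KDE source.

```python
# Assumption: a plain dict stands in for whatever table the viewer uses.
SPACING_ESCAPES = {
    "\\0": "&nbsp;",  # digit-width space
    "\\~": "&nbsp;",  # unbreakable, stretchable space
    "\\|": "&nbsp;",  # 1/6 em narrow space
    "\\^": "&nbsp;",  # 1/12 em half-narrow space
}


def render_spacing(text):
    """Replace each spacing escape with the same HTML entity."""
    for escape, entity in SPACING_ESCAPES.items():
        text = text.replace(escape, entity)
    return text


print(render_spacing("a\\|b\\^c\\0d"))  # prints a&nbsp;b&nbsp;c&nbsp;d
```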

The glyph set includes the Greek alphabet (minuscule and majuscule),
the groff Latin-1 characters, the "registered", "copyright", and
"trademark" glyphs, and much of the classic troff glyph set.

        TODO: CHARACTERIZE THE MAN2HTML GLYPH SET BETTER
 
man2html:

This is not the C program that the KDE browser is based on, but a
crude Perl script that seems to have been written in the mid-1990s and
been last modified in 2003.  There is a Savannah project page,
dormant, here: <http://savannah.nongnu.org/projects/man2html/>.

No glyph, escape, or register support at all. It's a good thing this
has been obsolesced by more recent converters or it would choke the
portable subsets of those right down to nothing.

(This is the man2html I was thinking about when I dismissed its 
translations as "crappy".  I was right... :-))

Xman:

        TODO: FIND OUT WHAT XMAN DOES

TKman:

TKman relies on nroff to format pages, then analyzes the generated ASCII
looking for section headers, references to manual pages, and other cliches.
It does no interpretation of troff markup itself, and is thus not a
constraint on the set of portable features.
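
For illustration only, here is a sketch of that style of analysis.  The
heuristics are my guesses, not TKman's actual rules: it assumes nroff
marks bold and underline by backspace overstriking, that section headers
are flush-left all-caps lines, and that page references look like
name(section).  All names in the code are hypothetical.

```python
import re

OVERSTRIKE_RE = re.compile(r".\x08")           # any character followed by a backspace
HEADER_RE = re.compile(r"^[A-Z][A-Z0-9 ]*$")    # flush-left, all-caps line
XREF_RE = re.compile(r"\b([A-Za-z0-9_.+-]+)\((\d[a-z]*)\)")


def scan_formatted_page(text):
    """Return (section headers, page references) found in nroff output."""
    plain = OVERSTRIKE_RE.sub("", text)  # undo bold/underline overstrikes
    headers, xrefs = [], []
    for line in plain.splitlines():
        if HEADER_RE.match(line):
            headers.append(line)
        xrefs.extend(XREF_RE.findall(line))
    return headers, xrefs


def bold(s):
    """Overstrike each character with itself, as nroff does for bold."""
    return "".join(c + "\b" + c for c in s)


page = (
    bold("NAME") + "\n"
    "       ls - list directory contents\n\n"
    + bold("SEE ALSO") + "\n"
    "       dir(1), stat(2)\n"
)
print(scan_formatted_page(page))
```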

Rosetta/PolyglotMan:

        TODO: FIND OUT WHAT POLYGLOTMAN DOES
------------------------------------------------------------------------------

Once I have the set of portable man-page constructs well characterized, I
intend to develop a set of patches for the groff distribution that will do the
following:

1) Trim the groff manual pages so they use only the portable subset, plus
the .SY and .OP macros that Werner and I have characterized.

2) Add a section on portable *roff requests to groff_man(7), including the
recommendation to define .SY, .OP, .EX/.EE and .DS/.DE locally for a 
while until the new man macros have time to propagate everywhere.

3) Add definitions of .EX/.EE and .DS/.DE to the man macros.

While I am doing these things, I will also be upgrading doclifter in
various ways:

1)  The next feature to go in will be the ability to 
recognize ad-hoc tables made with .nf/.ta/.fi and compile them into
DocBook table markup.

2) doclifter will be taught to recognize .SY and .OP.

3) I plan to add a validator option to doclifter that will issue
warnings on use of any request, escape, or register not in the defined
portable set.  By this means, man-page authors will be able to conveniently 
check the conformance of their pages to the portable set.
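
A minimal sketch of what the request half of such a check might look
like, assuming the portable request list from the draft above.  This is
not doclifter code; check_requests and PORTABLE_REQUESTS are made-up
names, and the rule for skipping macros is a simplifying assumption.

```python
import re

# Portable requests as listed in the draft (leading dots dropped).
PORTABLE_REQUESTS = set("de ds fi ft ie if ig nf nr rm rn so sp".split())

# A request line starts with a control character (. or '), optional
# blanks, then a name.  Comment lines (.\") do not match.
REQUEST_RE = re.compile(r"^[.'][ \t]*([A-Za-z0-9]+)")


def check_requests(text):
    """Return (line number, request) pairs for non-portable requests."""
    warnings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = REQUEST_RE.match(line)
        if match is None:
            continue
        name = match.group(1)
        # Assumption: names containing an uppercase letter are man
        # macros (.SH, .TP, ...), not raw requests, and are skipped.
        if any(c.isupper() for c in name):
            continue
        if name not in PORTABLE_REQUESTS:
            warnings.append((lineno, name))
    return warnings


page = ".TH FOO 1\n.SH NAME\nfoo \\- do things\n.sp\n.ti 4\n.nf\ncode\n.fi\n.br\n"
for lineno, name in check_requests(page):
    print(f"line {lineno}: .{name} is not in the portable set")
```

A real validator would also need to check escapes, registers, and
non-portable arguments such as a spacing value on .sp, none of which
this sketch attempts.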

I want to get these patches out in a 1.20 release in time to make the 
Fedora 7 development freeze in late January.  

Yes, I know, Bernd Warken is in love with the hyperextended macros on
groffer.1 and elsewhere, and will go ballistic.  Too bad for him; we've
established that they break too much software to live. Are there any other 
objections, either substantive or procedural, to this work plan?  Any
constructive criticism or discussion I have not incorporated?

This is going to be a lot of work.  There are things I could use help with:

1) I don't have to be the one to implement .SY/.OP/.EX/.EE/.DS/.DE in 
an-old.tmac; someone else could do that.

2) Any help in filling out the TODOs in the above draft would speed
things up measurably.  Every hour I don't have to spend on research
others could do will be spent on related things only I am presently
qualified to work on, like doclifter internals.  Gunnar?  Anybody?

Here are three related tasks not on my schedule:

1)  Once we know what the portable set is, groff itself should issue
warnings when a man page uses a non-portable feature.  This should
be taken on by somebody who understands groff internals better than I
do.

2) Patches for .SY/.OP/.EX/.EE/.DS/.DE support should be developed for
the KDE help browser and shipped as soon as possible.

3) .SY/.OP/.EX/.EE/.DS/.DE will also be needed in Heirloom troff.  
This one is pretty obviously Gunnar's baby.

Open issues for discussion:

1) In defining the portable subset, do we want to take a conservative
approach that embraces only the intersection of the feature sets of
all viewers, or set a floor based on the capabilities of respectable 
modern viewers like the KDE help browser?

In practice, this question comes down to whether we're going to bless
Latin-1 as a portable character set and the groff glyphs mapping to
Latin-1 as portable.  

I favor setting a floor that includes Latin-1.

2) When, in the portable-subset description, can we say that .EX/.EE,
.SY/.OP, and .DS/.DE should be considered portable and no longer 
need local definitions?

I think two years from when we ship 1.20 seems reasonable.  That would give
groff-1.20, (hypothetical) KDE help-browser patches, and an update of
Heirloom troff time to propagate.
-- 
                Eric S. Raymond <http://www.catb.org/~esr/>



