[Groff] hyphenation problems
From: Werner LEMBERG
Subject: [Groff] hyphenation problems
Date: Sun, 04 Feb 2001 20:34:05 +0100 (CET)

Dear friends,
more than a year ago I began to maintain groff, and the following
problem was the very reason why I did so:
> None of this avoids the central issue, which is: why does groff
> suppress hyphenation break at "-" in the context of words with
> escapes like "\ " in them?
To be more specific, consider the following German input file `foo':

  Eingabe-Kodepunkt\ 0xABCD
  Eingabe-Kodepunkt\ 0xABCD
  Eingabe-Kodepunkt\ 0xABCD
  Eingabe-Kodepunkt\ 0xABCD
  Eingabe-Kodepunkt\ 0xABCD

If you say `groff -Tlatin1 foo', you get this:

  Eingabe-Kodepunkt 0xABCD Eingabe-Kodepunkt 0xABCD Eingabe-
  Kodepunkt 0xABCD Eingabe-Kodepunkt 0xABCD
  Eingabe-Kodepunkt 0xABCD

As you can see, the fifth word (pushed whole onto the third line) isn't
hyphenated, whereas the third word is. This is definitely a bug, which I
believe is fixed now -- the current groff snapshot produces

  Eingabe-Kodepunkt 0xABCD Eingabe-Kodepunkt 0xABCD Eingabe-
  Kodepunkt 0xABCD Eingabe-Kodepunkt 0xABCD Eingabe-
  Kodepunkt 0xABCD

(Ruslan, this also fixes the hyphenation problem with boxes you've
encountered).
Nevertheless, the applied changes might have side effects, so I urge
you to test the snapshot rigorously with large volumes of text, checking
whether hyphenation has changed unexpectedly.
It is probably of interest to know exactly when and how GNU troff
hyphenates a line, so here are the rules (this will eventually go into
groff.texinfo). This will also help you to identify hyphenation
problems.
======================================================================
GNU troff calls the routine environment::possibly_break_line() in the
following cases:
1. If a space is encountered.
2. If a newline is encountered (not preceded by `\c').
3. If a `br' request has been seen.
4. If a token node is found in the input stream.
a. This happens for the following objects: "\ ", \:, \|, \^, \?,
\0, "\,", \a, \b, \d, \D, \h, \l, \L, \o, \r, \t, \u, \v, \x,
\X, \Y, \z, and \Z.
b. A diversion or box is inserted into the text. Usually, all
input in diversions and boxes has already been converted to
nodes. The reality is a bit more complicated, since there are
ways to avoid or undo this conversion
(e.g. using `\!' or `.asciify').
possibly_break_line() does nothing if not in fill mode, or if a tab or
field is active, or if inside of a `dummy' environment (e.g. within
.if "..."..." or \w'...').
possibly_break_line() will call environment::hyphenate_line() in the
following cases:
5. `\p' is found in the input stream in cases 1. and 2.
6. If the total length of the nodes processed so far minus the width
of the last node is larger than the text length. This is the
normal situation at the end of a line.
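
As a rough illustration of the conditions above, here is a minimal Python
sketch. It is not groff's code: the `Env` class, its field names, and
`should_hyphenate()` are made up for this example, widths are counted in
characters, and a line length of 65 characters is assumed for -Tlatin1.
The model also ignores that rule 5 applies only in cases 1. and 2.

```python
from dataclasses import dataclass


@dataclass
class Env:
    # Hypothetical stand-in for the state troff consults; not groff's real class.
    fill_mode: bool = True
    tab_or_field_active: bool = False
    dummy: bool = False


def may_break(env):
    """possibly_break_line() does nothing unless all of these hold."""
    return env.fill_mode and not env.tab_or_field_active and not env.dummy


def should_hyphenate(env, node_widths, line_length, saw_backslash_p=False):
    """Toy model of rules 5 and 6: would hyphenate_line() be called?"""
    if not may_break(env) or not node_widths:
        return False
    if saw_backslash_p:
        # Rule 5: an explicit break request forces the call.
        return True
    # Rule 6: everything before the last node already exceeds the line.
    return sum(node_widths) - node_widths[-1] > line_length


# Two full words, then the third word up to and including the
# unbreakable space escape (modelled as one node of width 1).
line = "Eingabe-Kodepunkt 0xABCD " * 2 + "Eingabe-Kodepunkt"
widths = [1] * len(line) + [1]

print(should_hyphenate(Env(), widths, 65))                 # True: 67 > 65
print(should_hyphenate(Env(fill_mode=False), widths, 65))  # False: not in fill mode
```

With these assumed widths, the call at the escape in the third word is the
first one where the nodes collected so far no longer fit on the line.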
hyphenate_line() will do the following:
7. It searches backwards from the current position for a boundary
node used as a starting point. Most of the escape sequences
listed in 4.a are considered as boundary nodes, together with
other horizontal and vertical space nodes.
8. It continues searching backwards for usable nodes until it finds
another boundary, checking for a leading `\%'.
9. If no leading `\%' has been encountered, hyphenation codes are
adjusted if necessary so that the nodes can be added to the
breakpoint list.
10. If no leading `\%' has been encountered and the hyphenation flags
fit, the hyphenation algorithm is applied to the found sequence
of nodes, building a breakpoint list.
11. The nodes are chained into a new list, with discretionary
hyphens inserted according to the breakpoint list and wherever the
current character is a hyphenation character (usually `-') or
`\%' within the scanned node sequence.
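
The part of rule 11 that deals with the hyphenation character can be
sketched as follows. This is only a toy model in Python: the function
name `explicit_break_positions()` is invented here, `\%' and the
pattern-based breakpoint list are not modelled, and ignoring a
hyphenation character at the very start or end of the word is an
assumption of this sketch rather than a statement about groff.

```python
def explicit_break_positions(word, hyphenation_char="-"):
    """Return the offsets in `word` after which a break is allowed because
    the preceding character is the hyphenation character.

    Assumption of this sketch: a hyphenation character at the very start
    or end of the word does not create a break opportunity.
    """
    return [
        i + 1
        for i, ch in enumerate(word)
        if ch == hyphenation_char and 0 < i < len(word) - 1
    ]


word = "Eingabe-Kodepunkt"
for pos in explicit_break_positions(word):
    print(word[:pos], "|", word[pos:])   # Eingabe- | Kodepunkt

print(explicit_break_positions("0xABCD"))  # []
```

The second call shows why scanning only `0xABCD' cannot yield a
breakpoint in this model: it contains no hyphenation character.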
Finally, possibly_break_line() will call
environment::choose_breakpoints() to find the best breakpoint from the
new list, subject to constraints such as the hyphenation flags or the
hyphenation margin.
======================================================================
These are the old rules. The bug which I've fixed is in rule 7.
Let's look again at the above example: Here I've marked the points
where GNU troff searches for breakpoints:

  Eingabe-Kodepunkt\ 0xABCD
                   ^       ^

The third word is hyphenated because troff calls possibly_break_line()
at `\ ', which happens to be a boundary character. The space before
the word is another boundary character, so the sequence
`Eingabe-Kodepunkt' is checked for breakpoints.
The fifth word isn't hyphenated because the line overflows only when
troff calls possibly_break_line() after the space which follows the
word. The previous boundary character is then `\ ', so only the
sequence `0xABCD' is scanned for breakpoints, and it contains none.
To fix this, I've changed rule 7 as follows:
7. It searches backwards from the current position for a boundary
node used as a starting point. If hyphenate_line() has been
called via 4.a or 4.b, use the current node instead as a starting
point. Only some of the escape sequences listed in 4.a (usually
causing vertical movement) and horizontal space nodes are taken
as boundary nodes.
In the above example, the `\ ' no longer counts as a boundary, which
gives the improved result.
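
To make the difference between the old and the new rule 7 concrete, here
is a small Python sketch of the backward scan. It is a toy model, not
groff's implementation: nodes are plain strings, whole words stand in
for runs of glyph nodes, and the names `scanned_sequence()`, `SPACE`,
and `ESC_SPACE` are invented for this example. It also assumes that the
node which triggered the call is always the starting point.

```python
SPACE = " "         # ordinary word space: a boundary under both rules
ESC_SPACE = "\\ "   # unbreakable space escape: a boundary under the old rule only


def scanned_sequence(nodes, current, esc_space_is_boundary):
    """Return the nodes examined for breakpoints when hyphenate_line()
    is triggered by the node at index `current`."""

    def is_boundary(node):
        return node == SPACE or (esc_space_is_boundary and node == ESC_SPACE)

    # Walk backwards from the triggering node until the previous
    # boundary (or the start of the input) is reached.
    start = current
    while start > 0 and not is_boundary(nodes[start - 1]):
        start -= 1
    return nodes[start:current]


nodes = [SPACE, "Eingabe-Kodepunkt", ESC_SPACE, "0xABCD", SPACE]

# Triggered by the trailing space (index 4):
print(scanned_sequence(nodes, 4, esc_space_is_boundary=True))
# ['0xABCD']
print(scanned_sequence(nodes, 4, esc_space_is_boundary=False))
# ['Eingabe-Kodepunkt', '\\ ', '0xABCD']
```

Under the old rule the scan stops at the escape and never sees the `-';
under the new rule it runs back to the preceding space, so the hyphen in
`Eingabe-Kodepunkt' becomes a candidate breakpoint.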
Werner