[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Mission statement and Knuth-Plass reconsidered
From: |
Bertrand Garrigues |
Subject: |
Re: Mission statement and Knuth-Plass reconsidered |
Date: |
Sat, 08 Jul 2023 13:41:44 +0200 |
User-agent: |
Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux) |
Hi Branden,
On mar., mai 23 2023 at 02:15:33 , "G. Branden Robinson"
<g.branden.robinson@gmail.com> wrote:
>> > Myself, I wonder if K-P couldn't be implemented above the formatter
>> > itself, using a diversion. We could then put the implementation in
>> > an auxiliary macro package.
[...]
> I'll note that even if K-P can't be done this way, because the algorithm
> requires a view of more than one page at a time
Indeed, I don't think it's possible to implement Knuth-Plass in a macro
package, it's already very difficult to implement it into tree, but in a
macro package it seems impossible.
> (I, uh, admit I haven't actually learned how Knuth-Plass works yet)
I've previously worked on Knuth-Plass [1], the old 'format_knuth_plass'
branch no longer compiled so I've deleted it and pushed a fresh new
'format_knuth_plass2' fully sync against master.
The best resource is of course Donald E. Knuth's "Digital Topography"
book, chapter "Breaking Paragraph Into Lines". This link [2] is also
very helpful with a visual description of how the algorithm works (In my
previous post I've said that there was a mistake in the demerit
calculation though, but I no longer remember what the problem exactly is).
Going back to my 'format_knuth_plass2': it contains a paragraph.cpp file
that fully implement the Knuth-Plass algorithm, but it is not yet
connected to the troff program.
Nevertheless the paragraph.cpp is fully tested by the 'test_paragraph'
test program. You have to explicitly build it (it is not built by
default):
make test_paragraph
It contains a unit test suite (using the CppUnit framework), if you
launch it without option it will run all the unit tests. If you pass
options '-s 1 -t 1' (suite 1 test 1) it will format the paragraph of the
original Knuth example:
-- Suite 1 Initialisation
-- Test11...
Checking the number of lines
Checking the lines adjustement ratio
Checking all breakpoints demerits
Checking the best breakpoints array
Number of lines: 10 | adj.
ratio | total demerits | fitness class |
In olden times when wishing still helped one, there lived a
0.774 2209 0
^
king whose daughters were all beautiful; and the youngest was
0.179 2213 0
^ ^
so beautiful that the sun itself, which has seen so much, was
0.629 2889 0
^ ^
astonished whenever it shone in her face. Close by the king's
0.545 3178 0
^ ^
castle lay a great dark forest, and under an old lime-tree in the
0.000 3179 0
^ ^ ^ ^
forest was a well, and when the day was very warm, the king's
0.079 3180 0
^ ^ ^ ^ ^ ^
child went out into the forest and sat down by the side of the
0.282 3189 0
^ ^ ^ ^ ^ ^
cool fountain; and when she was bored she took a golden ball,
0.294 3205 0
^ ^ ^ ^ ^ ^
and threw it up on high and caught it; and this ball was her
0.575 3605 0
^ ^ ^ ^ ^ ^ ^
favorite plaything.
0.000 3606 0
^ ^ ^ ^
-- Test test11_original_example PASSED
It shows, for each line, the candidate breakpoints (with a `^' sign) and
the calculation of "adjustment ratio" and "total demerits". The test
checks the obtained result against the values given in Knuth's book.
Note that Knuth later corrected his "demerit" formula, so this examples
uses the old formula but not the other examples.
You can also give to the 'test_paragraph' program any text in a file,
and he will try to format the text in a paragraph. You can adjust the
"tolerance" and the line width. For example if you use the attached
'lorem.txt' and do:
test_paragraph -f lorem.txt
it will first fail:
Processing content of ../../lorem.txt with tolerance:1.000 line length:500
[paragraph.cpp:1072:format_knuth_plass]Could not format paragraph
But you can play on the "tolerance" or the line length with -T and -l,
e.g. pass a tolerance of 2.1:
./test_paragraph -f lorem.txt -T 2.1
Processing content of lorem.txt with tolerance:2.100 line length:500
Number of lines: 14 | adj.
ratio | total demerits | fitness class |
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
2.000 641601 3
^
Sed non risus. Suspendisse lectus tortor, dignissim sit amet,
1.000 651802 2
^ ^
adipiscing nec, ultricies sed, dolor. Cras elementum ultrices
1.192 680702 3
^
diam. Maecenas ligula massa, varius a, semper congue, euismod
0.100 690703 1
^
non, mi. Proin porttitor, orci nec nonummy molestie, enim est
0.242 690707 1
^ ^
eleifend mi, non fermentum diam nisl sit amet erat. Duis semper.
-0.737 692388 0
^ ^
Duis arcu massa, scelerisque vitae, consequat in, pretium a, enim.
-0.833 695869 0
^ ^ ^ ^
Pellentesque congue. Ut in risus volutpat libero pharetra tempor.
-0.733 697469 0
^ ^ ^
Cras vestibulum bibendum augue. Praesent egestas leo in pede.
0.000 697470 1
Praesent blandit odio eu enim. Pellentesque sed dui ut augue
0.633 698146 2
blandit sodales. Vestibulum ante ipsum primis in faucibus orci
0.296 698162 1
luctus et ultrices posuere cubilia Curae; Aliquam nibh. Mauris ac
-0.438 698243 1
mauris sed pede pellentesque fermentum. Maecenas adipiscing
0.667 699204 2
or keep the tolerance to 1.0 and use a line of 850:
./test_paragraph -f lorem.txt -l 850
Processing content of lorem.txt with tolerance:1.000 line length:850
Number of lines: 8
| adj. ratio | total demerits | fitness
class |
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed non risus.
Suspendisse lectus tortor, dignissim -0.423 10081
1
^
sit amet, adipiscing nec, ultricies sed, dolor. Cras elementum ultrices
diam. Maecenas ligula massa, varius 0.365 10117
1
^
a, semper congue, euismod non, mi. Proin porttitor, orci nec nonummy
molestie, enim est eleifend mi, non 0.175 10121
1
^
^ ^
fermentum diam nisl sit amet erat. Duis semper. Duis arcu massa, scelerisque
vitae, consequat in, pretium a, -0.200 10125
1
enim. Pellentesque congue. Ut in risus volutpat libero pharetra tempor. Cras
vestibulum bibendum augue. 0.229 10129
1
Praesent egestas leo in pede. Praesent blandit odio eu enim. Pellentesque
sed dui ut augue blandit sodales. 0.185 10133
1
Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere
cubilia Curae; Aliquam nibh. Mauris -0.071 10134
1
ac mauris sed pede pellentesque fermentum. Maecenas adipiscing ante non diam
sodales hendrerit. 0.000 10135
1
Of course the shorter the line, the harder it is for the algorithm to
find a solution with the given tolerance.
Note that 'test_pragraph' supports only ascii, the width of the various
letter are hard-coded with values from Knuth'book, and there is no
hyphenation support (only for Knuth's original example I simulate that
we may have some words hyphenated).
The main difficulties to connect this work to troff are that the three
big files of troff (input.cpp, env.cpp, node.cpp) are too closely tied
and that the whole line-oriented logic is difficult to adapt
(particularly the diversion is hard to manage). Some preliminary
refactoring would probably be needed. If people are interested I may
try to find some time to work on it.
[1] https://lists.gnu.org/archive/html/groff/2017-11/msg00079.html
[2] https://defoe.sourceforge.net/folio/knuth-plass.html
Regards,
Bertrand
lorem.txt
Description: Text document
- Re: Mission statement and Knuth-Plass reconsidered,
Bertrand Garrigues <=