Re: [Groff] Extracting Groff markup from ghostscript
From: Larry Kollar
Subject: Re: [Groff] Extracting Groff markup from ghostscript
Date: Mon, 21 Jun 2004 20:29:36 -0400
address@hidden wrote:
> Just wondering if there's a way to "reverse engineer" a ghostscript
> file and recreate the groff source file?
You keep pushing my "got to answer this" button. :-) In a nutshell,
it should be possible but nobody has actually gone and done it.
I've actually taken a step or two in that direction, but have a way
to go before I could start producing groff markup.
Start with a copy of the "ps2ascii" script, and remove the string
"-dSIMPLE" from the end of the OPTIONS line. Instead of just
dumping text, ps2ascii-copy then outputs a very simple command
language consisting of the following commands:
    F height width (fontname)    <-- font name literally in ( )
    P                            <-- page break
    S x y (string) width         <-- display text at the specified point
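
As an illustration, here is a minimal Python sketch that reads the three commands listed above into tuples. It is only a guess at a workable parser: the function name parse_line is made up for this example, and it assumes the numeric fields are plain decimal numbers and that the string field runs up to the last closing parenthesis on the line. Real ps2ascii output may escape characters inside the parentheses, which this sketch does not decode.

```python
import re

# Assumed number syntax: optional sign, digits, optional fraction.
NUM = r"(-?\d+(?:\.\d+)?)"

# F height width (fontname)
FONT_RE = re.compile(r"^F\s+" + NUM + r"\s+" + NUM + r"\s+\((.*)\)\s*$")
# S x y (string) width
TEXT_RE = re.compile(r"^S\s+" + NUM + r"\s+" + NUM + r"\s+\((.*)\)\s+" + NUM + r"\s*$")


def parse_line(line):
    """Return a tagged tuple for one command line, or None if unrecognized.

    ("F", height, width, fontname)
    ("P",)
    ("S", x, y, string, width)
    """
    line = line.rstrip("\n")
    if line.strip() == "P":
        return ("P",)
    m = FONT_RE.match(line)
    if m:
        return ("F", float(m.group(1)), float(m.group(2)), m.group(3))
    m = TEXT_RE.match(line)
    if m:
        # The greedy (.*) keeps any parentheses inside the string itself.
        return ("S", float(m.group(1)), float(m.group(2)),
                m.group(3), float(m.group(4)))
    return None
```

Lines that match none of the three patterns come back as None, so a caller can skip or log them rather than fail.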
The first thing you'll notice is that the text is extremely fragmented;
it might even break words apart in some cases. I wrote a simple
awk script to join strings that need joining; it's very easy to parse
this stuff with awk. Another script throws away everything above a
certain point and below another point (to get rid of headers and
footers).
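
The two steps just described can be sketched in Python as follows. This is not the awk code mentioned above, only an illustration under stated assumptions: each fragment is a tuple (x, y, text, width) taken from an S command, fragments on one output line share exactly the same y value, and y grows upward as in PostScript. The names crop and join_fragments, and the space_gap threshold, are invented for the example.

```python
def crop(fragments, y_min, y_max):
    """Drop fragments outside the vertical band [y_min, y_max].

    With y growing upward, headers sit above y_max and footers below y_min.
    """
    return [f for f in fragments if y_min <= f[1] <= y_max]


def join_fragments(fragments, space_gap=1.0):
    """Merge consecutive fragments on the same baseline into whole lines.

    A horizontal gap wider than space_gap (a hypothetical tuning value, in
    the same units as x) is treated as a word space; a smaller gap means the
    fragment continues the current word.
    Returns a list of (x, y, text) tuples, one per joined line.
    """
    lines = []
    for x, y, text, width in fragments:
        if lines and lines[-1]["y"] == y:
            current = lines[-1]
            gap = x - current["end"]
            current["text"] += (" " if gap > space_gap else "") + text
            current["end"] = x + width
        else:
            lines.append({"x": x, "y": y, "text": text, "end": x + width})
    return [(line["x"], line["y"], line["text"]) for line in lines]
```

Cropping first and joining second keeps header and footer text from being merged into body lines. The right value for space_gap depends on the font, so it would need tuning against real output.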
At that point, you can identify paragraph breaks by vertical gaps
and font/size changes by the F command. You'll probably have
to work a little harder to identify headings (NH etc); if the original
output has numbered headings you can work with those. Without
numbered headings, you'll have to key on font and size changes.
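
A rough Python sketch of that idea, for a single page of already-joined lines, might look like the following. Everything specific here is an assumption made for illustration: the function name to_ms, the input shape (y, size, text), the 1.5 x leading threshold for a paragraph gap, and the choice to mark any text larger than the body size as an unnumbered ms heading (.SH) and everything else as a paragraph (.PP). Numbered headings (.NH) would need extra logic to recognize the numbers.

```python
def to_ms(lines, leading, body_size):
    """Emit rough ms markup from (y, size, text) lines, top to bottom.

    A new block starts at the first line, when the font size changes, or
    when the vertical gap to the previous line exceeds 1.5 * leading.
    Blocks set larger than body_size are guessed to be headings.
    """
    out = []
    prev_y = prev_size = None
    for y, size, text in lines:
        new_block = (
            prev_y is None
            or size != prev_size
            or (prev_y - y) > 1.5 * leading  # y decreases down the page
        )
        if new_block:
            out.append(".SH" if size > body_size else ".PP")
        out.append(text)
        prev_y, prev_size = y, size
    return "\n".join(out)
```

For example, a 14-point line followed by 10-point lines, with one gap wider than normal, comes out as a .SH block and two .PP blocks. The sketch does not handle page breaks, and it does not protect text lines that happen to begin with a period, which groff would read as a request.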
Hope that gets you started.
--
Larry Kollar k o l l a r @ a l l t e l . n e t
Unix Text Processing: "UTP Revival"
http://home.alltel.net/kollar/utp/