[Groff] On copying text from PDF files that started with groff
From: Stephen Holland
Subject: [Groff] On copying text from PDF files that started with groff
Date: Wed, 24 Jan 2007 20:18:59 -0600
I have been using groff to programmatically create PDF documents for
my medical practice. In the workflow, a web page controlled with PHP
generates a text file that is processed by groff. The resulting
PostScript file is then run through pstopdf, and I have the PDF I
need. I have been delighted at the ease with which it works.
Recently I was using my Mac's Spotlight search engine to look for
files through keyword searches and found I was having trouble
locating pages. Also, when I copy text from a PDF generated this way,
the text copies with odd spacing.
This behavior seems to be related to the kerning adjustments generated
by groff. Following is an example of the problem.
The following text was part of a document I passed through groff
( === delimits the example text ):
===
Findings: Mild diffuse thickening of the esophagus with linear
furrows in the mid esophagus and circumferential rings in the
proximal esophagus. The mucosal vascular pattern is effaced. Patchy
erythema in the third portion of the duodenum. Biopsies taken in the
duodenum, stomach, and esophagus.
Recommendations: This appears to be eosinophilic esophagitis. Start
Protonix, 40 mg qd. Avoid milk and eggs. Check celiac antibody
panel and vitamin panel.
===
groff obligingly creates a PS file containing:
===
(Findings: Mild diffuse thic)72 416 Q -.24(ke)
-.24 G(ning of the esophagus with linear furro).24 E
(ws in the mid esopha-)-.18 E(gus and circumf)72 430 Q(erential r)-.36 E
(ings in the pro).18 E(ximal esophagus)-.36 E 6.672(.T)-.18 G(he m)
-6.672 E(ucosal v)-.12 E(ascular patter)-.3 E(n).3 E(is eff)72 444 Q
3.336(aced. P)-.36 F(atch)-.48 E 3.336(ye)-.36 G .36(ry)-3.336 G
(thema in the third por)-.36 E(tion of the duoden).48 E 3.336
(um. Biopsies)-.12 F(tak)3.336 E(en in the)-.24 E(duoden)72 458 Q
(um, stomach, and esophagus)-.12 E(.)-.18 E
(Recommendations: This appears to be eosinophilic esophagitis)72 486 Q
6.672(.S)-.18 G -2.856(tar t)-6.672 F(Protonix, 40 mg)3.336 E 3.336
(qd. A)72 500 R -.3(vo)-.48 G(id milk and eggs).3 E 6.672(.C)-.18 G(hec)
-6.672 E 3.336(kc)-.24 G(eliac antibody panel and vitamin panel.)-3.336
===
and when run through pstopdf a PDF appears. When copying out the
paragraph above one gets:
===
Findings: Mild diffuse thickening of the esophagus with linear
furrows in the mid esopha-
gus and circumferential rings in the proximal esophagus. The mucosal
vascular pattern
is effaced. Patchyerythema in the third portion of the duodenum.
Biopsiestaken in the
duodenum, stomach, and esophagus.
Recommendations: This appears to be eosinophilic esophagitis. Star
tProtonix, 40 mg
qd. Avoid milk and eggs. Checkceliac antibody panel and vitamin panel.
===
Note that the words 'Patchy erythema' and 'Biopsies taken' are run
together. The words 'Start Protonix' are morphed to 'Star tProtonix'.
When checking what the text parser for the Mac sees, the problems are
repeated. mdfind, the import process for Mac OS X, finds the following
words:
===
Findings: Mild diffuse thickening of the esophagus with linear
furrows in the mid esopha- gus and circumferential rings in the
proximal esophagus. The mucosal vascular pattern is effaced.
Patchyerythema in the third portion of the duodenum. Biopsiestaken in
the duodenum, stomach, and esophagus. Recommendations: This appears
to be eosinophilic esophagitis. Star tProtonix, 40 mg qd. Avoid milk
and eggs. Checkceliac antibody panel and vitamin panel.
===
This is a problem because the indexing program is not getting correct
input, so the index of my files misses that this patient document
should be found with the term 'celiac'. It will find the document
only with the term 'checkceliac', as a single word.
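To make the indexing failure concrete, here is a minimal Python sketch.
It assumes a simple lowercase word tokenizer; that tokenizer (the
tokenize function below) is my own illustration and not how Spotlight
actually builds its index. The input is the tail of the extracted text
shown above.

```python
import re

# Tail of the text as extracted from the PDF (copied from the output above).
extracted = (
    "Star tProtonix, 40 mg qd. Avoid milk "
    "and eggs. Checkceliac antibody panel and vitamin panel."
)

def tokenize(text):
    """Hypothetical tokenizer: lowercase the text, keep runs of letters."""
    return set(re.findall(r"[a-z]+", text.lower()))

index_terms = tokenize(extracted)

# The real word is missing from the index; only the fused word is present.
print("celiac" in index_terms)
print("checkceliac" in index_terms)
```

With this kind of tokenizer, a search for 'celiac' has nothing to match,
which is the behavior I am seeing.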
Looking at the PostScript, it is evident that the fragment
(.C)-.18 G(hec) -6.672 E 3.336(kc)-.24 G(eliac antibody panel and
vitamin panel.)-3.336
is causing problems for several later processes.
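As a rough illustration of why, the following Python sketch pulls the
parenthesized string operands out of that fragment and joins them. It
assumes an extractor that simply concatenates the strings and ignores
the numeric positioning operands; that is a simplification of my own,
not a description of how pstopdf or Spotlight actually work.

```python
import re

# The PostScript fragment quoted above, joined onto one line.
fragment = (
    "(.C)-.18 G(hec) -6.672 E 3.336(kc)-.24 G"
    "(eliac antibody panel and vitamin panel.)-3.336"
)

# Collect only the text inside parentheses, dropping numbers and operators.
strings = re.findall(r"\(([^()]*)\)", fragment)

# A naive extractor that concatenates the strings sees no space
# between 'Check' and 'celiac', because none is present in '(kc)'.
joined = "".join(strings)
print(strings)
print(joined)
```

The string '(kc)' holds the last letter of one word and the first
letter of the next with no space character between them, so any tool
that reads only the strings has nothing to split the words on.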
I was surprised to see the strings in the PostScript output as they
are. So, with all that, is there an option to get groff to stop
adjusting within words? For my needs, adjusting whitespace size is all
that is needed. Or should all this be referred to a grops mailing list?
Steve Holland