Re: Problems with .PDFPIC caused by pdfinfo
From: Deri
Subject: Re: Problems with .PDFPIC caused by pdfinfo
Date: Tue, 12 Oct 2021 16:12:56 +0100
On Tuesday, 12 October 2021 11:49:23 BST Keith Marshall wrote:
> Ref: https://savannah.gnu.org/bugs/index.php?55107
>
> On 01/10/2021 01:10, Deri wrote:
> > I did try to help Keith with this previously, but I was mildly "told
> > off" (on list) for sending my help off list. I've learned my lesson.
>
> Thanks, Deri.
>
> IIRC, the reason for the "mild telling off" was that, by replying off
> list, you denied us the potential benefit from other list members who
> may have been willing to review the issue, and so contribute to the
> debugging effort. I am pleased that, on this occasion, you have kept
> this on-list; even if the majority of list members aren't sufficiently
> interested to assist, there may be some who will, and any assistance
> will be gratefully accepted, and very much appreciated.
>
Hi Keith,
I just assumed the best person for debugging faults in the code would probably
be you rather than the rest of us. You may receive other "problem pdfs" from
other members, but the debugging effort is likely to be yours alone.
What I did find useful while debugging the pdf parser in pdfbb/gropdf was the
Ghent PDF Output Suite (which has some very esoteric examples; sorry, it is
144 MB!), see:
http://gwg.org/gos5/
> > I attach a couple of pdfs with which the current code has problems.
> >
> > Picture.pdf
> >
> > [derij@pip groff-psbb]$ ./psbb ../../Picture.pdf
> > ../../Picture.pdf: bounding box = (0,0)..(0,0)
>
> This is caused by the nested /Group dictionary, within the /Page object;
> the current groff-psbb lexer is confused by it, and ends up in the wrong
> state, when it eventually encounters the /MediaBox key. Adding one more
> rule (for "<<") to the PDF dictionary state scanning model gets us to:
>
> $ ./psbb Picture.pdf
> Picture.pdf: bounding box = (0,0)..(592,842)
>
> > [derij@pip groff-psbb]$ pdfbb ../../Picture.pdf
> > Processing '../../Picture.pdf'
> > ../../Picture.pdf: CropBox: 162.085,623.346,340.825,716.546 (178.74,93.2)
>
> The psbb lexer doesn't handle the /CropBox key. Should it? Should
> /CropBox override any extant /MediaBox?
If you view Picture.pdf with a pdf viewer you will see a dumbbell shape. This
is in fact the area of the A4 page described by the CropBox, not the complete
A4 page described by the MediaBox. If the MediaBox dimensions were given to
PDFPIC the included picture would be the wrong shape. Current gropdf honours
the various "boxes" in this order:
ArtBox TrimBox BleedBox CropBox MediaBox
(No idea if this is "correct", but the viewers I have tested definitely
prioritise CropBox over MediaBox; you will have to experiment.)
You would also have to be careful: a MediaBox at the group level could be
overridden by a CropBox at the page level, I assume.
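To make the ordering above concrete, here is a minimal Python sketch of the
selection logic. It is only an illustration, not the gropdf code: the
dictionary layout and the names effective_box and box_size are hypothetical,
and the A4 MediaBox in the usage example is an assumed value. It also assumes
that page-level entries take precedence over inherited ones.

```python
# Box priority as listed above for current gropdf (highest priority first).
BOX_PRIORITY = ("ArtBox", "TrimBox", "BleedBox", "CropBox", "MediaBox")


def effective_box(page_boxes, inherited_boxes=None):
    """Return (name, [llx, lly, urx, ury]) for the highest-priority box.

    page_boxes and inherited_boxes are hypothetical plain dicts mapping a
    box name to its four coordinates. Entries found on the page itself
    replace entries of the same name inherited from a parent node.
    Returns None when no box is present at all.
    """
    merged = dict(inherited_boxes or {})
    merged.update(page_boxes)
    for name in BOX_PRIORITY:
        if name in merged:
            return name, merged[name]
    return None


def box_size(box):
    """Return (width, height) of a [llx, lly, urx, ury] box."""
    llx, lly, urx, ury = box
    return urx - llx, ury - lly


if __name__ == "__main__":
    # CropBox values as reported by pdfbb for Picture.pdf; the MediaBox is
    # an assumed A4 box inherited from a parent node.
    page = {"CropBox": [162.085, 623.346, 340.825, 716.546]}
    parent = {"MediaBox": [0, 0, 595, 842]}
    name, box = effective_box(page, parent)
    print(name, box_size(box))  # selects the CropBox, not the MediaBox
```

With this ordering a page-level CropBox wins over a MediaBox found higher up,
which matches what the viewers I tested appear to do.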
> > croptest.pdf
> >
> > [derij@pip groff-psbb]$ ./psbb ../../croptest.pdf
> > psbb:t-psbb (t-psbb.cpp):193: PDF file '../../croptest.pdf' is
> > malformed; no trailer found
>
> Since croptest.pdf lacks both a trailer dictionary, and a free-standing
> cross reference table, (both are hidden away within a /XRefStm object,
> with a compressed cross reference table), croptest.pdf is _incompatible_
> with applications which do not support this feature of PDF-1.5 (and
> later). The groff-psbb prototype implementation (currently) does not
> offer this level of PDF-1.5 support; thus, this behaviour is expected.
Gropdf/pdfbb now supports import of these later pdf versions (as does pdfinfo
which PDFPIC currently uses) so it is important that whatever method is used
to report the image dimensions back to PDFPIC is consistent with what a user
would see when viewing the pdf in a viewer.
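As a rough way to tell in advance which kind of file you are dealing with,
the following Python sketch classifies a pdf by the markers it contains. It is
a heuristic under stated assumptions, not how pdfbb or psbb work: the function
name classify_xref_style is hypothetical, and a plain byte search can be fooled
by uncompressed stream data which happens to contain the same text.

```python
import re

# A classic trailer: the keyword "trailer" at the start of a line,
# followed by the opening of a dictionary.
_TRAILER_RE = re.compile(rb"(?:^|[\r\n])trailer\s*<<")

# A PDF-1.5 cross reference stream is an object whose dictionary has
# /Type /XRef. The \b stops this from matching the /XRefStm key.
_XREF_STREAM_RE = re.compile(rb"/Type\s*/XRef\b")


def classify_xref_style(data: bytes) -> str:
    """Guess how a pdf stores its cross reference information.

    Returns "classic" (trailer dictionary only), "xref-stream"
    (cross reference stream only), "hybrid" (both markers present)
    or "unknown" (neither marker found).
    """
    has_trailer = _TRAILER_RE.search(data) is not None
    has_xref_stream = _XREF_STREAM_RE.search(data) is not None
    if has_trailer and has_xref_stream:
        return "hybrid"
    if has_trailer:
        return "classic"
    if has_xref_stream:
        return "xref-stream"
    return "unknown"


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(f"{path}: {classify_xref_style(handle.read())}")
```

A file reported as "xref-stream" is the kind for which a parser without
PDF-1.5 support would be expected to complain that no trailer was found.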
> > [derij@pip groff-psbb]$ pdfbb ../../croptest.pdf
> > Processing '../../croptest.pdf'
> > ../../croptest.pdf: MediaBox: 0,0,595,842 (595,842)
>
> Well, this agrees with the result I've shown above, for Picture.pdf,
croptest.pdf is an A4 page written as a PDF 1.7 file, but the image it includes
(three times) is the CropBox area of Picture.pdf. So the dimensions reported by
pdfbb are correct (it is an A4 page), but not because Picture.pdf is wrongly
reported as A4 by psbb.
I have attached a new version called croptest-2.pdf, which psbb successfully
reports as A4 (because this time it is written in PDF 1.4), and which shows that
groff can embed a PDF 1.7 image (croptest.pdf) which itself contains three PDF
1.5 images (Picture.pdf). I also enclose the troff files which created the two
pdfs; they show that you don't need to use PDFPIC if you are concerned about
using unsafe mode in groff. The only thing which PDFPIC does is calculate the
vertical movement needed after the call to \X'pdf: pdfpic' to continue output
after the image, which is fairly easy to do manually given the information
from pdfinfo.
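The manual calculation amounts to scaling the box height by the same factor
as the width. A minimal Python sketch of that arithmetic follows. It is an
illustration only: the names scaled_height and sp_request are hypothetical,
the box values are the CropBox reported for Picture.pdf earlier in this
thread, and I am assuming the image is scaled uniformly to a chosen width
given in points.

```python
def scaled_height(box, target_width):
    """Height which preserves the aspect ratio of a [llx, lly, urx, ury] box.

    The result is in the same units as target_width.
    """
    llx, lly, urx, ury = box
    width, height = urx - llx, ury - lly
    if width <= 0 or height <= 0:
        raise ValueError("box has no area")
    return target_width * height / width


def sp_request(height_pt):
    """Format a groff .sp request moving down by height_pt points."""
    return f".sp {height_pt:.2f}p"


if __name__ == "__main__":
    crop_box = [162.085, 623.346, 340.825, 716.546]
    # Natural size first, then the same image scaled to twice the width.
    for width in (178.74, 357.48):
        print(width, sp_request(scaled_height(crop_box, width)))
```

The resulting request is what you would place after the image in order to
continue output below it.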
> with groff-psbb modified to properly handle nested dictionaries; some
> further (non-trivial) development effort will be required, to support
> concealment of trailer dictionaries and cross reference tables within
> /XRefStm objects.
There are several options which would address this problem, i.e. the
non-portability of grep and the desirability of avoiding groff unsafe mode.
A) Replace grep with sed/awk (still requires unsafe mode).
B) Use psbb (requires "non-trivial development").
C) Use pdfbb (requires a hook in input.cpp to call pdfbb and return results).
D) Convert pdfbb to be a pre-gropdf (i.e. a preprocessor like pre-grohtml)
which would look for .PDFPIC and replace it with the appropriate calls to
\X'pdf: pdfpic' and add vertical space with .sp.
(A) is obviously the easiest and quickest; (C) and (D) are not too much work,
since the parser required is already in use.
Cheers
Deri
Attachment: croptest-2.pdf (Adobe PDF document)
Attachment: croptest-2.trf (Text document)
Attachment: croptest.trf (Text document)