[poppler] page sequence + page spreads in PDF

Leonard Rosenthol lrosenth at adobe.com
Wed Feb 2 17:52:13 PST 2011


Wow! The people producing these PDFs need a SERIOUS lesson in proper PDF production. These files are wasting A LOT of space because of how they were built.

There is no magic here - just stupidity.

Each of the spreads is DUPLICATED in the PDF, and each copy is then cropped (CropBox != MediaBox) to the right or left accordingly. That's why each one renders as a single page: the CropBox is what defines the viewable area.
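
To make the cropping concrete, here is a minimal sketch of how a CropBox selects one half of a spread. The 1214x828 MediaBox comes from the identify output quoted below; the two CropBox rectangles, the names Box and visible_half, and the idea that the front cover sits on the right are assumptions for illustration, not values read from these files.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A PDF rectangle in points: lower-left (x0, y0), upper-right (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0


def visible_half(media: Box, crop: Box) -> str:
    """Report which part of the MediaBox the CropBox exposes."""
    if crop == media:
        return "full"
    crop_centre = (crop.x0 + crop.x1) / 2
    media_centre = (media.x0 + media.x1) / 2
    return "left" if crop_centre < media_centre else "right"


# MediaBox size taken from the identify output for the cover spread.
media = Box(0, 0, 1214, 828)
# Hypothetical CropBoxes: two copies of the same spread, each cropped
# to a 603-point-wide page (the width identify reports for single pages).
back = Box(0, 0, 603, 828)
front = Box(611, 0, 1214, 828)

# Whatever lies between the two cropped pages is never shown by a viewer.
spine = media.width - back.width - front.width

print(visible_half(media, back), visible_half(media, front), spine)
```

Under those assumed numbers the strip left over between the two crops would be the spine, which is why a viewer that honors the CropBox never shows it.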

Apparently the commands you are using with ImageMagick aren't respecting that CropBox. Start by making sure you are current with both ImageMagick and Ghostscript.
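
If updating alone does not help, one option is to ask the renderer for the CropBox explicitly. The sketch below only builds the command lines; it assumes a Ghostscript build that supports -dUseCropBox and a poppler pdftoppm that supports -cropbox and -jpeg, and the output names and the 150 dpi default are placeholders. I have not run it against these files.

```python
import shutil
import subprocess
from typing import List


def ghostscript_cmd(pdf: str, out_pattern: str, dpi: int = 150) -> List[str]:
    """Ghostscript JPEG rendering that clips each page to its CropBox."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-dUseCropBox",              # render the CropBox, not the MediaBox
        "-sDEVICE=jpeg", f"-r{dpi}",
        f"-sOutputFile={out_pattern}",
        pdf,
    ]


def pdftoppm_cmd(pdf: str, out_prefix: str, dpi: int = 150) -> List[str]:
    """Equivalent rendering with poppler's pdftoppm."""
    return ["pdftoppm", "-cropbox", "-jpeg", "-r", str(dpi), pdf, out_prefix]


def run_if_available(cmd: List[str]) -> bool:
    """Run the command only when its executable is on PATH."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


gs_cmd = ghostscript_cmd("TLN200806.pdf", "page-%03d.jpg")
ppm_cmd = pdftoppm_cmd("TLN200806.pdf", "page")
print(" ".join(gs_cmd))
print(" ".join(ppm_cmd))
```

Calling run_if_available(gs_cmd) would then do the actual rendering. If the cropped output shows one page per image with no spine, the CropBox was the issue.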

Leonard

-----Original Message-----
From: poppler-bounces+leonardr=adobe.com at lists.freedesktop.org [mailto:poppler-bounces+leonardr=adobe.com at lists.freedesktop.org] On Behalf Of Michael Howard
Sent: Wednesday, February 02, 2011 1:08 PM
To: poppler at lists.freedesktop.org
Subject: [poppler] page sequence + page spreads in PDF

My questions are intended for the poppler / PDF gurus.

They aren't really poppler questions, but relate to the sequence of
pages and page spreads in PDF files.

I have googled and have read through the PDF reference 1.4, but
haven't found anything to answer my questions.



BACKGROUND

I have a relatively large set of PDF files for magazines. These are
PDF files that were sent to the print shop for printing.

We want to extract .jpg images of the pages and the text on the pages.
I am using ImageMagick convert (Ghostscript) to generate the images and
poppler pdftotext to extract the text.

Most of the pages of the magazines are (more-or-less) 8.5x11 portrait
pages. However, some pages are 11x17 landscape "spreads" of two facing
pages.

In some cases, the outside covers and inside covers are in 2-page
"spreads". That is, the front cover + spine + rear cover are all on a
single PDF page. This is understandable since this is the way that the
paper magazines were printed.


SAMPLE FILES

In the file

  http://cdn.uforlife.com/public/TLN200806.pdf

the first two "pages" are spreads of the outside covers and inside covers

[mth at localhost ~]$ identify TLN200806.pdf | head
TLN200806.pdf[0] PDF 1214x828 1214x828+0+0 16-bit Bilevel DirectClass 4.03MB 0.220u 0:00.210
TLN200806.pdf[1] PDF 1214x828 1214x828+0+0 16-bit Bilevel DirectClass 4.03MB 0.210u 0:00.210
TLN200806.pdf[2] PDF 603x828 603x828+0+0 16-bit Bilevel DirectClass 4.03MB 0.200u 0:00.199
TLN200806.pdf[3] PDF 603x828 603x828+0+0 16-bit Bilevel DirectClass 4.03MB 0.200u 0:00.199


In the file

 http://cdn.uforlife.com/public/TLN200812.pdf

only the first "page" is a spread of the outside covers

[mth at localhost ~]$ identify TLN200812.pdf | head
TLN200812.pdf[0] PDF 1214x828 1214x828+0+0 16-bit Bilevel DirectClass 4.657MB 0.250u 0:00.250
TLN200812.pdf[1] PDF 603x828 603x828+0+0 16-bit Bilevel DirectClass 4.657MB 0.240u 0:00.240
TLN200812.pdf[2] PDF 603x828 603x828+0+0 16-bit Bilevel DirectClass 4.657MB 0.240u 0:00.240
TLN200812.pdf[3] PDF 603x828 603x828+0+0 16-bit Bilevel DirectClass 4.657MB 0.240u 0:00.240
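
For a large set of files, picking out the spread pages by eye gets tedious. As a rough sketch, the identify output above could be scanned for landscape pages like this. The regular expression and the name find_spreads are my own, and treating "wider than tall" as "spread" is an assumption that happens to fit these magazines.

```python
import re
from typing import List

# Matches the "[index] PDF WIDTHxHEIGHT" part of each identify line.
PAGE_RE = re.compile(r"\[(\d+)\]\s+PDF\s+(\d+)x(\d+)")


def find_spreads(identify_output: str) -> List[int]:
    """Return the indices of pages that are wider than they are tall."""
    spreads = []
    for match in PAGE_RE.finditer(identify_output):
        index, width, height = (int(g) for g in match.groups())
        if width > height:
            spreads.append(index)
    return spreads


# First four pages of TLN200806.pdf, as reported by identify.
SAMPLE = """\
TLN200806.pdf[0] PDF 1214x828 1214x828+0+0 16-bit Bilevel DirectClass
TLN200806.pdf[1] PDF 1214x828 1214x828+0+0 16-bit Bilevel DirectClass
TLN200806.pdf[2] PDF 603x828 603x828+0+0 16-bit Bilevel DirectClass
TLN200806.pdf[3] PDF 603x828 603x828+0+0 16-bit Bilevel DirectClass
"""

print(find_spreads(SAMPLE))
```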



COVER SPREAD / PAGE SEQUENCE QUESTION

Given files with the back cover & front cover on facing spreads, I
have observed that both Acrobat Reader and Evince properly split the
spreads at the beginning and end. So, when one looks in these viewers
one properly sees the outside front cover at the beginning and the
outside back cover at the end.

Note that in the case of http://cdn.uforlife.com/public/TLN200806.pdf
the inside covers are also properly split.

I need to do a similar thing in order to properly generate
correctly sequenced .jpg images of the pages ...

Q: What attribute / tag / characteristic is in the .pdf file that
tells a renderer to split the first page into two pages and insert
them at different places in the sequence?


The images in the spreads contain the "spine" of the magazine too. I
can see this if I use ImageMagick convert to generate the .jpg images.
Yet, this is not shown in Evince or Acrobat Reader ...

Q: What attribute enables the "spine" of the book to be cut out and
displayed as part of neither the front cover nor the back cover?



EMBEDDED SPREAD QUESTION

In both of these sample files there are embedded two-page spreads. In
the printed book these spreads span two facing pages. In the file
http://cdn.uforlife.com/public/TLN200806.pdf they are displayed in
Evince & Acrobat Reader on pages 20 & 37.

Note that in this case neither Evince nor Acrobat Reader recognizes
that these are two-page spreads. Rather, both viewers treat these
spreads as a single page. Even in Dual / Two Page view, both programs
show another page alongside.

Given that the covers & inside covers are handled 'correctly' I am
somewhat surprised that the embedded pages are not handled a little
better ...

Q: Why are these facing-page "spreads" not identified as two
individual pages by Evince / Acrobat Reader?



Any other advice from the poppler / PDF gurus would be greatly appreciated.


Thanks,
Michael
_______________________________________________
poppler mailing list
poppler at lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/poppler

