[poppler] Extract title from pdf file.

Alec Taylor alec.taylor6 at gmail.com
Thu Nov 10 14:50:19 PST 2011


On Fri, Nov 11, 2011 at 9:42 AM, Leonard Rosenthol <lrosenth at adobe.com> wrote:
> Albert was looking in the wrong place :).
> Check for either the MarkInfo and/or StructTreeRoot key in the Catalog.
> Logical Structure was introduced in PDF 1.3 and Tagged PDF in 1.4 – so these
> features aren't all that new.

Yeah, I knew I hadn't read it wrong. Logical structure certainly was
introduced in 1.3 :)
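
For anyone who wants to run that check over a pile of documents, here is a
minimal sketch in Python. It assumes the Catalog has already been parsed
into a plain mapping by whatever PDF library is at hand, with name keys
written as strings with a leading slash. The function name
structure_status and the four labels it returns are made up for
illustration; they are not part of poppler or of any PDF library.

```python
from typing import Any, Mapping


def structure_status(catalog: Mapping[str, Any]) -> str:
    """Classify a parsed Catalog by the structure keys it carries.

    Hypothetical helper: `catalog` is assumed to be a dict-like view of
    the document Catalog, e.g. {"/Type": "/Catalog", "/MarkInfo": {...}}.
    """
    # /StructTreeRoot is the entry point of the logical structure tree.
    has_tree = "/StructTreeRoot" in catalog

    # /MarkInfo is a dictionary; its /Marked flag defaults to false and is
    # true when the producer claims Tagged PDF conventions.
    mark_info = catalog.get("/MarkInfo")
    marked = isinstance(mark_info, Mapping) and bool(mark_info.get("/Marked", False))

    if has_tree and marked:
        return "tagged"
    if has_tree:
        return "structured"
    if marked:
        return "marked-only"
    return "unstructured"
```

Tallying the returned labels over a corpus (for instance with
collections.Counter) would give a rough count of how many documents carry
real structure and how many would need heuristics.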

> They are generated by numerous PDF producers including (but not limited to)
> Adobe Acrobat, MS Office 2007 and later, OpenOffice, pdfTeX, etc.  These
> features are required in various international standards such as PDF/A-1a
> and PDF/A-2a as well as the new PDF/UA.
> I wish they all used it too… Unfortunately, many less capable PDF producers
> don't support it.

Not to mention the billions of PDFs available already online, which
don't adhere to the newer standards. My goal is to automatically
impose logical structures onto book PDF documents.

> Leonard
> From: Josh Richardson <jric at chegg.com>
> Date: Thu, 10 Nov 2011 14:28:10 -0800
> To: Leonard Rosenthol <lrosenth at adobe.com>, Alec Taylor
> <alec.taylor6 at gmail.com>
> Cc: Albert Cid <aacid at kde.org>, "Albert at freedesktop.org"
> <Albert at freedesktop.org>, "poppler at lists.freedesktop.org"
> <poppler at lists.freedesktop.org>
> Subject: Re: [poppler] Extract title from pdf file.
>
> Leonard, I don't understand.  You say Alec is "missing HUGE PIECES of
> functionality found in the majority of real-world documents", but Albert
> says he has 1200 documents and none of them has markings.  So, which is it,
> or what is it that Alec's missing?
> I've got access to more than 10k PDFs, published in the past year or two,
> which I'd be happy to check, if you can tell me how.  I'd be curious to know
> how many of them are taking advantage of these newer PDF features, and I'd
> LOVE it if they all were.  Sadly, my guess is that it's close to zero. :-(
> --josh
> From: Leonard Rosenthol <lrosenth at adobe.com>
> Date: Thu, 10 Nov 2011 14:15:28 -0800
> To: Alec Taylor <alec.taylor6 at gmail.com>
> Cc: Cid <aacid at kde.org>, "Albert at freedesktop.org" <Albert at freedesktop.org>,
> "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>
> Subject: Re: [poppler] Extract title from pdf file.
>
> I am sorry to be pedantic, but this is EXTREMELY IMPORTANT…
> What you are doing is adding HEURISTICS into Poppler to GUESS at the logical
> structure of a PDF.  You are NOT actually taking into account any REAL LIVE
> logical structure that was put there by the PDF producer.
> PDF 1.3 is about 15 YEARS OLD.  NUMEROUS ADVANCES have been made to the
> format.  PDF is currently at 1.7, as standardized by the ISO and adopted as
> national standards by almost 50 countries around the world.  Version 2.0
> (ISO 32000-2) is almost complete!  To work only with 1.3 is, honestly, a
> waste.  You are missing HUGE PIECES of functionality found in the majority
> of real-world documents.
> I am sure your code is wonderful.  However, given that it is based on 1.3
> and does not recognize existing PDF structure, it seems SEVERELY limited in
> real world use.
> Leonard
> From: Alec Taylor <alec.taylor6 at gmail.com>
> Date: Thu, 10 Nov 2011 13:57:54 -0800
> To: Leonard Rosenthol <lrosenth at adobe.com>
> Cc: "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>, Albert
> Cid <aacid at kde.org>
> Subject: Re: [poppler] Extract title from pdf file.
>
> As was previously mentioned, I am adding the semantic and logical
> structuring into poppler core.
>
> My plan is to figure out what fits into which category by post processing
> the XML. Any suggestions on how to reverse [or post?!] engineer this XML
> back into the PDF would be appreciated.
>
> In a few days I will have very accurate XML generated with
> <header></header>, <footer></footer> and table of contents tags.
>
> This will involve "pushing" the actual "printed" page numbers, adding a
> hyperlink to each ToC entry, and partitioning the page structure as far as
> the 1.3 standard allows.
>
> My code is extremely modular, neat & efficient, and includes an OO API, so
> it should be easily extendable with author, title, publisher, year and
> section title extraction capabilities.

