[poppler] Extract title from pdf file.

Josh Richardson jric at chegg.com
Thu Nov 10 14:28:10 PST 2011


Leonard, I don't understand.  You say Alec is "missing HUGE PIECES of functionality found in the majority of real-world documents", but Albert says he has 1200 documents and none of them has markings.  So, which is it, or what is it that Alec's missing?

I've got access to more than 10k PDFs, published in the past year or two, which I'd be happy to check, if you can tell me how.  I'd be curious to know how many of them are taking advantage of these newer PDF features, and I'd LOVE it if they all were.  Sadly, my guess is that it's close to zero. :-(
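
For what it's worth, here is a rough sketch of how such a survey might be run, assuming that "markings" means the `/MarkInfo` and `/StructTreeRoot` entries that a tagged PDF carries in its document catalog. The function names (`survey_pdf_bytes`, `survey_directory`) are hypothetical, and this is only a crude byte scan, not a real parser: it will miss files whose catalog is stored inside a compressed object stream (possible from PDF 1.5 on), and it reads the version from the file header only, so treat the counts as a lower bound rather than a definitive answer.

```python
import re
from pathlib import Path

# Header looks like "%PDF-1.4"; the catalog may override it, which this scan ignores.
HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")


def survey_pdf_bytes(data: bytes) -> dict:
    """Crude, parser-free check for logical-structure markers in raw PDF bytes.

    Assumption: the keys appear uncompressed in the file. Catalogs stored in
    compressed object streams will not be detected.
    """
    match = HEADER_RE.search(data[:1024])
    return {
        "version": match.group(1).decode("ascii") if match else None,
        "struct_tree": b"/StructTreeRoot" in data,
        "mark_info": b"/MarkInfo" in data,
    }


def survey_directory(root: str) -> dict:
    """Count how many *.pdf files under `root` show each marker."""
    counts = {"files": 0, "struct_tree": 0, "mark_info": 0}
    for path in Path(root).rglob("*.pdf"):
        info = survey_pdf_bytes(path.read_bytes())
        counts["files"] += 1
        counts["struct_tree"] += info["struct_tree"]  # True counts as 1
        counts["mark_info"] += info["mark_info"]
    return counts
```

Calling `survey_directory("/path/to/pdfs")` on a collection would give a first approximation of how many files carry structure information; a proper PDF library that resolves the catalog would be needed for an exact figure.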

--josh

From: Leonard Rosenthol <lrosenth at adobe.com>
Date: Thu, 10 Nov 2011 14:15:28 -0800
To: Alec Taylor <alec.taylor6 at gmail.com>
Cc: Cid <aacid at kde.org>, "Albert at freedesktop.org" <Albert at freedesktop.org>, "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>
Subject: Re: [poppler] Extract title from pdf file.

I am sorry to be pedantic, but this is EXTREMELY IMPORTANT…

What you are doing is adding HEURISTICS into Poppler to GUESS at the logical structure of a PDF.  You are NOT actually taking into account any REAL LIVE logical structure that was put there by the PDF producer.

PDF 1.3 is about 15 YEARS OLD.  NUMEROUS ADVANCES have been made to the format.  PDF is currently at 1.7, as standardized by the ISO and adopted as national standards by almost 50 countries around the world.  Version 2.0 (ISO 32000-2) is almost complete!  To work only with 1.3 is, honestly, a waste.  You are missing HUGE PIECES of functionality found in the majority of real-world documents.

I am sure your code is wonderful.  However, given that it is based on 1.3 and does not recognize existing PDF structure, it seems SEVERELY limited in real world use.

Leonard

From: Alec Taylor <alec.taylor6 at gmail.com>
Date: Thu, 10 Nov 2011 13:57:54 -0800
To: Leonard Rosenthol <lrosenth at adobe.com>
Cc: "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>, Albert Cid <aacid at kde.org>
Subject: Re: [poppler] Extract title from pdf file.


As was previously mentioned, I am adding the semantic and logical structuring into poppler core.

My plan is to figure out what fits into which category by post processing the XML. Any suggestions on how to reverse [or post?!] engineer this XML back into the PDF would be appreciated.

In a few days I will have a very accurate XML generated with <header></header>, <footer></footer> and table of contents tags.

This will involve the "pushing" of the actual "printed" page numbers, adding a hyperlink to each ToC entry, and partitioning the page structure as far as the 1.3 standard allows.

My code is extremely modular, neat & efficient, and includes an OO API. So it should be easily extendable with author, title, publisher, year and section title extraction capabilities.