[poppler] Extract title from pdf file.
alec.taylor6 at gmail.com
Thu Nov 10 14:25:59 PST 2011
Poppler doesn't fully support PDF 1.7.
Perhaps 1.3 was an understatement.
I will add the aforementioned heuristics (I don't know my own accuracy
yet, but the kind of algorithms I am implementing achieve >98% accuracy),
using whatever assistance poppler provides and adhering to the latest
standard poppler supports.
I would appreciate any help you (or anyone else) can give with pushing
what I have separated into XML tags back into the PDF.
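
To make the discussion concrete, here is a minimal sketch of the kind of
heuristic tagging I mean. It is NOT my actual code and uses no poppler
API: it assumes the page text has already been extracted into a list of
lines per page, and the names (tag_pages, min_share, the <document>,
<page> and <line> tags) are made up for illustration. It only treats a
first or last line as a header or footer when the same text repeats
verbatim across pages, so footers that carry a changing page number
would need extra normalisation.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tag_pages(pages, min_share=0.6):
    """Guess headers/footers from repetition; pages is a list of line lists.

    Hypothetical helper: a line counts as a header (or footer) when it is
    the first (or last) line on at least `min_share` of the pages, and on
    no fewer than two pages.
    """
    # Count how often each first and last line recurs across pages.
    first = Counter(p[0] for p in pages if p)
    last = Counter(p[-1] for p in pages if p)
    threshold = max(2, int(len(pages) * min_share))

    root = ET.Element("document")
    for number, lines in enumerate(pages, start=1):
        page = ET.SubElement(root, "page", number=str(number))
        body = list(lines)
        if body and first[body[0]] >= threshold:
            ET.SubElement(page, "header").text = body.pop(0)
        footer = None
        if body and last[body[-1]] >= threshold:
            footer = body.pop()
        # Everything that is neither header nor footer stays as body text.
        for line in body:
            ET.SubElement(page, "line").text = line
        if footer is not None:
            ET.SubElement(page, "footer").text = footer
    return root
```

Calling ET.tostring(tag_pages(pages), encoding="unicode") gives the XML
as a string. This is guessing from layout, not reading any structure the
PDF producer may have stored in the file.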
On Fri, Nov 11, 2011 at 9:15 AM, Leonard Rosenthol <lrosenth at adobe.com> wrote:
> I am sorry to be pedantic, but this is EXTREMELY IMPORTANT…
> What you are doing is adding HEURISTICS into Poppler to GUESS at the logical
> structure of a PDF. You are NOT actually taking into account any REAL LIVE
> logical structure that was put there by the PDF producer.
> PDF 1.3 is about 15 YEARS OLD. NUMEROUS ADVANCES have been made to the
> format. PDF is currently at 1.7, as standardized by the ISO and adopted as
> national standards by almost 50 countries around the world. Version 2.0
> (ISO 32000-2) is almost complete! To work only with 1.3 is, honestly, a
> waste. You are missing HUGE PIECES of functionality found in the majority
> of real-world documents.
> I am sure your code is wonderful. However, given that it is based on 1.3
> and does not recognize existing PDF structure, it seems SEVERELY limited in
> real world use.
> From: Alec Taylor <alec.taylor6 at gmail.com>
> Date: Thu, 10 Nov 2011 13:57:54 -0800
> To: Leonard Rosenthol <lrosenth at adobe.com>
> Cc: "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>, Albert
> Cid <aacid at kde.org>
> Subject: Re: [poppler] Extract title from pdf file.
> As was previously mentioned, I am adding the semantic and logical
> structuring into poppler core.
> My plan is to figure out what fits into which category by post processing
> the XML. Any suggestions on how to reverse [or post?!] engineer this XML
> back into the PDF would be appreciated.
> In a few days I will have a very accurate XML generated with
> <header></header>, <footer></footer> and table of contents tags.
> This will involve the "pushing" of the actual "printed" page numbers,
> adding a hyperlink to each ToC entry, and partitioning the page structure
> as far as the 1.3 standard allows.
> My code is extremely modular, neat & efficient, and includes an OO API.
> So it should be easily extendable with author, title, publisher,
> year and section title extraction capabilities.