[poppler] Extract title from pdf file.
lrosenth at adobe.com
Thu Nov 10 14:15:28 PST 2011
I am sorry to be pedantic, but this is EXTREMELY IMPORTANT…
What you are doing is adding HEURISTICS into Poppler to GUESS at the logical structure of a PDF. You are NOT actually taking into account any REAL LIVE logical structure that was put there by the PDF producer.
PDF 1.3 is about 15 YEARS OLD. NUMEROUS ADVANCES have been made to the format. PDF is currently at 1.7, as standardized by the ISO and adopted as national standards by almost 50 countries around the world. Version 2.0 (ISO 32000-2) is almost complete! To work only with 1.3 is, honestly, a waste. You are missing HUGE PIECES of functionality found in the majority of real-world documents.
I am sure your code is wonderful. However, given that it is based on 1.3 and does not recognize existing PDF structure, it seems SEVERELY limited in real world use.
From: Alec Taylor <alec.taylor6 at gmail.com>
Date: Thu, 10 Nov 2011 13:57:54 -0800
To: Leonard Rosenthol <lrosenth at adobe.com>
Cc: "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>, Albert Cid <aacid at kde.org>
Subject: Re: [poppler] Extract title from pdf file.
As was previously mentioned, I am adding the semantic and logical structuring into poppler core.
My plan is to figure out what fits into which category by post processing the XML. Any suggestions on how to reverse [or post?!] engineer this XML back into the PDF would be appreciated.
In a few days I will have very accurate XML generated, with <header></header>, <footer></footer> and table of contents tags.
This will involve "pushing" the actual "printed" page numbers, adding a hyperlink to each ToC entry, and partitioning the page structure as far as the 1.3 standard allows.
My code is extremely modular, neat & efficient, and includes an OO API. So it should be easily extendable with author, title, publisher, year and section title extraction capabilities.
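For readers following the thread, the XML post-processing step described in this message might look roughly like the sketch below. It is not the code under discussion: the <document>/<page>/<line> schema, the SAMPLE_XML data and the tag_running_lines name are all assumptions made up for illustration. It is also purely a heuristic (a line that repeats at the top or bottom of most pages is treated as a header or footer), so it does not read any logical structure the PDF producer may have embedded.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical input: one <page> per PDF page, one <line> per text line.
# This schema is an assumption, not the XML produced by the code in this thread.
SAMPLE_XML = """
<document>
  <page><line>Annual Report 2011</line><line>Introduction</line><line>Page 1</line></page>
  <page><line>Annual Report 2011</line><line>Methods</line><line>Page 2</line></page>
  <page><line>Annual Report 2011</line><line>Results</line><line>Page 3</line></page>
</document>
"""


def _normalise(text):
    """Collapse digit runs so that 'Page 1' and 'Page 2' compare equal."""
    return re.sub(r"\d+", "#", (text or "").strip())


def tag_running_lines(root, min_share=0.6):
    """Rename repeated first/last lines of each page to <header>/<footer>.

    A first (or last) line is treated as a running header (or footer) when
    its normalised text occurs on at least `min_share` of the pages.
    """
    pages = [p for p in root.findall("page") if len(p) > 0]
    if len(pages) < 2:
        return root  # repetition cannot be detected on a single page

    firsts = Counter(_normalise(p[0].text) for p in pages)
    lasts = Counter(_normalise(p[-1].text) for p in pages)
    threshold = min_share * len(pages)

    for page in pages:
        first_key = _normalise(page[0].text)
        if first_key and firsts[first_key] >= threshold:
            page[0].tag = "header"
        last_key = _normalise(page[-1].text)
        # Only consider a footer when the page has more than one line.
        if len(page) > 1 and last_key and lasts[last_key] >= threshold:
            page[-1].tag = "footer"
    return root


root = tag_running_lines(ET.fromstring(SAMPLE_XML))
print(ET.tostring(root, encoding="unicode"))
```

A guess like this will mislabel documents whose first body line happens to repeat, which is exactly the limitation raised in the reply above.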