<html><head> <meta http-equiv="Content-Type" content="text/html; charset=Windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; color: rgb(0, 0, 0); font-size: 14px; font-family: Calibri, sans-serif; "><div>Albert was looking in the wrong place :).   </div><div><br></div><div>Check for a MarkInfo and/or StructTreeRoot key in the Catalog.   Logical Structure was introduced in PDF 1.3 and Tagged PDF in 1.4, so these features aren't all that new.</div><div><br></div><div>They are generated by numerous PDF producers including (but not limited to) Adobe Acrobat, MS Office 2007 and later, OpenOffice, pdfTeX, etc.  These features are required by various international standards such as PDF/A-1a and PDF/A-2a as well as the new PDF/UA.</div><div><br></div><div>I wish they all used it too… Unfortunately, many less capable PDF producers don't support it.</div><div><br></div><div>Leonard</div><div><br></div><span id="OLK_SRC_BODY_SECTION"><div style="font-family:Calibri; font-size:11pt; text-align:left; color:black; BORDER-BOTTOM: medium none; BORDER-LEFT: medium none; PADDING-BOTTOM: 0in; PADDING-LEFT: 0in; PADDING-RIGHT: 0in; BORDER-TOP: #b5c4df 1pt solid; BORDER-RIGHT: medium none; PADDING-TOP: 3pt"><span style="font-weight:bold">From: </span> Josh Richardson <<a href="mailto:jric@chegg.com">jric@chegg.com</a>><br><span style="font-weight:bold">Date: </span> Thu, 10 Nov 2011 14:28:10 -0800<br><span style="font-weight:bold">To: </span> Leonard Rosenthol <<a href="mailto:lrosenth@adobe.com">lrosenth@adobe.com</a>>, Alec Taylor <<a href="mailto:alec.taylor6@gmail.com">alec.taylor6@gmail.com</a>><br><span style="font-weight:bold">Cc: </span> Albert Cid <<a href="mailto:aacid@kde.org">aacid@kde.org</a>>, "<a href="mailto:Albert@freedesktop.org">Albert@freedesktop.org</a>" <<a href="mailto:Albert@freedesktop.org">Albert@freedesktop.org</a>>, "<a 
href="mailto:poppler@lists.freedesktop.org">poppler@lists.freedesktop.org</a>" <<a href="mailto:poppler@lists.freedesktop.org">poppler@lists.freedesktop.org</a>><br><span style="font-weight:bold">Subject: </span> Re: [poppler] Extract title from pdf file.<br></div><div><br></div><div><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; color: rgb(0, 0, 0); font-size: 14px; font-family: Calibri, sans-serif; "><div>Leonard, I don't understand.  You say Alec is "missing HUGE PIECES of functionality found in the majority of real-world documents", but Albert says he has 1200 documents and none of them has markings.  So, which is it, or <i>what</i> is it that Alec's missing?</div><div><br></div><div>I've got access to more than 10k PDFs, published in the past year or two, which I'd be happy to check, if you can tell me how.  I'd be curious to know how many of them are taking advantage of these newer PDF features, and I'd LOVE it if they all were.  Sadly, my guess is that it's close to zero. 
:-(</div><div><br></div><div>--josh</div><div><br></div><span id="OLK_SRC_BODY_SECTION"><div style="font-family:Calibri; font-size:11pt; text-align:left; color:black; BORDER-BOTTOM: medium none; BORDER-LEFT: medium none; PADDING-BOTTOM: 0in; PADDING-LEFT: 0in; PADDING-RIGHT: 0in; BORDER-TOP: #b5c4df 1pt solid; BORDER-RIGHT: medium none; PADDING-TOP: 3pt"><span style="font-weight:bold">From: </span> Leonard Rosenthol <<a href="mailto:lrosenth@adobe.com">lrosenth@adobe.com</a>><br><span style="font-weight:bold">Date: </span> Thu, 10 Nov 2011 14:15:28 -0800<br><span style="font-weight:bold">To: </span> Alec Taylor <<a href="mailto:alec.taylor6@gmail.com">alec.taylor6@gmail.com</a>><br><span style="font-weight:bold">Cc: </span> Cid <<a href="mailto:aacid@kde.org">aacid@kde.org</a>>, "<a href="mailto:Albert@freedesktop.org">Albert@freedesktop.org</a>" <<a href="mailto:Albert@freedesktop.org">Albert@freedesktop.org</a>>, "<a href="mailto:poppler@lists.freedesktop.org">poppler@lists.freedesktop.org</a>" <<a href="mailto:poppler@lists.freedesktop.org">poppler@lists.freedesktop.org</a>><br><span style="font-weight:bold">Subject: </span> Re: [poppler] Extract title from pdf file.<br></div><div><br></div><div><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; color: rgb(0, 0, 0); font-size: 14px; font-family: Calibri, sans-serif; "><div>I am sorry to be pedantic, but this is EXTREMELY IMPORTANT…</div><div><br></div><div>What you are doing is adding HEURISTICS into Poppler to GUESS at the logical structure of a PDF.  You are <span style="font-style: italic">NOT</span> actually taking into account any REAL LIVE logical structure that was put there by the PDF producer.  </div><div><br></div><div>PDF 1.3 is about 15 YEARS OLD.  NUMEROUS ADVANCES have been made to the format.  PDF is currently at 1.7, as standardized by the ISO and adopted as a national standard by almost 50 countries around the world.  
Version 2.0 (ISO 32000-2) is almost complete!  To work only with 1.3 is, honestly, a waste.  You are missing HUGE PIECES of functionality found in the majority of real-world documents.</div><div><br></div><div>I am sure your code is wonderful.  However, given that it is based on 1.3 and does not recognize existing PDF structure, it seems SEVERELY limited in real world use. </div><div><br></div><div>Leonard</div><div><br></div><span id="OLK_SRC_BODY_SECTION"><div style="font-family:Calibri; font-size:11pt; text-align:left; color:black; BORDER-BOTTOM: medium none; BORDER-LEFT: medium none; PADDING-BOTTOM: 0in; PADDING-LEFT: 0in; PADDING-RIGHT: 0in; BORDER-TOP: #b5c4df 1pt solid; BORDER-RIGHT: medium none; PADDING-TOP: 3pt"><span style="font-weight:bold">From: </span> Alec Taylor <<a href="mailto:alec.taylor6@gmail.com">alec.taylor6@gmail.com</a>><br><span style="font-weight:bold">Date: </span> Thu, 10 Nov 2011 13:57:54 -0800<br><span style="font-weight:bold">To: </span> Leonard Rosenthol <<a href="mailto:lrosenth@adobe.com">lrosenth@adobe.com</a>><br><span style="font-weight:bold">Cc: </span> "<a href="mailto:poppler@lists.freedesktop.org">poppler@lists.freedesktop.org</a>" <<a href="mailto:poppler@lists.freedesktop.org">poppler@lists.freedesktop.org</a>>, Albert Cid <<a href="mailto:aacid@kde.org">aacid@kde.org</a>><br><span style="font-weight:bold">Subject: </span> Re: [poppler] Extract title from pdf file.<br></div><div><br></div><p>As was previously mentioned, I am adding the semantic and logical structuring into poppler core.</p><p>My plan is to figure out what fits into which category by post processing the XML. Any suggestions on how to reverse [or post?!] 
engineer this XML back into the PDF would be appreciated.</p><p>In a few days I will have a very accurate XML generated with &lt;header&gt;&lt;/header&gt;, &lt;footer&gt;&lt;/footer&gt; and table of contents tags.</p><p>This will involve the "pushing" of the actual "printed" page numbers, adding a hyperlink to each ToC entry, and partitioning the page structure as far as the 1.3 standard allows.</p><p>My code is extremely modular, neat &amp; efficient, and includes an OO API. So it should be easily extendable with author, title, publisher, year and section title extraction capabilities.</p></span></div></div></span></div></div></span></body></html>