<p>As previously mentioned, I am adding semantic and logical structuring to the poppler core.</p> <p>My plan is to figure out what fits into which category by post-processing the XML. Any suggestions on how to reverse [or post?!] engineer this XML back into the PDF would be appreciated.</p> <p>In a few days I will have a very accurate XML generated with &lt;header&gt;&lt;/header&gt;, &lt;footer&gt;&lt;/footer&gt; and table of contents tags.</p> <p>This will involve "pushing" the actual "printed" page numbers, adding a hyperlink to each ToC entry, and partitioning the page structure as far as the PDF 1.3 standard allows.</p> <p>My code is extremely modular, neat &amp; efficient, and includes an OO API, so it should be easy to extend with author, title, publisher, year and section-title extraction.</p>