[poppler] Extract title from pdf file.

Alec Taylor alec.taylor6 at gmail.com
Thu Nov 10 16:21:36 PST 2011


Thanks for your sound explanation, Ross; it is the framework within which
I have defined this problem.

On Fri, Nov 11, 2011 at 10:21 AM, Ross Moore <ross.moore at mq.edu.au> wrote:
> Hi Leonard, Josh and Albert,
>
> On 11/11/2011, at 9:42 AM, Leonard Rosenthol wrote:
>
>> Albert was looking in the wrong place :).
>>
>> Check for either the MarkInfo and/or StructTreeRoot key in the Catalog.   Logical Structure was introduced in PDF 1.3 and Tagged PDF in 1.4 – so these features aren't all that new.
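For anyone wanting to run this check over a large pile of files, here is a rough first-pass sketch using only the Python standard library. It is not how poppler or Acrobat does it: it simply scans the raw bytes for the two key names. The function name `scan_for_structure_keys` is made up for illustration. Note the limits: the Catalog may sit inside a compressed object stream (PDF 1.5 and later), in which case this scan reports nothing, and a hit anywhere in the file counts, not only one in the Catalog. A real check would load the Catalog with a proper PDF parser.

```python
import re

# A PDF name token ends at whitespace, a delimiter, or end of data, so the
# lookahead stops us matching longer names that merely share the prefix.
_NAME_END = rb"(?=[\s/<>\[\]()%{}]|$)"
_STRUCT = re.compile(rb"/StructTreeRoot" + _NAME_END)
_MARK = re.compile(rb"/MarkInfo" + _NAME_END)


def scan_for_structure_keys(data):
    """Report whether the raw PDF bytes mention the two structure keys.

    Crude heuristic: no parsing, no stream decompression, and no check
    that the keys actually belong to the Catalog dictionary.
    """
    return {
        "StructTreeRoot": bool(_STRUCT.search(data)),
        "MarkInfo": bool(_MARK.search(data)),
    }
```

To try it on a file, pass `pathlib.Path("some.pdf").read_bytes()`; files that report nothing are candidates for a second look with a real parser rather than proof that no tagging is present.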
>
> It is true that these are not new.
> It is also true, unfortunately, that many PDF-producing software
> applications either:
>
>  1. cannot embed this kind of information;
> or
>  2. can do some of it, but not all, and may not
>     do it automatically for all documents;
> or
>  3. their users do not know how to do what is required to
>     specify the appropriate Metadata and/or structure;
> or
>  4. maybe they do know how to, but could not be bothered
>     to actually do so.
>
> Without proper training on the purpose of metadata,
> and on why encoding document structure is important or useful,
> this situation is not going to change much.
>
>>
>> They are generated by numerous PDF producers including (but not limited to) Adobe Acrobat, MS Office 2007 and later, OpenOffice, pdfTeX, etc.  These features are required in various international standards such as PDF/A-1a and PDF/A-2a as well as the new PDF/UA.
>
> When one Prints a document to PDF (e.g. in Mac OS X) then a box comes up
> allowing Metadata such as Title, Author, Subject, Keywords to be included.
> But how many of your colleagues do you know who actually do anything but
> accept the default strings?
> For Title, the default is just the file name, without the '.' extension.
> How useful is that? It adds nothing to what is known from the file name itself.
>
> I'd expect the applications you list to be similar, providing a sensible
> title, but *only* if the author has done the right thing within the Word
> Processing application to declare a piece of text as being *the* title.
>
>>
>> I wish they all used it too…Unfortunately, many less capable PDF producers don't support it.
>
> And that is presumably where Alec's application comes in, for a bunch
> of PDFs that were created using software that doesn't provide
> adequate Metadata --- or the authors never bothered to use that feature.
>
> So the aim should be for his software to:
>
>  1.  check whether a document title exists already,
>      in the DocInfo dictionary, say;
>
> if not then
>
>  2.  try to find an appropriate piece of text within the document
>      by applying some heuristics,
>
>  3.  write this into (a new version of) the PDF, making sure to
>      put it into the correct data structure (i.e. dictionary).
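The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not my actual code: the function names are hypothetical, the first-page text is assumed to arrive as (text, font size) pairs from whatever extractor is in use, and "take the lines in the largest font" is just one plausible heuristic for step 2. A DocInfo title that merely repeats the file name is treated as missing, per Ross's point about default strings. Step 3 (writing back into the PDF) is left out.

```python
from pathlib import PurePath


def is_useful_title(title, filename):
    """Step 1: reject empty titles and ones that merely repeat the file name."""
    if not title or not title.strip():
        return False
    path = PurePath(filename)
    return title.strip().lower() not in {path.stem.lower(), path.name.lower()}


def guess_title(lines):
    """Step 2 (assumed heuristic): join the lines set in the largest font.

    `lines` holds (text, font_size) pairs for the first page, in reading order.
    """
    candidates = [(text.strip(), size) for text, size in lines if text.strip()]
    if not candidates:
        return None
    largest = max(size for _, size in candidates)
    return " ".join(text for text, size in candidates if size == largest)


def choose_title(info_title, filename, first_page_lines):
    """Return (title, source); a "heuristic" source means step 3 is still needed."""
    if is_useful_title(info_title, filename):
        return info_title.strip(), "docinfo"
    guessed = guess_title(first_page_lines)
    if guessed:
        return guessed, "heuristic"
    return None, "none"
```

Returning the source alongside the title lets the caller decide whether the PDF needs rewriting at all, which keeps files with good metadata untouched.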

My project is to do with header/footer analysis, ToC analysis and the
imposition of a logical structure onto PDFs delimiting this
information.

The due date for completion is the 24th of this month. By then I will
have (at the very least) reliable, accurate header/footer extraction
into an XML file. I have already done the entire middleware ([input
pdf]->[output header/footer in XML]) and have implemented the entire
project in an OO API with proper manipulation where expected, and even
a new xmltohf project for parallel processing.

I will also include a paper outlining methodology, results and a
comparison with previous work.

Once I have released this project, any of you will be able to easily
extend the API with other information such as metadata relating to
title, author, etc.

What I plan to do for the next 13 days (apart from studying for and sitting
two examinations for various unrelated tertiary studies, and completing
the final work for a conference I'm running) is: improve the accuracy
of my header/footer detection and push the information back into the
PDF. I should also have time to separate it into a ToC, and add it to
the bookmark "field" of the PDF.

Any advice on how I can write the information in the XML back into the PDF
would be very much appreciated. (i.e. what are the poppler library entry points
for inserting bookmarks, and imposing logical structures?)

Thanks for all suggestions,

Alec Taylor

>  It should add other appropriate Metadata too, such as Modification
>  date/time and whatever else in XMP is useful and appropriate.
>  An RDF block of Metadata might be added as well, and perhaps
>  even a Colour profile.
>  I'm sure Leonard could suggest other things too.
>
> Adding the complete document structure tree is probably asking too
> much at this stage --- though that should be an ultimate aim.
> This can be a highly complex task, adding such functionality
> to existing PDF-producing software.
>
> To give an example of how I'm working on this very task for pdfTeX
> --- in particular adding tagging of mathematical content ---
> take a look at this video of a talk that I gave recently at
> the TUG 2011 conference:
>
>  http://river-valley.tv/further-advances-toward-tagged-pdf-for-mathematics/
>
> This is ongoing work, and I'd appreciate your comments.
>
> All the best,
>
>        Ross
>
>>
>> Leonard
>>
>> From: Josh Richardson <jric at chegg.com>
>> Date: Thu, 10 Nov 2011 14:28:10 -0800
>> To: Leonard Rosenthol <lrosenth at adobe.com>, Alec Taylor <alec.taylor6 at gmail.com>
>> Cc: Albert Cid <aacid at kde.org>, "Albert at freedesktop.org" <Albert at freedesktop.org>, "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>
>> Subject: Re: [poppler] Extract title from pdf file.
>>
>> Leonard, I don't understand.  You say Alec is "missing HUGE PIECES of functionality found in the majority of real-world documents", but Albert says he has 1200 documents and none of them has markings.  So, which is it, or what is it that Alec's missing?
>>
>> I've got access to more than 10k PDFs, published in the past year or two, which I'd be happy to check, if you can tell me how.  I'd be curious to know how many of them are taking advantage of these newer PDF features, and I'd LOVE it if they all were.  Sadly, my guess is that it's close to zero. :-(
>>
>> --josh
>
> ------------------------------------------------------------------------
> Ross Moore                                       ross.moore at mq.edu.au
> Mathematics Department                           office: E7A-419
> Macquarie University                             tel: +61 (0)2 9850 8955
> Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
> ------------------------------------------------------------------------
>
>
>
>

