[poppler] Extract title from pdf file.

Thu Nov 10 15:21:54 PST 2011

Hi Leonard, Josh and Albert,

On 11/11/2011, at 9:42 AM, Leonard Rosenthol wrote:

> Albert was looking in the wrong place :).   
> 
> Check for either the MarkInfo and/or StructTreeRoot key in the Catalog.   Logical Structure was introduced in PDF 1.3 and Tagged PDF in 1.4 – so these features aren't all that new.

It is true that these are not new.
It is also true, unfortunately, that for many PDF-producing software
applications, either:

  1. the software cannot embed this kind of information;
or
  2. it can do some of it, but not all, and may not
     do it automatically for all documents;
or
  3. its users do not know how to do what is required to
     specify the appropriate Metadata and/or structure;
or
  4. maybe they do know how to, but could not be bothered
     to actually do so.

Without proper training on the purpose of metadata, and on why
encoding document structure is important or useful, this situation
is not going to change much.
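
For anyone wanting a rough count over a large pile of PDFs, here is a
minimal sketch in Python of looking for the two Catalog keys that
Leonard mentions. It is only a raw-byte scan, not a PDF parser: it will
miss a Catalog stored inside a compressed object stream (possible from
PDF 1.5 onwards), and it does not confirm that the name really sits in
the Catalog. The function names are my own, purely for illustration.

```python
import re
from pathlib import Path

# The two names Leonard mentions; in a well-formed file both are
# keys of the document Catalog.
STRUCT_KEY = re.compile(rb"/StructTreeRoot\b")
MARK_KEY = re.compile(rb"/MarkInfo\b")


def scan_for_structure_keys(data: bytes) -> dict:
    """Crude raw-byte scan of a PDF's contents; it does NOT parse the file.

    Assumption: the Catalog is stored uncompressed. A negative result
    is therefore only a hint, not proof that the PDF is untagged.
    """
    return {
        "StructTreeRoot": bool(STRUCT_KEY.search(data)),
        "MarkInfo": bool(MARK_KEY.search(data)),
    }


def survey(folder: str) -> dict:
    """Map each *.pdf file name in `folder` to its scan result."""
    return {
        path.name: scan_for_structure_keys(path.read_bytes())
        for path in sorted(Path(folder).glob("*.pdf"))
    }
```

Calling survey() on a directory gives one small dictionary per file;
anything reported as False for both keys would still need checking
with a real PDF library before drawing conclusions.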

> 
> They are generated by numerous PDF producers including (but not limited to) Adobe Acrobat, MS Office 2007 and later, OpenOffice, pdfTeX, etc.  These features are required in various international standards such as PDF/A-1a and PDF/A-2a as well as the new PDF/UA.

When one Prints a document to PDF (e.g. in Mac OS X) then a box comes up
allowing Metadata such as Title, Author, Subject, Keywords to be included.
But how many of your colleagues do you know who actually do anything but
accept the default strings?
For Title, the default is just the file name, without the extension.
How useful is that? It adds nothing to what is known from the file name itself.
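
Software that reads the Title could at least detect this case and
treat such a title as missing. A small sketch in Python, assuming the
title string and the file name have already been obtained by some
other means; the function name is hypothetical.

```python
from pathlib import PurePath
from typing import Optional


def title_is_just_filename(title: Optional[str], filename: str) -> bool:
    """True when the title is empty or only repeats the file name.

    The comparison ignores case and surrounding white space, and
    accepts the name either with or without its extension.
    """
    if title is None or not title.strip():
        return True
    path = PurePath(filename)
    normalised = title.strip().casefold()
    return normalised in {path.stem.casefold(), path.name.casefold()}
```

A title for which this returns True carries no more information than
the file name does, so it is a reasonable candidate for replacement.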

I'd expect the applications you list to be similar, except that they
do provide a sensible title, but *only* if the author has done the
right thing within the Word Processing application to declare a piece
of text as being *the* title.

> 
> I wish they all used it too…Unfortunately, many less capable PDF producers don't support it.

And that is presumably where Alec's application comes in, for a bunch
of PDFs that were created using software that doesn't provide
adequate Metadata --- or the authors never bothered to use that feature.

So the aim should be for his software to:

  1.  check whether a document title exists already, 
      in the DocInfo dictionary, say;

if not then

  2.  try to find an appropriate piece of text within the document 
      by applying some heuristics,

  3.  write this into (a new version of) the PDF, making sure to
      put it into the correct data structure (i.e. dictionary).

  It should add other appropriate Metadata too, such as Modification
  date/time and whatever else in XMP is useful and appropriate.
  An RDF block of Metadata might be added as well, and perhaps
  even a Colour profile.
  I'm sure Leonard could suggest other things too.
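
Steps 1 and 2 above can be sketched as follows, in Python. This is
only an illustration under stated assumptions: the existing title and
the first page's text lines, each with its font size, are taken to be
already extracted by some other tool, and "largest type on the first
page" is just one possible heuristic, not a recommendation. Step 3,
writing the result back into the PDF, is not shown. All names here
are hypothetical.

```python
from typing import List, Optional, Tuple

# (text, font size in points), in reading order on the first page.
Line = Tuple[str, float]


def guess_title(lines: List[Line],
                min_chars: int = 4,
                max_chars: int = 200) -> Optional[str]:
    """Guess a title: the first run of lines set in the largest type."""
    candidates = [(text.strip(), size) for text, size in lines
                  if min_chars <= len(text.strip()) <= max_chars]
    if not candidates:
        return None
    biggest = max(size for _, size in candidates)
    # Join consecutive lines at the largest size, so that a title
    # wrapped over two lines is kept whole.
    run = []
    for text, size in candidates:
        if abs(size - biggest) < 0.01:
            run.append(text)
        elif run:
            break
    return " ".join(run)


def choose_title(existing: Optional[str],
                 lines: List[Line]) -> Tuple[Optional[str], str]:
    """Step 1: keep an existing title; step 2: otherwise guess one."""
    if existing and existing.strip():
        return existing.strip(), "docinfo"
    guess = guess_title(lines)
    return (guess, "heuristic") if guess else (None, "none")
```

The second element of the result records where the title came from,
so that a guessed title can be reviewed before it is written into
any file.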

Adding the complete document structure tree is probably asking too
much at this stage --- though that should be an ultimate aim.
Adding such functionality to existing PDF-producing software
can be a highly complex task.

To give an example of how I'm working on this very task for pdfTeX 
--- in particular adding tagging of mathematical content --- 
take a look at this video of a talk that I gave recently at 
the TUG 2011 conference:

  http://river-valley.tv/further-advances-toward-tagged-pdf-for-mathematics/

This is ongoing work, and I'd appreciate your comments.

All the best,

	Ross

> 
> Leonard
> 
> From: Josh Richardson <jric at chegg.com>
> Date: Thu, 10 Nov 2011 14:28:10 -0800
> To: Leonard Rosenthol <lrosenth at adobe.com>, Alec Taylor <alec.taylor6 at gmail.com>
> Cc: Albert Cid <aacid at kde.org>, "Albert at freedesktop.org" <Albert at freedesktop.org>, "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>
> Subject: Re: [poppler] Extract title from pdf file.
> 
> Leonard, I don't understand.  You say Alec is "missing HUGE PIECES of functionality found in the majority of real-world documents", but Albert says he has 1200 documents and none of them has markings.  So, which is it, or what is it that Alec's missing?
> 
> I've got access to more than 10k PDFs, published in the past year or two, which I'd be happy to check, if you can tell me how.  I'd be curious to know how many of them are taking advantage of these newer PDF features, and I'd LOVE it if they all were.  Sadly, my guess is that it's close to zero. :-(
> 
> --josh

------------------------------------------------------------------------
Ross Moore                                       ross.moore at mq.edu.au 
Mathematics Department                           office: E7A-419      
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------