[Poppler-bugs] [Bug 103530] Support XMP metadata for title, author etc.

Tue Jul 10 22:03:35 UTC 2018

https://bugs.freedesktop.org/show_bug.cgi?id=103530

--- Comment #6 from Evangelos Rigas <e.rigas at cranfield.ac.uk> ---
Hello all,

First, some background information.
I have recently added support for XMP metadata in Evince.
I added a couple of helper functions to extract basic objects from the XML,
and then wrapper functions to extract the actual information.

The implemented methods cover the following attributes:
Title, Subject, Keywords, Author, PDF/A or PDF/X format, License info, Creator,
Producer, and Dates (created, modified).

A day ago, I looked into adding support for extracting the PDF/A and PDF/X
version from the information dictionary, as some files don't have embedded
XMP metadata. At the moment, Evince does not recognise such files as PDF/A
or PDF/X.

However, extracting dictionary keys from the DOCINFO is trivial in poppler,
as the function to read the dictionary is already implemented and used for
extracting the title, author, etc.
Thus, I decided to add the support in poppler.
The result can be seen at
https://gitlab.gnome.org/erigas/poppler/tree/pdf_subtype.
I plan to send a patch shortly.

Second, as I was looking through the codebase and the bugs of poppler for
hints, I stumbled upon this bug.
I read all your comments and decided to look into porting my existing code
to poppler.
The initial porting tests have worked. I haven't pushed the branch yet;
I plan to do so in the coming days.

So, here is what I have done up to now.
Please note the changes are only applied to the Glib backend.

Added libxml2 as a dependency.
Added glib/poppler-metadata.cc, glib/poppler-metadata.h
The first part of poppler-metadata.cc contains the logic needed to read the
XML metadata.
The second part contains wrapper functions around the first ones.

As an example, consider the function to extract the author:
static char * xmp_metadata_get_author (xmlXPathContextPtr xpathCtx);
It requires an XPath context to read the XML tree and extract the author.
Then there is gchar * poppler_metadata_get_author (const gchar *metadata).
It takes the metadata string that contains the XML, opens an XPath context
for it, and passes that down to xmp_metadata_get_author to retrieve the
author.
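
To make the layering concrete, here is a minimal, self-contained sketch of
the same two-level layout. It is not the code from my branch: XmpContext,
text_between, xmp_get_author and metadata_get_author are hypothetical names,
and a naive string search stands in for the libxml2 XPath lookup so that the
example has no dependencies.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

namespace {

// Hypothetical stand-in for the parsed context (xmlXPathContextPtr in the real code).
struct XmpContext {
    std::string_view xml;
};

// Toy replacement for an XPath query: returns the text between the first
// `open` and the following `close`, searching from offset `from`.
std::optional<std::string> text_between(std::string_view xml, std::string_view open,
                                        std::string_view close, std::size_t from)
{
    const std::size_t start = xml.find(open, from);
    if (start == std::string_view::npos) return std::nullopt;
    const std::size_t begin = start + open.size();
    const std::size_t end = xml.find(close, begin);
    if (end == std::string_view::npos) return std::nullopt;
    return std::string(xml.substr(begin, end - begin));
}

// Internal helper: works on an already opened context, like xmp_metadata_get_author.
// Assumes the author is the first rdf:li after dc:creator; it does not check
// that the rdf:li actually sits inside the dc:creator element.
std::optional<std::string> xmp_get_author(const XmpContext &ctx)
{
    const std::size_t creator = ctx.xml.find("<dc:creator>");
    if (creator == std::string_view::npos) return std::nullopt;
    return text_between(ctx.xml, "<rdf:li>", "</rdf:li>", creator);
}

} // namespace

// Public wrapper: takes the raw metadata string, sets up the context and
// delegates, like poppler_metadata_get_author.
std::optional<std::string> metadata_get_author(const std::string &metadata)
{
    const XmpContext ctx{metadata};
    return xmp_get_author(ctx);
}
```

Only the wrapper would be visible to callers; the context type and the
lookup helper stay private to the source file.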

Only the poppler_metadata_get_* functions are exposed through
poppler-metadata.h.

Furthermore, I added helper functions in poppler-document.cc similar to the
ones already defined for the info dict.
Following the example above, here is the definition of the author function:

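gchar *
poppler_document_get_author_from_xmp (PopplerDocument *document)
{
  /* poppler_document_get_metadata() returns a newly allocated string,
   * or NULL if the document has no metadata stream. */
  gchar *metadata = poppler_document_get_metadata (document);
  gchar *author = nullptr;

  if (metadata != nullptr)
    author = poppler_metadata_get_author (metadata);

  /* The XML string is no longer needed; g_free() accepts NULL. */
  g_free (metadata);

  return author;
}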

Finally, here is my proposal on how to proceed.
First, to address the concerns about the info dict and XMP metadata.

> 
> > For
> > instance, for author there is the info in the catalog and the info in the
> > XMP metadata. My understanding of the PDF spec says that if the XMP metadata
> > is present, then the catalog data should be ignored.
> 
> That is not correct, the rules are more complex as far as i remember.
> 

From the PDF reference
(https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf)
under section H.3 in Appendix H.

I quote:
> For backward compatibility, applications that create PDF 1.4 documents
> should include the metadata for a document in the document information
> dictionary as well as in the document’s metadata stream. Applications that
> support PDF 1.4 should check for the existence of a metadata stream and
> synchronize the information in it with that in the document information
> dictionary. The Adobe metadata framework provides a date stamp for
> metadata expressed in the framework. If this date stamp is equal to or later
> than the document modification date recorded in the document information dictionary,
> the metadata stream can be taken as authoritative. If, however,
> the document modification date recorded in the document
> information dictionary is later than the metadata stream’s date stamp, the
> document has likely been saved by an application that is not aware of PDF
> 1.4 metadata streams. In this case, information stored in the document
> information dictionary should be taken to override any semantically
> equivalent items in the metadata stream.

So I believe that the functions as implemented now should remain as they are
for backwards compatibility. However, I propose some changes, which can be
implemented in a couple of stages.

First, add the functions to read the XMP metadata.
Probably at the same time, add a state variable (determined by the
modification dates) that indicates which information source is considered
valid, and use an if statement to switch the current callbacks accordingly.
Thus, clients will work out of the box after an update.
Furthermore, I suggest using the XMP value whenever the value from the
information dict is null.
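
As a rough illustration of the selection logic, here is a self-contained
sketch. The names MetadataSource, Preferred, preferred_source and
resolve_author are hypothetical and not part of poppler. The date comparison
follows the H.3 text quoted above; the handling of a missing timestamp, and
the fallback from XMP to the info dict when the XMP field is empty, are my
own assumptions.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical container for one metadata source; not a poppler type.
struct MetadataSource {
    std::optional<std::string> author;      // empty when the field is absent
    std::optional<std::int64_t> timestamp;  // seconds since the epoch:
                                            // ModDate or the XMP date stamp
};

enum class Preferred { InfoDict, Xmp };

// The "state variable": which source is authoritative for this document.
Preferred preferred_source(const MetadataSource &info, const MetadataSource &xmp)
{
    // Assumption: without an XMP date stamp, keep the current behaviour.
    if (!xmp.timestamp) return Preferred::InfoDict;
    // Assumption: without a ModDate there is nothing that could override XMP.
    if (!info.timestamp) return Preferred::Xmp;
    // H.3: XMP is authoritative if its stamp is equal to or later than ModDate.
    return (*xmp.timestamp >= *info.timestamp) ? Preferred::Xmp : Preferred::InfoDict;
}

// Return the author from the preferred source, falling back to the other
// source when the preferred one has no value.
std::optional<std::string> resolve_author(const MetadataSource &info, const MetadataSource &xmp)
{
    const bool use_xmp = preferred_source(info, xmp) == Preferred::Xmp;
    const std::optional<std::string> &first = use_xmp ? xmp.author : info.author;
    const std::optional<std::string> &second = use_xmp ? info.author : xmp.author;
    return first ? first : second;
}
```

With this shape, an existing getter such as the one for the author would
keep its signature and only change where the value comes from.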

At this first stage, only the most basic information will be extracted.
XMP supports a plethora of additional information, such as the contact
address (Iptc4xmpCore:CreatorContactInfo), contact email address(es)
(CiEmailWork), etc.
This can be added to poppler-metadata.cc at a later date.

For the next stage, the ability to update the XMP information based on the
info dict, as stated in the specification, must be added.

This is all I had to say.
I am open to suggestions.

My plan is to first add the PDF/A and PDF/X support.
Then I will finish porting the XMP metadata code to conclude the first stage
of the XMP support.

Kind Regards,

Evangelos
