[poppler] [PATCH 0/6] Tagged-PDF support for poppler and poppler-glib

Adrian Perez aperez at igalia.com
Wed May 29 16:47:25 PDT 2013


From: Adrian Perez de Castro <aperez at igalia.com>

Hello to all!

We have been working at Igalia to improve the accessibility support in
Evince, and as a part of this work we wanted Tagged-PDF support in poppler
(and poppler-glib). Apart from allowing accessibility technologies to know
the logical structure of documents and their layout attributes, supporting
this feature allows for a number of niceties like better support for
exporting to other formats, reflowing documents (for example in devices
with small screens), etc.

The first two patches add the actual low-level support. Let me elaborate
on this part, as it is the biggest (and most complex) one.

(Note: "/SlashedPrefix" names refer to actual items in the PDF.)

 * The structure tree is mostly read-only for the moment. There are some
   setters here and there, but no real support to create a structure
   tree programmatically. There is no support for writing to file either.

 * Catalog no longer uses an Object for the /StructTreeRoot, but an instance
   of StructTreeRoot.

 * Using Catalog::getStructTreeRoot() or PDFDoc::getStructTreeRoot() will
   parse the structure tree when first accessed. The whole tree is created
   at once, the main work being done by StructTreeRoot::parse() and
   StructElement::parse().
   
 * The Catalog owns the StructTreeRoot and its StructElements.

 * StructTreeRoot keeps references to /ClassMap and /RoleMap, which may
   be needed when parsing /StructElem objects. This is why StructElement
   objects have a reference to their StructTreeRoot.

 * std::vector<> is used for lists of child elements in StructTreeRoot
   and StructElement. IMHO, it is better to use "a bit more modern C++"
   for code which does not need to be merged/rebased with changes coming
   from new potential Xpdf releases.

 * To extract the actual content referenced by an element, I have
   implemented an output device (MCOutputDev), which takes an MCID and
   records the painting operations in the page stream for that one
   MCID. Character drawing, changes in font faces, and font style
   (italics/bold/fixed-width) are recorded. Recording is done using
   MCOp structures (short for Marked Content Operation).

   Once a page (or range of pages) has been "displayed" using an
   MCOutputDev, the list of operations can be obtained with
   MCOutputDev::getMCOps(). As an example: to obtain only the text
   contents, iterate over the list and pick only the MCOps with
   type==mcOpUnichar, converting them into a string as you go.
   (There is a complete example of this in StructElement::getText().)
   
   Initially I tried subclassing TextOutputDev, but its innards are
   done in such a way that skipping the content and picking only the
   parts marked with a particular MCID would leave it in an inconsistent
   state -- making it segfault.

 * Parsing tries to be tolerant and continue reading as much information
   as possible before bailing out. Nevertheless, checks on the
   parsed data are done and warnings are printed using errSyntaxError
   in a number of places.
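
The text-extraction loop described for MCOutputDev::getMCOps() can be
sketched roughly as below. This is only an illustration: the MCOp
layout, the enumerators other than mcOpUnichar, and the helper names
appendUtf8() and textFromOps() are assumptions made for the example and
do not mirror the actual definitions in poppler/MCOutputDev.h.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the MCOp record. Only mcOpUnichar is named in
// the description above; the other enumerators and the field layout are
// assumptions for this sketch.
enum MCOpType { mcOpUnichar, mcOpFontName, mcOpFlags };

struct MCOp {
    MCOpType type;
    uint32_t unichar; // meaningful only when type == mcOpUnichar
};

// Append one Unicode code point to a UTF-8 encoded string.
static void appendUtf8(std::string &out, uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Walk the recorded operations, keep only the character ones, and build
// the text as we go. Font and style operations are simply skipped.
std::string textFromOps(const std::vector<MCOp> &ops)
{
    std::string text;
    for (const MCOp &op : ops) {
        if (op.type == mcOpUnichar) {
            appendUtf8(text, op.unichar);
        }
    }
    return text;
}
```

A caller that also cares about styling could handle the non-character
operations in the same loop instead of skipping them.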

Known issues / TODOs:

 * Lookups in /RoleMap should be recursive and able to detect loops.
   (I am working on this while waiting for feedback from the code
   review :D)

 * Object References (/OBJR) are not handled. I have not seen PDFs
   in which those references point to text. As the focus is on
   improving accessibility, I left this unimplemented for the
   moment.

 * Marked Content Reference objects (/MCR) can contain a reference
   to the exact stream which has the actual content. Those are
   ignored, as having the page reference (/Pg) and /MCID is enough.
   Also, it looked to me like it would be a bit cumbersome to
   interpret a single stream with the existing poppler APIs
   (Suggestions and hints are welcome!).

 * Attribute inheritance is not handled very well when the /Placement
   of an element is specified and it is other than the default (e.g.
   if an inline element like /Span has set /Placement/Block).
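
For the /RoleMap item, one possible shape for a recursive lookup with
loop detection is sketched below. It is not taken from the patches: the
RoleMap alias, the resolveRole() function, and the idea of passing the
set of standard structure types as a parameter are all assumptions made
for illustration.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical in-memory form of /RoleMap: custom type name -> mapped name.
using RoleMap = std::map<std::string, std::string>;

// Follow the mapping chain starting at `name` until a standard structure
// type is reached. Returns an empty string when the chain ends at an
// unmapped, non-standard name or when it loops back on itself.
std::string resolveRole(const RoleMap &roleMap,
                        const std::set<std::string> &standardTypes,
                        std::string name)
{
    std::set<std::string> seen; // names already visited on this chain
    while (standardTypes.count(name) == 0) {
        if (!seen.insert(name).second) {
            return std::string(); // name seen before: the map has a loop
        }
        RoleMap::const_iterator it = roleMap.find(name);
        if (it == roleMap.end()) {
            return std::string(); // dead end: no mapping for this name
        }
        name = it->second;
    }
    return name;
}
```

With a map such as Chapter -> Section -> Sect, and Sect among the
standard types, resolving "Chapter" yields "Sect", while a pair like
A -> B -> A is reported as unresolved instead of looping forever.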

Other / Misc:

 * There is initial poppler-glib support, which exposes only a subset
   of the low-level functionality. I will be updating this in the
   coming days, but I wanted to include it to get some feedback about
   the API.

 * Bonus: there is a patch to add a new pane in the poppler-glib
   demo with the document structure. It is a bit crude, but serves
   as a usage example of the API.

 * Bonus (x2): There is a patch for pdfinfo which will make it print
   the document structure when invoked as "pdfinfo -struct" or
   "pdfinfo -struct-text" (the latter including each element's text).
   Very useful for debugging.

 * Bonus (x3): I have cleaned up some test code and used it to make
   a "pdfstructtohtml" utility. It is very simplistic for the moment,
   yet the resulting HTML it produces is quite clean and neat for
   PDF files without an overly complex

That is all for now. I will also be attaching the patches to the relevant
bugs (which I created some days ago, as dependencies on a meta-bug [1]
tracking all the Tagged-PDF parts). All the feedback/critique you can
provide will be handy :-)

Best regards,

-Adrian


---
[1] https://bugs.freedesktop.org/show_bug.cgi?id=tagged-pdf

----


Adrian Perez de Castro (6):
  Tagged-PDF: Accessors in Catalog for the MarkInfo dictionary
  Tagged-PDF: Interpret the document structure
  Tagged-PDF: Modify pdfinfo to show the document structure
  Tagged-PDF: Implement the utils/pdfstructtohtml tool
  Tagged-PDF: Expose the structure tree in poppler-glib
  Tagged-PDF: Pane in poppler-glib demo showing the structure

 glib/Makefile.am                    |    4 +
 glib/demo/Makefile.am               |    2 +
 glib/demo/main.c                    |    2 +
 glib/demo/taggedstruct.c            |  230 ++++++
 glib/demo/taggedstruct.h            |   31 +
 glib/poppler-document.cc            |   22 +
 glib/poppler-document.h             |    1 +
 glib/poppler-private.h              |   24 +
 glib/poppler-structure-element.cc   | 1289 +++++++++++++++++++++++++++++++++
 glib/poppler-structure-element.h    |  346 +++++++++
 glib/poppler-structure.cc           |  349 +++++++++
 glib/poppler-structure.h            |   43 ++
 glib/poppler.h                      |    3 +
 glib/reference/poppler-docs.sgml    |    2 +
 glib/reference/poppler-sections.txt |   86 +++
 glib/reference/poppler.types        |    2 +
 poppler/Catalog.cc                  |   81 ++-
 poppler/Catalog.h                   |   15 +-
 poppler/MCOutputDev.cc              |  145 ++++
 poppler/MCOutputDev.h               |  108 +++
 poppler/Makefile.am                 |    6 +
 poppler/PDFDoc.h                    |    3 +-
 poppler/StructElement.cc            | 1361 +++++++++++++++++++++++++++++++++++
 poppler/StructElement.h             |  273 +++++++
 poppler/StructTreeRoot.cc           |  120 +++
 poppler/StructTreeRoot.h            |   56 ++
 utils/Makefile.am                   |    5 +
 utils/pdfinfo.cc                    |   97 ++-
 utils/pdfstructtohtml.cc            |  387 ++++++++++
 29 files changed, 5074 insertions(+), 19 deletions(-)
 create mode 100644 glib/demo/taggedstruct.c
 create mode 100644 glib/demo/taggedstruct.h
 create mode 100644 glib/poppler-structure-element.cc
 create mode 100644 glib/poppler-structure-element.h
 create mode 100644 glib/poppler-structure.cc
 create mode 100644 glib/poppler-structure.h
 create mode 100644 poppler/MCOutputDev.cc
 create mode 100644 poppler/MCOutputDev.h
 create mode 100644 poppler/StructElement.cc
 create mode 100644 poppler/StructElement.h
 create mode 100644 poppler/StructTreeRoot.cc
 create mode 100644 poppler/StructTreeRoot.h
 create mode 100644 utils/pdfstructtohtml.cc

-- 
1.8.3


