[poppler] [PATCH 0/6] Tagged-PDF support for poppler and poppler-glib
Adrian Perez
aperez at igalia.com
Wed May 29 16:47:25 PDT 2013
From: Adrian Perez de Castro <aperez at igalia.com>
Hello to all!
We have been working at Igalia to improve the accessibility support in
Evince, and as a part of this work we wanted Tagged-PDF support in poppler
(and poppler-glib). Apart from allowing accessibility technologies to know
the logical structure of documents and their layout attributes, supporting
this feature allows for a number of niceties like better support for
exporting to other formats, reflowing documents (for example in devices
with small screens), etc.
The first two patches add the actual low-level support. Let me elaborate
on this part, as it is the biggest (and most complex) one.
(Note: "/SlashedPrefix" names refer to actual items in the PDF.)
* The structure tree is mostly read-only for the moment. There are some
setters here and there, but no real support to create a structure
tree programmatically. There is no support for writing to file either.
* Catalog no longer uses an Object for the /StructTreeRoot, but an instance
of StructTreeRoot.
* Using Catalog::getStructTreeRoot() or PDFDoc::getStructTreeRoot() will
parse the structure tree when first accessed. The whole tree is created
at once, the main work being done by StructTreeRoot::parse() and
StructElement::parse().
* The Catalog owns the StructTreeRoot and its StructElements.
* StructTreeRoot keeps references to /ClassMap and /RoleMap, which may
be needed when parsing /StructElem objects. This is why StructElement
objects have a reference to their StructTreeRoot.
* std::vector<> is used for lists of children elements in StructTreeRoot
and StructElement. IMHO, it is better to use "a bit more modern C++"
for code which does not need to be merged/rebased with changes coming
from new potential Xpdf releases.
* To extract the actual content referenced by an element, I have
implemented an output device (MCOutputDev), which takes a MCID and
records the painting operations in the page stream for that one
MCID. Character drawing, changes in font faces, and font style
(italics/bold/fixed-width) are recorded. Recording is done using
MCOp structures (short for Marked Content Operation).
Once a page (or range of pages) has been "displayed" using a
MCOutputDev, the list of operations can be obtained with
MCOutputDev::getMCOps(). As an example: to obtain only the text
contents iterate over the list, and pick only the MCOp with
type==mcOpUnichar, converting them into a string as you go.
(There is a complete example of this in StructElement::getText().)
Initially I tried subclassing TextOutputDev, but its innards are
done in such a way that skipping the content and picking only the
parts marked with a particular MCID would leave it in an inconsistent
state -- making it segfault.
* Parsing tries to be tolerant and continue reading as much information
as possible before bailing out. Nevertheless, checks on the
parsed data are done and warnings are printed using errSyntaxError
in a number of places.
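
The MCOp iteration described above (keep only the entries with
type==mcOpUnichar and build a string from them) could look roughly like
the sketch below. The MCOp layout, the other enum values, and the helper
names appendUtf8() and textFromOps() are assumptions made for
illustration; the real definitions live in poppler/MCOutputDev.h and
may well differ.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for the types in MCOutputDev.h. Only the
// mcOpUnichar name comes from the cover letter; the rest is assumed.
enum MCOpType { mcOpUnichar, mcOpFontName, mcOpFlags };

struct MCOp {
    MCOpType type;
    uint32_t unichar;  // meaningful only when type == mcOpUnichar
};

// Append one Unicode code point to a UTF-8 encoded string.
static void appendUtf8(std::string &out, uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Walk the recorded operations and keep only the character ones,
// skipping font and style changes.
std::string textFromOps(const std::vector<MCOp> &ops) {
    std::string text;
    for (const MCOp &op : ops) {
        if (op.type == mcOpUnichar)
            appendUtf8(text, op.unichar);
    }
    return text;
}
```

In the patch itself the operation list would come from
MCOutputDev::getMCOps() after displaying the page; here it is just a
std::vector passed in by the caller.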
Known issues / TODOs:
* Lookups in /RoleMap should be recursive and able to detect loops.
(I am working on this while waiting for feedback from the code
review :D)
* Object References (/OBJR) are not handled. I have not seen PDFs
using those, with the references pointing to text. As the focus
is improving accessibility, I left this unimplemented for
the moment.
* Marked Content Reference objects (/MCR) can contain a reference
to the exact stream which has the actual content. Those are
ignored, as having the page reference (/Pg) and /MCID is enough.
Also, it looked to me that it would be a bit cumbersome to
interpret a single stream with the existing poppler APIs
(Suggestions and hints are welcome!).
* Attribute inheritance is not handled very well when the /Placement
of an element is specified and it is other than the default (e.g.
if an inline element like /Span has set /Placement/Block).
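
As a rough illustration of the first item (following /RoleMap entries
transitively while detecting loops), here is a minimal sketch. It
assumes the role map has already been flattened into a plain
name-to-name table, walks the chain iteratively rather than
recursively, and needs C++17 for std::optional. RoleMap, resolveRole()
and the standardTypes set are hypothetical names, not part of the
patches.

```cpp
#include <optional>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical flattened view of /RoleMap: custom type name -> mapped name.
using RoleMap = std::unordered_map<std::string, std::string>;

// Follow role map entries until a standard structure type is reached.
// Returns std::nullopt if the chain loops back on itself or ends at a
// name that is neither standard nor mapped any further.
std::optional<std::string>
resolveRole(const RoleMap &roleMap,
            const std::unordered_set<std::string> &standardTypes,
            const std::string &name) {
    std::unordered_set<std::string> seen;
    std::string current = name;
    while (standardTypes.count(current) == 0) {
        if (!seen.insert(current).second)
            return std::nullopt;  // name already visited: loop detected
        auto it = roleMap.find(current);
        if (it == roleMap.end())
            return std::nullopt;  // chain ends at an unknown type
        current = it->second;
    }
    return current;
}
```

Tracking the visited names in a set keeps the lookup bounded by the
number of distinct entries, so a malformed map such as A -> B -> A
terminates instead of spinning forever.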
Other / Misc:
* There is initial poppler-glib support, which exposes only a subset
of the low-level functionality. I will be updating this in the
next few days, but I wanted to include it to have some feedback about
the API.
* Bonus: there is a patch to add a new pane in the poppler-glib
demo with the document structure. It is a bit crude, but serves
as a usage example of the API.
* Bonus (x2): There is a patch for pdfinfo which will make it print
the document structure when invoked as "pdfinfo -struct" or
"pdfinfo -struct-text" (the latter including each element's text).
Very useful for debugging.
* Bonus (x3): I have cleaned up some test code and used it to make
a "pdfstructtohtml" utility. It is very simplistic for the moment,
yet the resulting HTML it produces is quite clean and neat for
PDF files without an overly complex
That is all for now. I will also be attaching the patches to the relevant
bugs (which I created some days ago, as dependencies on a meta-bug [1]
tracking all the Tagged-PDF parts). All the feedback/critique you can
provide will be handy :-)
Best regards,
-Adrian
---
[1] https://bugs.freedesktop.org/show_bug.cgi?id=tagged-pdf
----
Adrian Perez de Castro (6):
Tagged-PDF: Accessors in Catalog for the MarkInfo dictionary
Tagged-PDF: Interpret the document structure
Tagged-PDF: Modify pdfinfo to show the document structure
Tagged-PDF: Implement the utils/pdfstructtohtml tool
Tagged-PDF: Expose the structure tree in poppler-glib
Tagged-PDF: Pane in poppler-glib demo showing the structure
glib/Makefile.am                    |    4 +
glib/demo/Makefile.am               |    2 +
glib/demo/main.c                    |    2 +
glib/demo/taggedstruct.c            |  230 ++++++
glib/demo/taggedstruct.h            |   31 +
glib/poppler-document.cc            |   22 +
glib/poppler-document.h             |    1 +
glib/poppler-private.h              |   24 +
glib/poppler-structure-element.cc   | 1289 +++++++++++++++++++++++++++++++++
glib/poppler-structure-element.h    |  346 +++++++++
glib/poppler-structure.cc           |  349 +++++++++
glib/poppler-structure.h            |   43 ++
glib/poppler.h                      |    3 +
glib/reference/poppler-docs.sgml    |    2 +
glib/reference/poppler-sections.txt |   86 +++
glib/reference/poppler.types        |    2 +
poppler/Catalog.cc                  |   81 ++-
poppler/Catalog.h                   |   15 +-
poppler/MCOutputDev.cc              |  145 ++++
poppler/MCOutputDev.h               |  108 +++
poppler/Makefile.am                 |    6 +
poppler/PDFDoc.h                    |    3 +-
poppler/StructElement.cc            | 1361 +++++++++++++++++++++++++++++++++++
poppler/StructElement.h             |  273 +++++++
poppler/StructTreeRoot.cc           |  120 +++
poppler/StructTreeRoot.h            |   56 ++
utils/Makefile.am                   |    5 +
utils/pdfinfo.cc                    |   97 ++-
utils/pdfstructtohtml.cc            |  387 ++++++++++
29 files changed, 5074 insertions(+), 19 deletions(-)
create mode 100644 glib/demo/taggedstruct.c
create mode 100644 glib/demo/taggedstruct.h
create mode 100644 glib/poppler-structure-element.cc
create mode 100644 glib/poppler-structure-element.h
create mode 100644 glib/poppler-structure.cc
create mode 100644 glib/poppler-structure.h
create mode 100644 poppler/MCOutputDev.cc
create mode 100644 poppler/MCOutputDev.h
create mode 100644 poppler/StructElement.cc
create mode 100644 poppler/StructElement.h
create mode 100644 poppler/StructTreeRoot.cc
create mode 100644 poppler/StructTreeRoot.h
create mode 100644 utils/pdfstructtohtml.cc
--
1.8.3