[poppler] Changes to 'tagged-pdf'
Carlos Garcia Campos
carlosgc at kemper.freedesktop.org
Thu Aug 1 05:01:16 PDT 2013
New branch 'tagged-pdf' available with the following commits:
commit 6cee8a7e0a4f86df088e817b6a619db4a3956e85
Author: Adrian Perez de Castro <aperez at igalia.com>
Date: Tue Jun 18 00:35:51 2013 +0300
Tagged-PDF: Text content extraction from structure elements
Implement StructElement::getText(), by using MCOutputDev. This output device
captures a sequence of MCOp structures representing the text drawing
operations for a particular marked content text object from the page stream.
Those are then used to convert the individual Unicode characters to the
returned string.
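The last step described above, turning captured Unicode characters into a string, can be sketched as follows. This is a minimal illustration, not the code from the commit: TextOpSketch is a hypothetical stand-in for a captured text operation (the real MCOp layout is not shown here), and UTF-8 output is an assumption made for the example.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one captured text drawing operation.
// It is NOT poppler's MCOp; it only carries a single code point.
struct TextOpSketch {
    char32_t unicode;
};

// Append one Unicode code point to `out`, encoded as UTF-8.
static void appendUtf8(std::string &out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Concatenate the characters of all captured operations, in stream order.
std::string textFromOps(const std::vector<TextOpSketch> &ops) {
    std::string result;
    for (const TextOpSketch &op : ops) {
        appendUtf8(result, op.unicode);
    }
    return result;
}
```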
commit b12697b451a4f23c0e52b88eee01557a8f14092a
Author: Adrian Perez de Castro <aperez at igalia.com>
Date: Tue Jun 18 00:24:21 2013 +0300
Tagged-PDF: Implement parsing of StructElem attributes
Parse attributes of StructElem nodes of the document structure tree.
Both standard attributes and user properties are mapped to instances
of the Attribute class. Attributes are parsed both via ClassMap
references and directly referenced from the StructElem objects.
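As a rough illustration of the two attribute sources mentioned above, the sketch below gathers attributes reached through class names (looked up in a ClassMap-like table) and attributes attached directly to an element. All type and function names here are hypothetical and do not reflect the Attribute class introduced by the commit; no precedence between the two sources is modelled.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified attribute: a name/value pair.
struct AttrSketch {
    std::string name;
    std::string value;
};

// Hypothetical ClassMap: class name -> attributes shared by that class.
using ClassMapSketch = std::map<std::string, std::vector<AttrSketch>>;

// Hypothetical element carrying both kinds of attribute source.
struct ElemAttrSource {
    std::vector<std::string> classNames; // classes the element refers to
    std::vector<AttrSketch> direct;      // attributes set on the element itself
};

// Collect class-based attributes first, then directly referenced ones.
std::vector<AttrSketch> collectAttributes(const ElemAttrSource &elem,
                                          const ClassMapSketch &classMap) {
    std::vector<AttrSketch> out;
    for (const std::string &cls : elem.classNames) {
        auto it = classMap.find(cls);
        if (it == classMap.end()) {
            continue; // unknown class name: nothing to add
        }
        out.insert(out.end(), it->second.begin(), it->second.end());
    }
    out.insert(out.end(), elem.direct.begin(), elem.direct.end());
    return out;
}
```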
commit b571cd261a83ab8055dee78d7c8ad6b667249852
Author: Adrian Perez de Castro <aperez at igalia.com>
Date: Mon Jun 17 23:20:04 2013 +0300
Tagged-PDF: Implement parsing of StructElem objects
Implement parsing of StructElem tree nodes from the document structure tree;
each object is parsed as a StructElement instance. Attributes and extraction
of content from elements are not yet handled.
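The node-by-node parsing described above amounts to a recursive walk. The sketch below shows that shape only: RawElemSketch (an already-decoded node with a type name and kids) and ElementNodeSketch are hypothetical types, not the StructElement class from the commit.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical decoded form of a structure node: a type name and child nodes.
struct RawElemSketch {
    std::string type;
    std::vector<RawElemSketch> kids;
};

// Hypothetical in-memory tree node built from the raw form.
struct ElementNodeSketch {
    std::string type;
    std::vector<std::unique_ptr<ElementNodeSketch>> children;
};

// Build one tree node per raw node, recursing into the kids.
std::unique_ptr<ElementNodeSketch> parseElemSketch(const RawElemSketch &raw) {
    auto node = std::make_unique<ElementNodeSketch>();
    node->type = raw.type;
    for (const RawElemSketch &kid : raw.kids) {
        node->children.push_back(parseElemSketch(kid));
    }
    return node;
}
```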
commit a0c0872415dbb640f0ebb7baef0c842794a7d455
Author: Adrian Perez de Castro <aperez at igalia.com>
Date: Mon Jun 17 17:00:27 2013 +0300
Tagged-PDF: Implement parsing of StructTreeRoot
Implement parsing of the StructTreeRoot entry of the Catalog. Also, the
Catalog::getStructTreeRoot() and PDFDoc::getStructTreeRoot() methods are
modified to return an instance of StructTreeRoot instead of an Object.
All elements from the StructTreeRoot are parsed except for:
- IDTree: a lookup tree to locate items by their ID. It would be of
  little use because the whole structure tree is kept in memory, which
  should be fast enough to traverse.
- ParentTreeNextKey: needed only when the ParentTree object is to be
  modified. For the moment the implementation deals only with reading,
  so this has been deliberately left out.
Also, pdfinfo now prints tagging info from Catalog::getMarkInfo()
instead of assuming that the presence of the StructTreeRoot implies that
the file is tagged.
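The difference between the two checks can be sketched as below. CatalogInfoSketch is a hypothetical summary of the relevant Catalog entries, not a poppler type, and treating the MarkInfo Marked flag as the deciding value is an assumption made for the example rather than a description of what pdfinfo prints.

```cpp
// Hypothetical summary of the Catalog entries relevant to tagging.
struct CatalogInfoSketch {
    bool hasStructTreeRoot; // a StructTreeRoot entry exists
    bool hasMarkInfo;       // a MarkInfo dictionary exists
    bool marked;            // value of the Marked flag, false if absent
};

// Old assumption: any structure tree means the file is tagged.
bool isTaggedByPresence(const CatalogInfoSketch &c) {
    return c.hasStructTreeRoot;
}

// Check based on the mark information instead of the tree's presence.
bool isTaggedByMarkInfo(const CatalogInfoSketch &c) {
    return c.hasMarkInfo && c.marked;
}
```

A file with a structure tree but no mark information is reported as tagged by the first check and as untagged by the second.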