[poppler] Changes to 'tagged-pdf'

Carlos Garcia Campos carlosgc at kemper.freedesktop.org
Thu Aug 1 05:01:16 PDT 2013


New branch 'tagged-pdf' available with the following commits:
commit 6cee8a7e0a4f86df088e817b6a619db4a3956e85
Author: Adrian Perez de Castro <aperez at igalia.com>
Date:   Tue Jun 18 00:35:51 2013 +0300

    Tagged-PDF: Text content extraction from structure elements
    
    Implement StructElement::getText(), by using MCOutputDev. This output device
    captures a sequence of MCOp structures representing the text drawing
    operations for a particular marked content text object from the page stream.
    The individual Unicode characters of those operations are then
    converted into the returned string.
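
As a rough illustration of the last step described above (turning captured
Unicode characters into a string), here is a minimal sketch. TextOp,
appendUtf8() and textFromOps() are hypothetical names invented for this
example; the actual MCOp layout and the output encoding used in the branch
are not shown in this message, so UTF-8 output is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one captured text drawing operation.
// The real MCOp structure may carry more data (font, position, ...).
struct TextOp {
    std::uint32_t unicode; // code point drawn by this operation
};

// Append one Unicode code point to `out`, encoded as UTF-8.
// Assumes a valid code point (no surrogate handling in this sketch).
void appendUtf8(std::uint32_t cp, std::string &out) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Concatenate the characters of all captured operations, in order.
std::string textFromOps(const std::vector<TextOp> &ops) {
    std::string text;
    for (const TextOp &op : ops) {
        appendUtf8(op.unicode, text);
    }
    return text;
}
```

The sketch keeps the operations in page-stream order, which is what a
capture-then-convert design would naturally produce.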

commit b12697b451a4f23c0e52b88eee01557a8f14092a
Author: Adrian Perez de Castro <aperez at igalia.com>
Date:   Tue Jun 18 00:24:21 2013 +0300

    Tagged-PDF: Implement parsing of StructElem attributes
    
    Parse attributes of StructElem nodes of the document structure tree.
    Both standard attributes and user properties are mapped to instances
    of the Attribute class. Attributes are parsed both when they are
    referenced through the ClassMap and when they are referenced directly
    from the StructElem objects.
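
To make the two attribute sources concrete, the following is a minimal
sketch, not the code from the branch. Attr, Elem, ClassMap and
collectAttributes() are hypothetical names; the real Attribute class holds
typed values rather than strings, and the order in which the two sources
are merged here is an assumption.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified attribute: a name and a string value.
struct Attr {
    std::string name;
    std::string value;
};

// Hypothetical ClassMap: class name -> attributes shared by that class.
using ClassMap = std::map<std::string, std::vector<Attr>>;

// Hypothetical element with both kinds of attribute references.
struct Elem {
    std::vector<std::string> classNames; // names resolved via the ClassMap
    std::vector<Attr> direct;            // attributes stored on the element
};

// Gather attributes from the ClassMap first, then the direct ones.
// Class names missing from the map are skipped.
std::vector<Attr> collectAttributes(const Elem &elem, const ClassMap &classMap) {
    std::vector<Attr> result;
    for (const std::string &className : elem.classNames) {
        auto it = classMap.find(className);
        if (it != classMap.end()) {
            result.insert(result.end(), it->second.begin(), it->second.end());
        }
    }
    result.insert(result.end(), elem.direct.begin(), elem.direct.end());
    return result;
}
```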

commit b571cd261a83ab8055dee78d7c8ad6b667249852
Author: Adrian Perez de Castro <aperez at igalia.com>
Date:   Mon Jun 17 23:20:04 2013 +0300

    Tagged-PDF: Implement parsing of StructElem objects
    
    Implement parsing of StructElem tree nodes from the document structure tree.
    Each object is parsed as a StructElement instance. Attributes and extraction
    of content from elements are not yet handled.

commit a0c0872415dbb640f0ebb7baef0c842794a7d455
Author: Adrian Perez de Castro <aperez at igalia.com>
Date:   Mon Jun 17 17:00:27 2013 +0300

    Tagged-PDF: Implement parsing of StructTreeRoot
    
    Implement parsing of the StructTreeRoot entry of the Catalog. Also, the
    Catalog::getStructTreeRoot() and PDFDoc::getStructTreeRoot() methods are
    modified to return an instance of StructTreeRoot instead of an Object.
    
    All elements from the StructTreeRoot are parsed except for:
    
    - IDTree: it is a lookup tree to locate items by their ID, which would
      be of little use because the whole structure tree is kept in memory
      and should be fast enough to traverse.
    - ParentTreeNextKey: This is needed only when the ParentTree object is
      to be modified. For the moment the implementation deals only with
      reading, so this has been deliberately left out.
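
The reasoning about IDTree above can be illustrated with a small sketch:
when the whole tree is in memory, an element can be located by ID with a
plain depth-first traversal. Node and findById() are hypothetical names
used only for this example and are not part of the branch.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical in-memory structure tree node.
struct Node {
    std::string id; // empty when the element has no ID
    std::vector<std::unique_ptr<Node>> children;
};

// Depth-first search for the first node whose ID equals `id`.
// Returns nullptr when no node matches.
const Node *findById(const Node &node, const std::string &id) {
    if (!id.empty() && node.id == id) {
        return &node;
    }
    for (const std::unique_ptr<Node> &child : node.children) {
        assert(child != nullptr);
        if (const Node *hit = findById(*child, id)) {
            return hit;
        }
    }
    return nullptr;
}
```

The traversal is linear in the number of elements, which is the trade-off
the commit message accepts in exchange for not parsing the lookup tree.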
    
    Also, pdfinfo now prints tagging info from Catalog::getMarkInfo()
    instead of assuming that the presence of the StructTreeRoot implies that
    the file is tagged.
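
The difference between the two checks can be sketched as follows. The flag
names and values are hypothetical and only mirror the keys of the MarkInfo
dictionary (Marked, UserProperties, Suspects); the actual return type of
Catalog::getMarkInfo() is not shown in this message.

```cpp
// Hypothetical bit flags, one per MarkInfo dictionary key.
// The numeric values are illustrative only.
enum MarkInfoFlag : unsigned {
    kMarked = 1u << 0,
    kUserProperties = 1u << 1,
    kSuspects = 1u << 2
};

// Old heuristic: any document with a structure tree counts as tagged.
bool taggedByStructTree(bool hasStructTreeRoot) {
    return hasStructTreeRoot;
}

// Check described in the commit: trust the MarkInfo flags instead.
bool taggedByMarkInfo(unsigned markInfo) {
    return (markInfo & kMarked) != 0;
}
```

A document with a structure tree but without the Marked flag is reported
as tagged by the first function and as untagged by the second.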


