[poppler] How to find link destinations in tagged pdfs

Shawn McMurdo shawn_mcmurdo at yahoo.com
Wed Oct 9 18:10:21 UTC 2019


Hi,
I'm new to poppler and PDF internals, so forgive me if this is obvious.
I am trying to find all of the relevant information about links within a tagged PDF that either go to a place in the same document or go to a URL.

I tried traversing the tree of StructElements starting from the StructTreeRoot like this:

  Catalog *catalog = doc->getCatalog();
  StructTreeRoot *root = catalog->getStructTreeRoot();
  // getStructTreeRoot() returns null when the document is not tagged.
  unsigned numChildren = root ? root->getNumChildren() : 0;
  for (unsigned i = 0; i < numChildren; i++) {
    const StructElement *child = root->getChild(i);
    printChild(child); // recursive
  }

Partial output from a version 1.4 PDF looks like:

  Document
    L (block)
      LI (block)
        LBody (block)
          P (block):
            Link (inline):
              Object 86 0
      LI (block)
        LBody (block)
          P (block):
            Link (inline):
              Object 89 0
    L (block)
      LI (block)
        LBody (block)
          P (block):
            Link (inline):
              Object 88 0
      LI (block)
        LBody (block)
          P (block):
            Link (inline):
              Object 87 0

This finds the Link elements, each of which appears to carry an object reference number (87, for example).
How can I find the URI, or the destination within the document, for such a link?

I have tried code similar to these three blocks:
(I don't really understand the difference between the first two, as the naming is confusing.)

  // 1. DestNameTreeDest
  int numDests = doc->getCatalog()->numDestNameTree();
  for (int i = 0; i < numDests; i++) {
    LinkDest *dest = doc->getCatalog()->getDestNameTreeDest(i);
    // printf
  }

  // 2. DestsDest
  numDests = doc->getCatalog()->numDests();
  for (int i = 0; i < numDests; i++) {
    LinkDest *dest = doc->getCatalog()->getDestsDest(i);
    // printf
  }

  // 3. Annot
  for (int i = firstPage; i <= lastPage; i++) {
    Page *p = doc->getPage(i);
    Annots *annots = p->getAnnots();
    int numAnnots = annots->getNumAnnots();
    for (int x = 0; x < numAnnots; x++) {
      Annot *a = annots->getAnnot(x);
      int type = a->getType();
      if (type == Annot::typeLink) {
        AnnotLink *link = static_cast<AnnotLink *>(a);
        // A link annotation may have no action, so check before dereferencing.
        LinkAction *action = link->getAction();
        if (action) {
          int kind = action->getKind();
          if (kind == actionGoTo) {        // printed as kind 0
            // GoTo
          } else if (kind == actionURI) {  // printed as kind 3
            // URI
          }
        }
      }
      int id = a->getId();
      const GooString *name = a->getName();
      const GooString *contents = a->getContents();
      // printf
    }
  }

When I run the code on a version 1.4 PDF containing both internal links and web links, the first two blocks don't seem to find anything.
The last block finds the following:

  Annot 0 Type 2 (Link) Kind 0 (GoTo) Id: 86 Contents: 
  Annot 1 Type 2 (Link) Kind 3 (URI) Id: 87 Contents: 
  Annot 2 Type 2 (Link) Kind 3 (URI) Id: 88 Contents: 
  Annot 3 Type 2 (Link) Kind 0 (GoTo) Id: 89 Contents: 
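
The object numbers printed under the Link elements in the tree dump (86, 89, 88, 87) are the same numbers that appear as Id values in this annotation list, which suggests the two outputs can be joined on that number. The following is only a sketch of that join under that assumption. It does not use poppler types: Node, LinkAnnot, collectLinkRefs and resolveKinds are hypothetical stand-ins for the printed tree, the printed annotation lines, and the lookup between them.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for one node of the structure tree as printed above.
// objNum is non-zero only for object-reference children ("Object 86 0").
struct Node {
    std::string type;            // e.g. "Document", "P", "Link", "Object"
    int objNum;                  // object number, or 0 if not an object reference
    std::vector<Node> children;
};

// Hypothetical stand-in for one line of the annotation dump.
struct LinkAnnot {
    int id;            // the value printed after "Id:"
    std::string kind;  // "GoTo" or "URI"
};

// Depth-first walk that collects, in document order, the object numbers
// found directly under Link elements.
void collectLinkRefs(const Node &n, bool parentIsLink, std::vector<int> &out) {
    if (parentIsLink && n.objNum != 0) {
        out.push_back(n.objNum);
    }
    for (const Node &c : n.children) {
        collectLinkRefs(c, n.type == "Link", out);
    }
}

// Join the two dumps: for every Link reference in the tree, look up the
// annotation with the same id and report its action kind.
// References with no matching annotation are reported as "unknown".
std::vector<std::string> resolveKinds(const Node &root,
                                      const std::vector<LinkAnnot> &annots) {
    std::map<int, std::string> kindById;
    for (const LinkAnnot &a : annots) {
        kindById[a.id] = a.kind;
    }
    std::vector<int> refs;
    collectLinkRefs(root, false, refs);
    std::vector<std::string> kinds;
    for (int r : refs) {
        std::map<int, std::string>::const_iterator it = kindById.find(r);
        kinds.push_back(it == kindById.end() ? "unknown" : it->second);
    }
    return kinds;
}
```

With real poppler objects, the id side of the join would be the value returned by a->getId() in block 3. Once a Link element is matched to its annotation this way, the action attached to that annotation is the place to look for the URI or destination.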

When I run the code on a different version 1.5 PDF containing both internal and web links I see the following:

  Page  Destination                 Name
     1 [ XYZ  346  209 null      ] "EN-05-10531.indd:Application Number:1832"
     1 [ XYZ  343  593 null      ] "EN-05-10531.indd:Welcome to the Social Security Benefit Application:1830"
  ---vvv--- Begin Page 1 Annots ---vvv---
  Printing 5 Annots.
  Annot 0 Type 2 (Link) Kind 3 (URI) Id: 177 Contents: 
  Annot 1 Type 2 (Link) Kind 3 (URI) Id: 178 Contents: 
  Annot 2 Type 2 (Link) Kind 3 (URI) Id: 179 Contents: 
  Annot 3 Type 2 (Link) Kind 3 (URI) Id: 180 Contents: 
  Annot 4 Type 2 (Link) Kind 3 (URI) Id: 181 Contents: 
  ---^^^--- End Page 1 Annots ---^^^---
     2 [ XYZ  311  256 null      ] "EN-05-10531.indd:Finishing Your Application:1835"
     2 [ XYZ   29  522 null      ] "EN-05-10531.indd:Questions About Your Benefits:1834"
     2 [ XYZ  311  707 null      ] "EN-05-10531.indd:Questions About Your Work:1833"
  ---vvv--- Begin Page 2 Annots ---vvv---
  Printing 0 Annots.
  ---^^^--- End Page 2 Annots ---^^^---

This did seem to find the XYZ dests for the internal links, but not any URLs.

Can anyone help point me in the right direction?
Thanks.
Shawn

