[poppler] How to find link destinations in tagged pdfs
Shawn McMurdo
shawn_mcmurdo at yahoo.com
Wed Oct 9 18:10:21 UTC 2019
Hi,
I'm new to poppler and PDF internals so forgive me if this is obvious.
I am trying to find all of the relevant information about links within a tagged PDF that either go to a place in the same document or go to a URL.
I tried traversing the tree of StructElements starting from the StructTreeRoot like this:
Catalog *catalog = doc->getCatalog();
StructTreeRoot *root = catalog->getStructTreeRoot();
unsigned numChildren = root->getNumChildren();
for (unsigned i = 0; i < numChildren; i++) {
    StructElement *child = root->getChild(i);
    printChild(child); // recursive
}
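
The recursive printChild helper is not shown in this message. As a minimal sketch of that kind of depth-first walk, the following uses a hypothetical `Node` struct as a stand-in for a structure element (it is not poppler's StructElement, and its field names are assumptions) and collects the object numbers that sit directly under `Link` elements:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a structure-tree node; NOT poppler's StructElement.
struct Node {
    std::string type;            // e.g. "P" or "Link"; empty for an object-reference leaf
    int objNum;                  // object number of a reference leaf, -1 otherwise
    std::vector<Node> children;
};

// Depth-first walk that records every object reference whose parent is a Link.
void collectLinkRefs(const Node &node, bool parentIsLink, std::vector<int> &out) {
    if (node.objNum >= 0 && parentIsLink) {
        out.push_back(node.objNum);
    }
    for (const Node &child : node.children) {
        collectLinkRefs(child, node.type == "Link", out);
    }
}

// Usage example shaped like the first list in the output quoted in this post.
void exampleCollect() {
    Node link1{"Link", -1, {Node{"", 86, {}}}};
    Node link2{"Link", -1, {Node{"", 89, {}}}};
    Node item1{"LI", -1, {Node{"LBody", -1, {Node{"P", -1, {link1}}}}}};
    Node item2{"LI", -1, {Node{"LBody", -1, {Node{"P", -1, {link2}}}}}};
    Node doc{"Document", -1, {Node{"L", -1, {item1, item2}}}};

    std::vector<int> refs;
    collectLinkRefs(doc, false, refs);
    assert((refs == std::vector<int>{86, 89}));
}
```

Collecting the numbers into a vector, rather than only printing them, keeps them available for matching against other per-page data later.
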
Partial output from a version 1.4 PDF looks like:
Document
L (block)
LI (block)
LBody (block)
P (block):
Link (inline):
Object 86 0
LI (block)
LBody (block)
P (block):
Link (inline):
Object 89 0
L (block)
LI (block)
LBody (block)
P (block):
Link (inline):
Object 88 0
LI (block)
LBody (block)
P (block):
Link (inline):
Object 87 0
This finds the links, each of which seems to carry an object reference number (87, for example).
How can I find out the URI or destination location in the document for this link?
I have tried code similar to these 3 blocks:
(I don't really understand the difference between the first two as the naming is confusing.)
// 1. DestNameTreeDest
int numDests = doc->getCatalog()->numDestNameTree();
for (int i = 0; i < numDests; i++) {
    LinkDest *dest = doc->getCatalog()->getDestNameTreeDest(i);
    // printf
}
// 2. DestsDest
numDests = doc->getCatalog()->numDests();
for (int i = 0; i < numDests; i++) {
    LinkDest *dest = doc->getCatalog()->getDestsDest(i);
    // printf
}
// 3. Annot
for (int i = firstPage; i <= lastPage; i++) {
    Page *p = doc->getPage(i);
    Annots *annots = p->getAnnots();
    int numAnnots = annots->getNumAnnots();
    for (int x = 0; x < numAnnots; x++) {
        Annot *a = annots->getAnnot(x);
        int type = a->getType();
        if (type == Annot::typeLink) {
            AnnotLink *link = static_cast<AnnotLink *>(a);
            // A link annotation may carry no action, so guard before use.
            const LinkAction *action = link->getAction();
            int kind = action ? action->getKind() : actionUnknown;
            if (kind == actionGoTo) {
                // GoTo (printed as kind 0)
            } else if (kind == actionURI) {
                // URI (printed as kind 3)
            }
        }
        int id = a->getId();
        const GooString *name = a->getName();
        const GooString *contents = a->getContents();
        // printf
    }
}
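
The GoTo and URI branches in block 3 are where the link target would be read. A minimal sketch of the usual pattern (check the kind, then downcast to the matching subclass) follows, using hypothetical stand-in types (`Action`, `GoToAction`, `UriAction`) and a hypothetical `describe` helper; none of these are poppler's. In poppler the corresponding LinkAction subclasses are LinkGoTo and LinkURI in Link.h, but their accessor signatures differ between releases, so this only illustrates the shape of the code:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-ins that only mimic the shape of an action hierarchy.
// The numeric values are the ones printed in this post (0 = GoTo, 3 = URI).
enum ActionKind { kindGoTo = 0, kindURI = 3 };

struct Action {
    virtual ~Action() = default;
    virtual ActionKind kind() const = 0;
};

struct GoToAction : Action {
    int page; // target page number (assumed field, for illustration)
    explicit GoToAction(int p) : page(p) {}
    ActionKind kind() const override { return kindGoTo; }
};

struct UriAction : Action {
    std::string uri; // target address (assumed field, for illustration)
    explicit UriAction(std::string u) : uri(std::move(u)) {}
    ActionKind kind() const override { return kindURI; }
};

// Check the kind first, then downcast to the subclass that holds the target.
std::string describe(const Action *action) {
    if (!action) {
        return "no action";
    }
    switch (action->kind()) {
    case kindGoTo:
        return "GoTo page " + std::to_string(static_cast<const GoToAction *>(action)->page);
    case kindURI:
        return "URI " + static_cast<const UriAction *>(action)->uri;
    }
    return "other";
}

// Usage example with made-up targets.
void exampleDescribe() {
    UriAction web("https://example.org/");
    GoToAction internal(2);
    assert(describe(&web) == "URI https://example.org/");
    assert(describe(&internal) == "GoTo page 2");
}
```
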
When I run the code on a version 1.4 PDF containing both internal links and web links, the first two blocks don't seem to find anything.
The last block finds the following:
Annot 0 Type 2 (Link) Kind 0 (GoTo) Id: 86 Contents:
Annot 1 Type 2 (Link) Kind 3 (URI) Id: 87 Contents:
Annot 2 Type 2 (Link) Kind 3 (URI) Id: 88 Contents:
Annot 3 Type 2 (Link) Kind 0 (GoTo) Id: 89 Contents:
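
The annotation Ids here (86 to 89) are the same numbers printed as `Object` entries under the Link elements in the structure tree output. A small sketch of joining the two listings on that number is shown below. It assumes the object number of a Link element's reference is the annotation's Id, which the two outputs suggest but do not prove, and `kindName` and `kindsForRefs` are hypothetical helper names:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Kind numbers as labelled in the output of this post: 0 is GoTo, 3 is URI.
std::string kindName(int kind) {
    switch (kind) {
    case 0: return "GoTo";
    case 3: return "URI";
    default: return "other";
    }
}

// For each structure-tree object number, look up the action kind of the
// annotation with the same Id (assumption: the two numbers are the same object).
std::vector<std::string> kindsForRefs(const std::map<int, int> &kindByAnnotId,
                                      const std::vector<int> &structRefs) {
    std::vector<std::string> result;
    for (int ref : structRefs) {
        auto it = kindByAnnotId.find(ref);
        result.push_back(it == kindByAnnotId.end() ? std::string("no annotation")
                                                   : kindName(it->second));
    }
    return result;
}

// Usage example with the Ids, kinds, and tree order quoted in this post.
void exampleCrossRef() {
    const std::map<int, int> kindByAnnotId = {{86, 0}, {87, 3}, {88, 3}, {89, 0}};
    const std::vector<int> structRefs = {86, 89, 88, 87}; // order from the tree output
    const std::vector<std::string> kinds = kindsForRefs(kindByAnnotId, structRefs);
    assert((kinds == std::vector<std::string>{"GoTo", "GoTo", "URI", "URI"}));
}
```
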
When I run the code on a different PDF (version 1.5) that also contains both internal and web links, I see the following:
Page Destination Name
1 [ XYZ 346 209 null ] "EN-05-10531.indd:Application Number:1832"
1 [ XYZ 343 593 null ] "EN-05-10531.indd:Welcome to the Social Security Benefit Application:1830"
---vvv--- Begin Page 1 Annots ---vvv---
Printing 5 Annots.
Annot 0 Type 2 (Link) Kind 3 (URI) Id: 177 Contents:
Annot 1 Type 2 (Link) Kind 3 (URI) Id: 178 Contents:
Annot 2 Type 2 (Link) Kind 3 (URI) Id: 179 Contents:
Annot 3 Type 2 (Link) Kind 3 (URI) Id: 180 Contents:
Annot 4 Type 2 (Link) Kind 3 (URI) Id: 181 Contents:
---^^^--- End Page 1 Annots ---^^^---
2 [ XYZ 311 256 null ] "EN-05-10531.indd:Finishing Your Application:1835"
2 [ XYZ 29 522 null ] "EN-05-10531.indd:Questions About Your Benefits:1834"
2 [ XYZ 311 707 null ] "EN-05-10531.indd:Questions About Your Work:1833"
---vvv--- Begin Page 2 Annots ---vvv---
Printing 0 Annots.
---^^^--- End Page 2 Annots ---^^^---
This did seem to find the XYZ destinations for the internal links, but not any URLs.
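
For reference, a PDF XYZ destination carries a left coordinate, a top coordinate, and a zoom factor, and a null entry means the viewer keeps its current value for that field. The following sketch reads those three fields back out of a line formatted like the output in this message, for example `[ XYZ 346 209 null ]`. The text format is just what was printed here, not something poppler defines, and `Coord`, `XyzDest`, and `parseXyz` are hypothetical names:

```cpp
#include <cassert>
#include <sstream>
#include <string>

struct Coord {
    bool set = false;   // false when the entry was "null" (keep the current value)
    double value = 0.0;
};

struct XyzDest {
    Coord left, top, zoom;
};

// Parses text such as "[ XYZ 346 209 null ]"; returns false if it does not match.
bool parseXyz(const std::string &text, XyzDest &out) {
    std::istringstream in(text);
    std::string tok;
    if (!(in >> tok) || tok != "[") return false;
    if (!(in >> tok) || tok != "XYZ") return false;

    XyzDest dest;
    Coord *fields[] = {&dest.left, &dest.top, &dest.zoom};
    for (Coord *field : fields) {
        if (!(in >> tok)) return false;
        if (tok == "null") continue;
        std::istringstream num(tok);
        // Require the whole token to be a number.
        if (!(num >> field->value) || !num.eof()) return false;
        field->set = true;
    }
    if (!(in >> tok) || tok != "]") return false;
    out = dest;
    return true;
}

// Usage example with the first destination quoted in this post.
void exampleParseXyz() {
    XyzDest d;
    assert(parseXyz("[ XYZ 346 209 null ]", d));
    assert(d.left.set && d.left.value == 346.0);
    assert(d.top.set && d.top.value == 209.0);
    assert(!d.zoom.set);
}
```
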
Can anyone help point me in the right direction?
Thanks.
Shawn