[poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Thu Dec 20 09:10:06 UTC 2018


Hi,

Recently Jeroen Ooms asked me whether "links" in a PDF could
be retrieved via the cpp frontend. Reading the sources, I found
that some basic utilities are already included, but I could
not understand how to use them. Let me summarize my
understanding of the current situation and ask some questions.

1) "hyperlink" in PDF

In PDF, there is no straightforward "hyperlink" that could
be treated like "<a href='aaa'>bbb</a>". A PDF can include "Annot"
objects; an Annot object consists of a region and related actions.
If an HTML document with a hyperlink is converted to PDF via
WebKit, the hyperlink is converted to an Annot consisting of a
rectangular region (overlapping the annotated text, i.e.
bbb in the above example) and a URI (aaa in the above example).

However, the text "bbb" itself is not part of the Annot object.
In fact, a hyperlink in a PDF is not always attached to
text; it could be attached to a graphical object, or
perhaps to "nothing" (only the region to be clicked is
defined).

2) Annot in poppler

In poppler, there is a class "Annot". Depending on the related
action, there are several variants of Annot, such as AnnotPopup,
AnnotMarkup, AnnotMovie, AnnotScreen, and AnnotLink.

The Page object has a method getAnnots() which returns an object
listing the Annot objects in the page. By checking the subtype
of each Annot object, we can select the AnnotLink objects only.

As described above, an AnnotLink object itself does not say
which objects the annotation is attached to. To identify the
text objects covered by a given link, TextPage::coalesce()
includes the following code (executed if doHTML is true):

    //----- handle links
    for (i = 0; i < links->getLength(); ++i) {
      link = (TextLink *)links->get(i);

      // rot = 0
      if (pools[0]->minBaseIdx <= pools[0]->maxBaseIdx) {
        startBaseIdx = pools[0]->getBaseIdx(link->yMin);
        endBaseIdx = pools[0]->getBaseIdx(link->yMax);
        for (j = startBaseIdx; j <= endBaseIdx; ++j) {
          for (word0 = pools[0]->getPool(j); word0; word0 = word0->next) {
            if (link->xMin < word0->xMin + hyperlinkSlack &&
                word0->xMax - hyperlinkSlack < link->xMax &&
                link->yMin < word0->yMin + hyperlinkSlack &&
                word0->yMax - hyperlinkSlack < link->yMax) {
              word0->link = link->link;
            }
          }
        }
      }

If a word is found to lie within the region of an AnnotLink
(with a tolerance of hyperlinkSlack), the link member of the
TextWord object is set to the link held by the TextLink. If this
is executed, we can retrieve the hyperlinked URI for each word.
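
To make the test in the excerpt above easier to read in isolation,
here is a minimal self-contained sketch of the same containment check.
The types Box, Link and Word, the functions wordInsideLink() and
attachLinks(), and the slack parameter are hypothetical stand-ins
written for this mail; they are not poppler's classes. The sketch scans
all words by brute force, whereas the excerpt first narrows the
candidates by baseline index.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not poppler's classes.
struct Box {
    double xMin, yMin, xMax, yMax;
};

struct Link {
    Box box;
    std::string uri;
};

struct Word {
    std::string text;
    Box box;
    const Link *link;  // non-null once the word is found inside a link region
};

// Same shape as the condition in the excerpt: the word must lie inside the
// link rectangle, but may stick out by up to `slack` on each side.
bool wordInsideLink(const Box &word, const Box &link, double slack) {
    return link.xMin < word.xMin + slack &&
           word.xMax - slack < link.xMax &&
           link.yMin < word.yMin + slack &&
           word.yMax - slack < link.yMax;
}

// Attach each link to every word it covers (brute-force scan over all words).
void attachLinks(std::vector<Word> &words, const std::vector<Link> &links,
                 double slack) {
    for (const Link &l : links) {
        for (Word &w : words) {
            if (wordInsideLink(w.box, l.box, slack)) {
                w.link = &l;
            }
        }
    }
}
```

Note that the condition is a containment test rather than a plain
overlap test: a word that only partially intersects the link rectangle
does not receive the link.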

3) my question

TextPage::coalesce() assumes that the TextPage object has a
"links" member, a GooList of TextLink objects. Given an
AnnotLink, a TextLink object can be added by TextPage::addLink().
If we pass an AnnotLink object to the TextOutputDev::processLink()
method, TextPage::addLink() is called internally.

The scenario I have in mind is something like this:
step 1) take the Page object and get the Annots from it.
step 2) get each Annot object from the Annots object, and
if it is an AnnotLink, pass it to TextOutputDev::processLink().
step 3) execute TextPage::coalesce() and collect
the words.
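
The order of the three steps can be sketched with a small
self-contained model. Everything here (AnnotKind, Annot, Word,
TextSink, registerLinks) is a hypothetical stand-in, not poppler's API;
TextSink only imitates the roles that TextOutputDev::processLink() and
TextPage::coalesce() play in the description above, and it uses strict
containment without any slack to keep the model short.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these are poppler's classes or signatures.
enum class AnnotKind { Link, Popup, Markup };

struct Annot {
    AnnotKind kind;
    double xMin, yMin, xMax, yMax;
    std::string uri;  // only meaningful when kind == AnnotKind::Link
};

struct Word {
    std::string text;
    double xMin, yMin, xMax, yMax;
    const Annot *link;  // filled in by TextSink::coalesce()
};

// Plays the role of the text output device: links have to be registered
// before coalesce() runs, otherwise no word receives one.
class TextSink {
public:
    void addWord(const Word &w) { words_.push_back(w); }

    // Step 2 ends here: remember the link region for later matching.
    void processLink(const Annot *a) { links_.push_back(a); }

    // Step 3: attach each registered link to the words it fully contains.
    const std::vector<Word> &coalesce() {
        for (const Annot *l : links_) {
            for (Word &w : words_) {
                if (l->xMin <= w.xMin && w.xMax <= l->xMax &&
                    l->yMin <= w.yMin && w.yMax <= l->yMax) {
                    w.link = l;
                }
            }
        }
        return words_;
    }

private:
    std::vector<Word> words_;
    std::vector<const Annot *> links_;
};

// Steps 1 and 2: walk the page's annotations and forward only the links.
void registerLinks(const std::vector<Annot> &annots, TextSink &sink) {
    for (const Annot &a : annots) {
        if (a.kind == AnnotKind::Link) {
            sink.processLink(&a);
        }
    }
}
```

In this model the annotation list must outlive the sink, because the
sink stores pointers into it.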

Trying to apply this scenario to the current poppler-cpp, I
found it is hard.

Current poppler-cpp creates a TextOutputDev and renders the PDF
onto it with PDFDoc::displayPageSlice(). In displayPageSlice(),
Annot objects are handled like this:

  // draw annotations
  annotList = getAnnots();

  if (annotList->getNumAnnots() > 0) {
    if (globalParams->getPrintCommands()) {
      printf("***** Annotations\n");
    }
    for (i = 0; i < annotList->getNumAnnots(); ++i) {
        Annot *annot = annotList->getAnnot(i);
        if ((annotDisplayDecideCbk &&
             (*annotDisplayDecideCbk)(annot, annotDisplayDecideCbkData)) ||
            !annotDisplayDecideCbk) {
             annotList->getAnnot(i)->draw(gfx, printing);
        }
    }
    out->dump();
  }

This means that Annots with visible shapes are handled, but
objects like AnnotLink are not.

Also, during the displayPageSlice() process, a Page object is
built and destroyed, so an AnnotLink inserted before the process
does not change the result (it is lost when the Page object
is constructed).

Since displayPageSlice() is not suited to reflecting
AnnotLink, should I write something similar to displayPageSlice()
but slightly different, so that AnnotLink is reflected?

If there is a good example of handling hyperlinks in PDF with
the poppler library, please let me know.

Regards,
mpsuzuki


