[poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?

Sat Dec 22 10:04:40 UTC 2018

Dear Leonard,

Thank you for the sample of Tagged PDF!
I found that pdftohtml can extract hyperlinks from both Tagged PDF and non-tagged PDF.

--

TextOutputDev has an internal switch "doHTML"; link (Annot) handling is
performed only when it is true. It is false by default, but it can be
switched on with the enableHTMLExtras() method. However, I cannot find an
example of its use in utils (and I'm afraid the cpp/glib/qt5 frontends do
not provide a public API to switch it).

Should I request merging some of the HtmlOutputDev code into TextOutputDev,
or request that HtmlOutputDev be moved into the poppler/ tree?

Regards,
mpsuzuki

Leonard Rosenthol wrote:
> Here is one.
> 
> Be aware that you MUST process the file according to the rules for Tagged PDF (aka walk the structure tree) and *NOT* using the content model (as the OutputDev's do in Poppler).
> 
> Leonard
> 
> -----Original Message-----
> From: suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp> 
> Sent: Thursday, December 20, 2018 8:02 AM
> To: Leonard Rosenthol <lrosenth at adobe.com>
> Cc: poppler at lists.freedesktop.org
> Subject: Re: how to reflect "hyperlinks" in PDF to TextOutputDev?
> 
> Dear Leonard,
> 
> Thank you very much for the correction. I will try to find a sample of tagged PDF...
> 
> Regards,
> mpsuzuki
> 
> Leonard Rosenthol wrote:
>> What you wrote in #1 below is true for non-tagged PDF.  When you have a tagged PDF - a PDF in which there is proper semantic structure - then the annotations (links and others) are directly connected to the object (text, image, etc.).
>>
>> Leonard
>>
>> -----Original Message-----
>> From: poppler <poppler-bounces at lists.freedesktop.org> On Behalf Of suzuki toshiya
>> Sent: Thursday, December 20, 2018 4:10 AM
>> To: poppler at lists.freedesktop.org
>> Subject: [poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?
>>
>> Hi,
>>
>> Recently Jeroen Ooms asked me whether "links" in a PDF could be retrieved via the cpp frontend. Reading the sources, I found that some basic utilities are already included, but I could not understand how to use them. Please let me summarize my understanding of the current situation and ask some questions.
>>
>> 1) "hyperlink" in PDF
>>
>> In PDF, there is no straightforward "hyperlink" that can be treated like "<a href='aaa'>bbb</a>". PDF can include "Annot" objects; an Annot object consists of a region and its related actions.
>> If an HTML document with hyperlinks is converted to PDF via WebKit, each hyperlink is converted to an Annot consisting of a rectangular region (overlapping the annotated text, such as bbb in the above example) and a URI (aaa in the above example).
>>
>> However, the text "bbb" itself is not part of the Annot object.
>> In fact, a hyperlink in a PDF is not always attached to text; it could be attached to a graphical object, or it could even be attached to "nothing" (only the clickable region is defined).
>>
>> 2) Annot in poppler
>>
>> In poppler, there is a class "Annot". Depending on the kind of annotation, there are several variants of Annot, such as AnnotPopup, AnnotMarkup, AnnotMovie, AnnotScreen, and AnnotLink.
>>
>> Page object has a method getAnnots() which returns an object listing the Annot objects in the page. By checking the subtype of Annot objects, we can select AnnotLink objects only.
>>
>> As written above, an AnnotLink object itself does not say which objects the annotation is attached to. To identify the text objects covered by a given link, TextPage::coalesce() includes the following code (executed if doHTML is true):
>>
>>     //----- handle links
>>     for (i = 0; i < links->getLength(); ++i) {
>>       link = (TextLink *)links->get(i);
>>
>>       // rot = 0
>>       if (pools[0]->minBaseIdx <= pools[0]->maxBaseIdx) {
>>         startBaseIdx = pools[0]->getBaseIdx(link->yMin);
>>         endBaseIdx = pools[0]->getBaseIdx(link->yMax);
>>         for (j = startBaseIdx; j <= endBaseIdx; ++j) {
>>           for (word0 = pools[0]->getPool(j); word0; word0 = word0->next) {
>>             if (link->xMin < word0->xMin + hyperlinkSlack &&
>>                 word0->xMax - hyperlinkSlack < link->xMax &&
>>                 link->yMin < word0->yMin + hyperlinkSlack &&
>>                 word0->yMax - hyperlinkSlack < link->yMax) {
>>               word0->link = link->link;
>>             }
>>           }
>>         }
>>       }
>>
>> If a word is found to lie within the region of an AnnotLink (allowing for hyperlinkSlack), the link property of the TextWord object is set to that link. If this runs as expected, we can retrieve the hyperlinked URI for each word.
>>
>> 3) my question
>>
>> TextPage::coalesce() assumes that the TextPage object has a "links" property, a GooList of TextLink objects. Given an AnnotLink, a TextLink object can be added with TextPage::addLink(). If we pass an AnnotLink object to the TextOutputDev::processLink() method, TextPage::addLink() is called internally.
>>
>> The scenario I have in mind is something like this:
>> step 1) take a Page object and get the Annots object from it.
>> step 2) get each Annot object from the Annots object and, if it is an AnnotLink, pass it to TextOutputDev::processLink().
>> step 3) execute TextPage::coalesce() and collect the words.
>>
>> When I tried to apply this scenario to the current poppler-cpp, I found it difficult.
>>
>> The current poppler-cpp creates a TextOutputDev and renders the PDF onto it with PDFDoc::displayPageSlice(). In displayPageSlice(), Annot objects are handled like this:
>>
>>   // draw annotations
>>   annotList = getAnnots();
>>
>>   if (annotList->getNumAnnots() > 0) {
>>     if (globalParams->getPrintCommands()) {
>>       printf("***** Annotations\n");
>>     }
>>     for (i = 0; i < annotList->getNumAnnots(); ++i) {
>>         Annot *annot = annotList->getAnnot(i);
>>         if ((annotDisplayDecideCbk &&
>>              (*annotDisplayDecideCbk)(annot, annotDisplayDecideCbkData)) ||
>>             !annotDisplayDecideCbk) {
>>              annotList->getAnnot(i)->draw(gfx, printing);
>>         }
>>     }
>>     out->dump();
>>   }
>>
>> This means that Annots with a visible appearance are handled (drawn), but objects like AnnotLink are not.
>>
>> Also, the Page object is built and destroyed during displayPageSlice(), so an AnnotLink passed in before that call does not change the result (it is lost when the Page object is constructed).
>>
>> Given that displayPageSlice() is not suited to reflecting AnnotLink, should I write something similar to displayPageSlice() but slightly different, so that AnnotLink is reflected?
>>
>> If there is a good example of handling hyperlinks in PDF with the poppler library, please let me know.
>>
>> Regards,
>> mpsuzuki
>>
>> _______________________________________________
>> poppler mailing list
>> poppler at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/poppler