[poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?

Tue Dec 25 03:53:17 UTC 2018

Dear Albert,

Thank you for your response!

Albert Astals Cid wrote:
>> Should I request the merge of some code in HtmlOutputDev to TextOutputDev,
>> or, request the inclusion of HtmlOutputDev into poppler/ tree?

> Doing it is going to be a pain in the ass and heuristics that will always break and people will always complain that your magic is not perfect and they want better magic.

Indeed.

> IMHO just provide a set of rectangles like the glib and qt frontends do.

I see. It is reasonable to do the same as the other frontends.

> Also we should kill the enableHTMLExtras part since no one is using it.

Although the programs in xpdf do not use it, the enableHTMLExtras() method is
defined in xpdf's original TextOutputDev. Thus, it may be worth keeping it
until xpdf removes it, for better compatibility. The part of xpdf's
TextOutputDev enabled by doHTML is used by xpdf's pdftohtml; there, doHTML is
set during the construction of TextOutputDev. Poppler's constructor of
TextOutputDev does not set doHTML, so enableHTMLExtras() is the only way for
poppler users to change it.

But if poppler recommends that users use HtmlOutputDev instead of
TextOutputDev to retrieve HTML-related info from a PDF document, removing the
doHTML-related part of TextOutputDev would be a reasonable option. In that
case, the inclusion of HtmlOutputDev into libpoppler would be the first step.

Also, in xpdf's source there is ImageOutputDev. Is there any problem with
including poppler's ImageOutputDev in libpoppler?

Regards,
mpsuzuki

Albert Astals Cid wrote:
> On Saturday, 22 December 2018, at 11:04:40 CET, suzuki toshiya wrote:
>> Dear Leonard,
>>
>> Thank you for the sample of Tagged PDF!
>> I found that pdftohtml can extract hyperlinks from both Tagged PDF and (non-tagged) PDF.
>>
>> --
>>
>> TextOutputDev has an internal switch "doHTML" which enables Annot handling
>> when it is true. It is false by default, but it can be switched on by the
>> enableHTMLExtras() method. However, I cannot find any example of its use in
>> utils (and I'm afraid the cpp/glib/qt5 frontends do not provide a public API
>> to switch it).
>>
>> Should I request the merge of some code in HtmlOutputDev to TextOutputDev,
>> or, request the inclusion of HtmlOutputDev into poppler/ tree?
> 
> Non-tagged PDF doesn't have text in links, so my recommendation is to not pretend it does; let the using application do the text<->rectangle merging if it wants.
> 
> Doing it is going to be a pain in the ass and heuristics that will always break and people will always complain that your magic is not perfect and they want better magic.
> 
> IMHO just provide a set of rectangles like the glib and qt frontends do.
> 
> Also we should kill the enableHTMLExtras part since no one is using it.
> 
> Cheers,
>   Albert
> 
> 
>> Regards,
>> mpsuzuki
>>
>> Leonard Rosenthol wrote:
>>> Here is one.
>>>
>>> Be aware that you MUST process the file according to the rules for Tagged PDF (aka walk the structure tree) and *NOT* using the content model (as the OutputDev's do in Poppler).
>>>
>>> Leonard
>>>
>>> -----Original Message-----
>>> From: suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp> 
>>> Sent: Thursday, December 20, 2018 8:02 AM
>>> To: Leonard Rosenthol <lrosenth at adobe.com>
>>> Cc: poppler at lists.freedesktop.org
>>> Subject: Re: how to reflect "hyperlinks" in PDF to TextOutputDev?
>>>
>>> Dear Leonard,
>>>
>>> Thank you very much for the correction. I will try to find a sample of tagged PDF...
>>>
>>> Regards,
>>> mpsuzuki
>>>
>>> Leonard Rosenthol wrote:
>>>> What you wrote in #1 below is true for non-tagged PDF.  When you have a tagged PDF - a PDF in which there is proper semantic structure - then the annotations (links and others) are directly connected to the object (text, image, etc.).
>>>>
>>>> Leonard
>>>>
>>>> -----Original Message-----
>>>> From: poppler <poppler-bounces at lists.freedesktop.org> On Behalf Of suzuki toshiya
>>>> Sent: Thursday, December 20, 2018 4:10 AM
>>>> To: poppler at lists.freedesktop.org
>>>> Subject: [poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?
>>>>
>>>> Hi,
>>>>
>>>> Recently Jeroen Ooms asked me whether "links" in PDF could be retrieved via cpp-frontend. Reading the sources, I found some basic utilities are included in the sources already, but I could not understand how to use them. Please let me summarize my understanding of the current situation and ask some questions.
>>>>
>>>> 1) "hyperlink" in PDF
>>>>
>>>> In PDF, there is no straightforward "hyperlink" which could be handled like "<a href='aaa'>bbb</a>". PDF can include "Annot" objects; an Annot object consists of a region and related actions.
>>>> If an HTML document with hyperlinks is converted to PDF via WebKit, each hyperlink is converted to an Annot which consists of a rectangular region (overlapping the annotated text, like bbb in the above example) and a URI (aaa in the above example).
>>>>
>>>> However, the text "bbb" itself is not part of the Annot object.
>>>> In fact, a hyperlink in PDF is not always attached to text; it could be attached to a graphical object, or maybe to "nothing" (only the region to be clicked is defined).
>>>>
>>>> 2) Annot in poppler
>>>>
>>>> In poppler, there is a class "Annot". Depending on the related actions, there are several variants of Annot, such as AnnotPopup, AnnotMarkup, AnnotMovie, AnnotScreen, and AnnotLink.
>>>>
>>>> Page object has a method getAnnots() which returns an object listing the Annot objects in the page. By checking the subtype of Annot objects, we can select AnnotLink objects only.
>>>>
>>>> As written above, an AnnotLink object itself does not tell which objects the annotation is attached to. To identify the text objects to be given the link info, TextPage::coalesce() includes the following code (executed if doHTML is true):
>>>>
>>>>     //----- handle links
>>>>     for (i = 0; i < links->getLength(); ++i) {
>>>>       link = (TextLink *)links->get(i);
>>>>
>>>>       // rot = 0
>>>>       if (pools[0]->minBaseIdx <= pools[0]->maxBaseIdx) {
>>>>         startBaseIdx = pools[0]->getBaseIdx(link->yMin);
>>>>         endBaseIdx = pools[0]->getBaseIdx(link->yMax);
>>>>         for (j = startBaseIdx; j <= endBaseIdx; ++j) {
>>>>           for (word0 = pools[0]->getPool(j); word0; word0 = word0->next) {
>>>>             if (link->xMin < word0->xMin + hyperlinkSlack &&
>>>>                 word0->xMax - hyperlinkSlack < link->xMax &&
>>>>                 link->yMin < word0->yMin + hyperlinkSlack &&
>>>>                 word0->yMax - hyperlinkSlack < link->yMax) {
>>>>               word0->link = link->link;
>>>>             }
>>>>           }
>>>>         }
>>>>       }
>>>>
>>>> If a word is found to lie within the region of an AnnotLink (with a tolerance of hyperlinkSlack), the link property of the TextWord object is set to that link. If this works, we can retrieve the hyperlinked URI for each word.
>>>>
>>>> 3) my question
>>>>
>>>> TextPage::coalesce() assumes that the TextPage object has a "links" property, a GooList of TextLink objects. Given an AnnotLink, a TextLink object can be added by TextPage::addLink(). If we pass an AnnotLink object to the TextOutputDev::processLink() method, TextPage::addLink() is called internally.
>>>>
>>>> The scenario I guess is something like this:
>>>> step 1) take a Page object, and get its Annots.
>>>> step 2) get each Annot object from the Annots object and, if it is an AnnotLink, pass it to TextOutputDev::processLink().
>>>> step 3) execute TextPage::coalesce() and collect the words.
>>>>
>>>> Trying to apply this scenario to the current poppler-cpp, I found it hard.
>>>>
>>>> The current poppler-cpp creates a TextOutputDev and renders the PDF onto it with PDFDoc::displayPageSlice(). In displayPageSlice(), Annot objects are handled like this:
>>>>
>>>>   // draw annotations
>>>>   annotList = getAnnots();
>>>>
>>>>   if (annotList->getNumAnnots() > 0) {
>>>>     if (globalParams->getPrintCommands()) {
>>>>       printf("***** Annotations\n");
>>>>     }
>>>>     for (i = 0; i < annotList->getNumAnnots(); ++i) {
>>>>         Annot *annot = annotList->getAnnot(i);
>>>>         if ((annotDisplayDecideCbk &&
>>>>              (*annotDisplayDecideCbk)(annot, annotDisplayDecideCbkData)) ||
>>>>             !annotDisplayDecideCbk) {
>>>>              annotList->getAnnot(i)->draw(gfx, printing);
>>>>         }
>>>>     }
>>>>     out->dump();
>>>>   }
>>>>
>>>> This means that Annots with visible shapes are taken care of, but objects like AnnotLink are not.
>>>>
>>>> Also, during the displayPageSlice() process the Page object is built and destroyed, so an AnnotLink inserted beforehand does not change the result (it is destroyed by the construction of the Page object).
>>>>
>>>> Considering that displayPageSlice() is not suitable for reflecting AnnotLink, should I write something similar to displayPageSlice() but slightly different, so that AnnotLink is reflected?
>>>>
>>>> If there is a good example of handling hyperlinks in PDF with the poppler library, please let me know.
>>>>
>>>> Regards,
>>>> mpsuzuki
>>>>
>>>> _______________________________________________
>>>> poppler mailing list
>>>> poppler at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/poppler