[poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?

Thu Dec 27 17:40:15 UTC 2018

On Tuesday, 25 December 2018, at 9:56:41 CET, Adam Reichold wrote:
> Hello mpsuzuki,
> 
> On 25.12.18 at 04:53, suzuki toshiya wrote:
> > Dear Albert,
> > 
> > Thank you for response!
> > 
> > Albert Astals Cid wrote:
> >>> Should I request the merge of some code in HtmlOutputDev to TextOutputDev,
> >>> or, request the inclusion of HtmlOutputDev into poppler/ tree?
> > 
> >> Doing it is going to be a pain in the ass: the heuristics will always break, and people will always complain that your magic is not perfect and that they want better magic.
> > 
> > Indeed.
> > 
> >> IMHO just provide a set of rectangles like the glib and qt frontends do.
> > 
> > I see. It is reasonable to do as other frontends.
> 
> We might even want to factor out some common functionality used for link
> extraction into the core Poppler code to avoid copy&pasting too much code.
> 
> >> Also we should kill the enableHTMLExtras part since no one is using it.
> > 
> > Although the programs in xpdf do not use it, the enableHTMLExtras() method is
> > defined in xpdf's original TextOutputDev. Thus, it may be worth keeping it
> > until xpdf removes it, for better compatibility. The part of xpdf's
> > TextOutputDev that doHTML enables is used by xpdf's pdftohtml; there, doHTML
> > is set during the construction of TextOutputDev. Poppler's TextOutputDev
> > constructor does not touch doHTML, so enableHTMLExtras() is the only way
> > for poppler users to change it.
> 
> I do not think source compatibility with xpdf really exists anymore in
> Poppler. And even where it does, using it is highly discouraged since
> there are no API or ABI compatibility guarantees. So IMHO, we should
> focus on cleaning up the core as much as possible while trying to be
> very responsive to the needs of consuming projects in the frontend
> libraries.
> 
> > But if poppler advised users to use HtmlOutputDev instead of
> > TextOutputDev to retrieve HTML-related info from a PDF document, removing the
> > doHTML-related part of TextOutputDev would be an option worth considering. The
> > inclusion of HtmlOutputDev in libpoppler would be the first step towards that.
> 
> Yes, I think using HtmlOutputDev is preferred for the use case discussed
> here. Hence the doHTML-related parts of TextOutputDev should be removed
> AFAIU.

If someone had lots of time, it would be good to know how HtmlOutputDev compares to TextOutputDev with HTML enabled.

But given that our pdftohtml has been using HtmlOutputDev, changing its behaviour is not a good idea either, unless TextOutputDev with HTML enabled turned out to be much, much better.

> 
> > Also, xpdf's source includes ImageOutputDev. Would there be any problem with
> > including poppler's ImageOutputDev in libpoppler?
> 
> I think that ImageOutputDev and HtmlOutputDev living in utils/
> instead of poppler/ is just a way of keeping poppler/ smaller, as only
> the utilities use these classes. But I certainly see no technical
> reason not to move these output devices into the core library.

We can move it to poppler/, but bear in mind that we don't want people to use poppler/, so moving stuff there without a real plan for how the glib/qt/cpp frontends would use it is probably not the best of ideas.

Cheers,
  Albert

> 
> > Regards,
> > mpsuzuki
> 
> Best regards,
> Adam
> 
> > Albert Astals Cid wrote:
> >> On Saturday, 22 December 2018, at 11:04:40 CET, suzuki toshiya wrote:
> >>> Dear Leonard,
> >>>
> >>> Thank you for the sample of Tagged PDF!
> >>> I found that pdftohtml can extract hyperlinks from both Tagged PDF and non-tagged PDF.
> >>>
> >>> --
> >>>
> >>> TextOutputDev has an internal switch "doHTML" which enables Annot handling
> >>> when it is true. It is false by default, but it can be switched on with the
> >>> enableHTMLExtras() method. However, I cannot find an example of its use in utils
> >>> (and I'm afraid the cpp/glib/qt5 frontends do not provide a public API to switch it).
> >>>
> >>> Should I request the merge of some code in HtmlOutputDev to TextOutputDev,
> >>> or, request the inclusion of HtmlOutputDev into poppler/ tree?
> >>
> >> Non-tagged PDF doesn't have text in its links, so my recommendation is not to pretend it does; let the application using it do the text<->rectangle merging if it wants.
> >>
> >> Doing it is going to be a pain in the ass: the heuristics will always break, and people will always complain that your magic is not perfect and that they want better magic.
> >>
> >> IMHO just provide a set of rectangles like the glib and qt frontends do.
> >>
> >> Also we should kill the enableHTMLExtras part since no one is using it.
> >>
> >> Cheers,
> >>   Albert
> >>
> >>
> >>> Regards,
> >>> mpsuzuki
> >>>
> >>> Leonard Rosenthol wrote:
> >>>> Here is one.
> >>>>
> >>>> Be aware that you MUST process the file according to the rules for Tagged PDF (i.e. walk the structure tree) and *NOT* using the content model (as the OutputDevs do in Poppler).
> >>>>
> >>>> Leonard
> >>>>
> >>>> -----Original Message-----
> >>>> From: suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp> 
> >>>> Sent: Thursday, December 20, 2018 8:02 AM
> >>>> To: Leonard Rosenthol <lrosenth at adobe.com>
> >>>> Cc: poppler at lists.freedesktop.org
> >>>> Subject: Re: how to reflect "hyperlinks" in PDF to TextOutputDev?
> >>>>
> >>>> Dear Leonard,
> >>>>
> >>>> Thank you very much for the correction. I will try to find a sample of Tagged PDF...
> >>>>
> >>>> Regards,
> >>>> mpsuzuki
> >>>>
> >>>> Leonard Rosenthol wrote:
> >>>>> What you wrote in #1 below is true for non-tagged PDF.  When you have a tagged PDF - a PDF in which there is proper semantic structure - then the annotations (links and others) are directly connected to the object (text, image, etc.).
> >>>>>
> >>>>> Leonard
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: poppler <poppler-bounces at lists.freedesktop.org> On Behalf Of suzuki toshiya
> >>>>> Sent: Thursday, December 20, 2018 4:10 AM
> >>>>> To: poppler at lists.freedesktop.org
> >>>>> Subject: [poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> Recently Jeroen Ooms asked me whether "links" in a PDF could be retrieved via the cpp frontend. Reading the sources, I found that some basic utilities are already included, but I could not understand how to use them. Let me summarize my understanding of the current situation and ask some questions.
> >>>>>
> >>>>> 1) "hyperlink" in PDF
> >>>>>
> >>>>> In PDF, there is no straightforward "hyperlink" that could be treated as "<a href='aaa'>bbb</a>". PDF can include "Annot" objects; an Annot object consists of a region and its related actions.
> >>>>> If an HTML document with a hyperlink is converted to PDF via WebKit, the hyperlink is converted to an Annot consisting of a rectangular region (overlapping the annotated text, like bbb in the above example) and a URI (aaa in the above example).
> >>>>>
> >>>>> However, the text "bbb" itself is not part of the Annot object.
> >>>>> In fact, a hyperlink in a PDF is not always attached to text; it could be attached to a graphical object, or it could be attached to "nothing" (only the region to be clicked is defined).
> >>>>>
> >>>>> 2) Annot in poppler
> >>>>>
> >>>>> In poppler, there is a class "Annot". Depending on the related actions, there are several variants of Annot, such as AnnotPopup, AnnotMarkup, AnnotMovie, AnnotScreen, and AnnotLink.
> >>>>>
> >>>>> The Page object has a method getAnnots() which returns an object listing the Annot objects on the page. By checking the subtype of each Annot object, we can select only the AnnotLink objects.
> >>>>>
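
The subtype filtering described above, and the "set of rectangles" suggested earlier in the thread, could look roughly like the following. This is only a sketch: MockAnnot, AnnotKind, Rect, LinkRegion and collectLinks() are made-up names for illustration, not poppler classes or frontend API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; poppler's real Annot classes
// carry far more state than this.
enum class AnnotKind { Popup, Markup, Movie, Screen, Link };

struct Rect {
    double x1, y1, x2, y2;
};

struct MockAnnot {
    AnnotKind kind;
    Rect rect;
    std::string uri;  // only meaningful when kind == AnnotKind::Link
};

// What a frontend could hand to the application: a region plus its target.
struct LinkRegion {
    Rect rect;
    std::string uri;
};

// Keeps only the link annotations and reduces each to rectangle + URI.
// No attempt is made to guess which text the rectangle belongs to.
std::vector<LinkRegion> collectLinks(const std::vector<MockAnnot> &annots) {
    std::vector<LinkRegion> regions;
    for (const MockAnnot &annot : annots) {
        if (annot.kind != AnnotKind::Link) {
            continue;
        }
        // The sketch assumes normalized rectangles (x1 <= x2, y1 <= y2).
        assert(annot.rect.x1 <= annot.rect.x2 && annot.rect.y1 <= annot.rect.y2);
        regions.push_back({annot.rect, annot.uri});
    }
    return regions;
}
```

The application would then decide for itself how, or whether, to relate each returned rectangle to text on the page.
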
> >>>>> As written above, an AnnotLink object itself does not say which objects the annotation is attached to. To identify the text objects that a given link applies to, TextPage::coalesce() includes the following code (executed if doHTML is true):
> >>>>>
> >>>>>     //----- handle links
> >>>>>     for (i = 0; i < links->getLength(); ++i) {
> >>>>>       link = (TextLink *)links->get(i);
> >>>>>
> >>>>>       // rot = 0
> >>>>>       if (pools[0]->minBaseIdx <= pools[0]->maxBaseIdx) {
> >>>>>         startBaseIdx = pools[0]->getBaseIdx(link->yMin);
> >>>>>         endBaseIdx = pools[0]->getBaseIdx(link->yMax);
> >>>>>         for (j = startBaseIdx; j <= endBaseIdx; ++j) {
> >>>>>           for (word0 = pools[0]->getPool(j); word0; word0 = word0->next) {
> >>>>>             if (link->xMin < word0->xMin + hyperlinkSlack &&
> >>>>>                 word0->xMax - hyperlinkSlack < link->xMax &&
> >>>>>                 link->yMin < word0->yMin + hyperlinkSlack &&
> >>>>>                 word0->yMax - hyperlinkSlack < link->yMax) {
> >>>>>               word0->link = link->link;
> >>>>>             }
> >>>>>           }
> >>>>>         }
> >>>>>       }
> >>>>>
> >>>>> If a word is found to overlap the region of an AnnotLink, the link property of the TextWord object is set to the URI. If this runs correctly, we can retrieve the hyperlinked URI for each word.
> >>>>>
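
For readers who want to try the matching outside poppler, here is a minimal self-contained sketch of the same slack-tolerant containment test as the quoted loop. Box, Word, Link, coveredBy() and attachLinks() are assumptions made up for this example, and the slack is passed in as a parameter rather than taken from poppler's hyperlinkSlack.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not poppler classes.
struct Box {
    double xMin, yMin, xMax, yMax;
};

struct Word {
    std::string text;
    Box box;
    std::string uri;  // stays empty when no link covers the word
};

struct Link {
    Box box;
    std::string uri;
};

// True when `word` lies inside `link`, allowing the word to stick out by
// less than `slack` on each side. These are the same four comparisons as
// in the quoted loop.
bool coveredBy(const Box &word, const Box &link, double slack) {
    assert(slack >= 0.0);
    return link.xMin < word.xMin + slack &&
           word.xMax - slack < link.xMax &&
           link.yMin < word.yMin + slack &&
           word.yMax - slack < link.yMax;
}

// Tags each word with the URI of a link that covers it. As in the quoted
// loop, a later link overwrites an earlier one when both cover a word.
void attachLinks(std::vector<Word> &words, const std::vector<Link> &links,
                 double slack) {
    for (const Link &link : links) {
        for (Word &word : words) {
            if (coveredBy(word.box, link.box, slack)) {
                word.uri = link.uri;
            }
        }
    }
}
```

Unlike the quoted code, this sketch scans every word for every link instead of narrowing the search by baseline index first; that keeps it short but makes it quadratic.
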
> >>>>> 3) my question
> >>>>>
> >>>>> TextPage::coalesce() assumes that the TextPage object has a "links"
> >>>>> property, a GooList of TextLink objects. Given an AnnotLink, a TextLink object can be added with TextPage::addLink(). If we pass an AnnotLink object to the TextOutputDev::processLink() method,
> >>>>> TextPage::addLink() is called internally.
> >>>>>
> >>>>> My guess at a workable scenario is something like this:
> >>>>> step 1) take the Page object and get the Annots from it.
> >>>>> step 2) get each Annot object from the Annots object and, if it is an AnnotLink, pass it to TextOutputDev::processLink().
> >>>>> step 3) execute TextOutputDev::coalesce() and collect the words.
> >>>>>
> >>>>> Trying to apply this scenario to the current poppler-cpp, I found it is hard.
> >>>>>
> >>>>> The current poppler-cpp creates a TextOutputDev and renders the PDF onto it with PDFDoc::displayPageSlice(). In displayPageSlice(), Annot objects are handled like this:
> >>>>>
> >>>>>   // draw annotations
> >>>>>   annotList = getAnnots();
> >>>>>
> >>>>>   if (annotList->getNumAnnots() > 0) {
> >>>>>     if (globalParams->getPrintCommands()) {
> >>>>>       printf("***** Annotations\n");
> >>>>>     }
> >>>>>     for (i = 0; i < annotList->getNumAnnots(); ++i) {
> >>>>>         Annot *annot = annotList->getAnnot(i);
> >>>>>         if ((annotDisplayDecideCbk &&
> >>>>>              (*annotDisplayDecideCbk)(annot, annotDisplayDecideCbkData)) ||
> >>>>>             !annotDisplayDecideCbk) {
> >>>>>              annotList->getAnnot(i)->draw(gfx, printing);
> >>>>>         }
> >>>>>     }
> >>>>>     out->dump();
> >>>>>   }
> >>>>>
> >>>>> This means that Annots with a visible shape are taken care of, but objects like AnnotLink are not.
> >>>>>
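
The gate around draw() in the quoted excerpt can be isolated in a few lines. This is a sketch with mock types only: MockAnnot, DecideCbk, wouldDraw() and onlyLinks() are hypothetical names, and whether drawing a link annotation would give TextOutputDev anything useful is a separate question that the sketch does not answer.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for poppler's Annot; only what the sketch needs.
struct MockAnnot {
    std::string subtype;  // e.g. "Link", "Text", "Popup"
};

// Same shape as the callback in the quoted excerpt: return true to draw.
using DecideCbk = bool (*)(const MockAnnot *annot, void *userData);

// Returns the subtypes of the annotations that would reach draw().
// The quoted condition (cbk && cbk(a, d)) || !cbk is equivalent to
// !cbk || cbk(a, d): without a callback, every annotation is drawn.
std::vector<std::string> wouldDraw(const std::vector<MockAnnot> &annots,
                                   DecideCbk cbk, void *userData) {
    std::vector<std::string> drawn;
    for (const MockAnnot &annot : annots) {
        if (!cbk || cbk(&annot, userData)) {
            drawn.push_back(annot.subtype);
        }
    }
    return drawn;
}

// Example callback: lets only link annotations through and counts how
// often it was consulted. userData must point to an int.
bool onlyLinks(const MockAnnot *annot, void *userData) {
    assert(userData != nullptr);
    ++*static_cast<int *>(userData);
    return annot->subtype == "Link";
}
```

The callback is consulted once per annotation, so it sees every annotation on the page whether or not it lets it through.
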
> >>>>> Also, during the displayPageSlice() process, the Page object is built and destroyed, so an AnnotLink inserted before the process does not change the result (it is lost when the Page object is constructed).
> >>>>>
> >>>>> Considering that displayPageSlice() is not suited to reflecting AnnotLink, should I write something like displayPageSlice() but slightly different, so that AnnotLink is reflected?
> >>>>>
> >>>>> If there is a good example of handling hyperlinks in PDF with the poppler library, please let me know.
> >>>>>
> >>>>> Regards,
> >>>>> mpsuzuki
> >>>>>
> >>>>> _______________________________________________
> >>>>> poppler mailing list
> >>>>> poppler at lists.freedesktop.org
> >>>>> https://lists.freedesktop.org/mailman/listinfo/poppler