[poppler] Python - PDF - highlighted text (not annotation popup) - How to extract its text
Albert Astals Cid
aacid at kde.org
Sun Nov 9 09:24:49 PST 2014
On Sunday, 9 November 2014, at 17:41:49, bruno gallart wrote:
> Good day Albert, and thank you for your reply,
> (I am not Catalan but Languedocian, from Besièrs, and my Catalan is very rusty)
>
> Thanks for your response Albert,
>
> But I have read the poppler API and I do not see the object and
> the method for this (/Rect ---> extract text with x,y coordinates). My
> question is quite boring, but do you know which object I must use to
> do this extraction?
Using the qt4 frontend, I'd use Poppler::Page::text(rect).
Cheers,
Albert
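
A minimal Python sketch of the coordinate handling this suggestion implies, not a tested Poppler recipe. It takes the /QuadPoints and /Rect values from the annotation quoted in this thread, computes the bounding box of the quad points, and flips it to a top-left origin. The helper names are hypothetical, the A4 page height of 842 points is an assumption (the real value comes from the page itself), and the idea that Poppler::Page::text(rect) expects top-left-origin points is also an assumption to check against the Qt frontend documentation.

```python
# Hypothetical helpers: plain Python, no Poppler calls are made here.

def quadpoints_to_bbox(quadpoints):
    """Return (x_min, y_min, x_max, y_max) covering every quad point."""
    xs = quadpoints[0::2]  # even indices hold x values
    ys = quadpoints[1::2]  # odd indices hold y values
    return (min(xs), min(ys), max(xs), max(ys))


def pdf_rect_to_top_left(rect, page_height):
    """Convert a PDF rectangle (origin bottom-left, y grows upwards)
    to (x, y, width, height) with the origin at the top-left."""
    x_min, y_min, x_max, y_max = rect
    return (x_min, page_height - y_max, x_max - x_min, y_max - y_min)


# Values copied from the /QuadPoints and /Rect entries of the annotation.
quad = [114.570002, 816.770274, 162.809979, 816.770274,
        114.570002, 807.385508, 162.809979, 807.385508]
rect = [112.707338, 807.385499, 164.672639, 816.770264]

bbox = quadpoints_to_bbox(quad)

# Horizontal "padding" of /Rect relative to the quad points.
pad_left = bbox[0] - rect[0]

PAGE_HEIGHT = 842.0  # assumption: A4 page, height in points
x, y, w, h = pdf_rect_to_top_left(bbox, PAGE_HEIGHT)
```

The resulting x, y, w, h would then be wrapped in a QRectF and passed to the page's text method in whichever Qt binding is used. Whether /Rect or the quad-point box gives cleaner text probably depends on how tightly the lines are spaced.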
>
> Thanks a lot
>
> Bruno
>
> On 09/11/2014 16:48, Albert Astals Cid wrote:
> > On Sunday, 9 November 2014, at 10:38:08, bruno gallart wrote:
> >> Hello,
> >>
> >> I read many PDF texts. I don't add popup annotations; I only
> >> highlight text in yellow. I want to extract this text (with Python)
> >> so that I can index it with Whoosh afterwards for my studies. I saw
> >> that when text is highlighted, the object created in the PDF file is:
> >>
> >> 20 0 obj
> >> <<
> >> /C [1 1 0]
> >> /F 4
> >> /M (D:20141107203743+01'00')
> >> /P 7 0 R
> >> /T (bruno)
> >> /AP <<
> >> /N 31 0 R
> >> >>
> >> /NM (38048b89-6e9f-4434-9cae2b25dfc8c8a2)
> >> /Rect [112.707338 807.385499 164.672639 816.770264]
> >> /Subj (Surligner)
> >> /Subtype /Highlight
> >> /QuadPoints [114.570002 816.770274 162.809979 816.770274 114.570002
> >> 807.385508 162.809979 807.385508]
> >> /CreationDate (D:20141107203743+01'00')
> >> >>
> >> endobj
> >>
> >> Unlike a classical annotation, there is no "/Contents" key here, and
> >> that is my problem. I have tried pdfMiner, pyPDF, PyPDF2 and now
> >> pyPoppler, but I am not very good and cannot find a way to
> >> extract the line I want.
> >>
> >> My question:
> >>
> >> Can the /QuadPoints key lead me to the highlighted text? Or can
> >> the /Rect key do this?
> >
> > They are both "the same"; it seems that in this case Rect has a bit
> > more "padding", but they depict the same area.
> >
> > Yes, you should be able to use that rect to get the text in there.
> >
> > Cheers,
> >
> > Albert
> >>
> >> If somebody can give me some advice, I will be happy.
> >>
> >> Thanks for your patience
> >>
> >> Bruno
> >
> > _______________________________________________
> > poppler mailing list
> > poppler at lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/poppler
>