[poppler] Python - PDF - highlighted text (not annotation popup) - How to extract its text

Albert Astals Cid aacid at kde.org
Sun Nov 9 09:24:49 PST 2014


On Sunday, 9 November 2014, at 17:41:49, bruno gallart wrote:
> Good day Albert, and thank you for your reply,
> (I am not Catalan but Languedocian, from Béziers, and my Catalan is a long way off)
> 
> Thanks for your response Albert,
> 
> But I have read the poppler API and I do not see the object and
> the method for this (/Rect ---> extract text with x,y coordinates). My
> question is quite boring, but do you know which object I must use to
> do this extraction?

Using the Qt4 frontend, I'd use Poppler::Page::text(rect).
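
A minimal Python sketch of the coordinate step this suggestion implies, not a tested recipe. It assumes that Poppler::Page::text(rect) expects a rectangle in points with its origin at the top-left of the page, whereas /Rect in the PDF is [llx lly urx ury] with the origin at the bottom-left. The helper name pdf_rect_to_top_left and the A4 page height of 841.89 pt are assumptions made for illustration; the real height should be read from the page itself.

```python
def pdf_rect_to_top_left(rect, page_height):
    """Convert a PDF /Rect [llx, lly, urx, ury] (origin bottom-left)
    into (x, y, width, height) with the origin at the top-left."""
    llx, lly, urx, ury = rect
    # Normalise in case the two corners are given in the other order.
    x0, x1 = min(llx, urx), max(llx, urx)
    y0, y1 = min(lly, ury), max(lly, ury)
    # Flip the vertical axis: the top edge (y1) becomes the new y.
    return (x0, page_height - y1, x1 - x0, y1 - y0)


# /Rect of the highlight annotation quoted in this thread.
HIGHLIGHT_RECT = [112.707338, 807.385499, 164.672639, 816.770264]
# Assumed page height (A4, in points); it is not stated in the thread.
ASSUMED_PAGE_HEIGHT = 841.89

x, y, w, h = pdf_rect_to_top_left(HIGHLIGHT_RECT, ASSUMED_PAGE_HEIGHT)
print("x=%.3f y=%.3f w=%.3f h=%.3f" % (x, y, w, h))
```

If Python bindings to the Qt4 frontend are available, the resulting tuple could presumably be wrapped in a QRectF and passed to the page's text() method; that call is left out here because it depends on the installed bindings. Page rotation and a crop box that does not start at the origin are not handled.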

Cheers,
  Albert
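
The earlier reply quoted below notes that /Rect and /QuadPoints depict the same area, with /Rect carrying a bit more padding. A small Python sketch to check that on the numbers from this thread: it takes the bounding box of the /QuadPoints array and measures how far /Rect extends beyond it. The function names quadpoints_bbox and padding are hypothetical helpers, not part of poppler or any binding.

```python
def quadpoints_bbox(quad_points):
    """Return (llx, lly, urx, ury) enclosing every quadrilateral in a
    /QuadPoints array (8 numbers, i.e. four x,y pairs, per quadrilateral)."""
    if len(quad_points) == 0 or len(quad_points) % 8 != 0:
        raise ValueError("/QuadPoints must hold a multiple of 8 numbers")
    xs = quad_points[0::2]  # every even index is an x coordinate
    ys = quad_points[1::2]  # every odd index is a y coordinate
    return (min(xs), min(ys), max(xs), max(ys))


def padding(rect, bbox):
    """How far rect extends beyond bbox: (left, bottom, right, top).
    A negative value means rect is slightly inside bbox on that side."""
    return (bbox[0] - rect[0], bbox[1] - rect[1],
            rect[2] - bbox[2], rect[3] - bbox[3])


# Values copied from the annotation object quoted in this thread.
RECT = [112.707338, 807.385499, 164.672639, 816.770264]
QUAD_POINTS = [114.570002, 816.770274, 162.809979, 816.770274,
               114.570002, 807.385508, 162.809979, 807.385508]

bbox = quadpoints_bbox(QUAD_POINTS)
print("bbox:", bbox)
print("padding (left, bottom, right, top):", padding(RECT, bbox))
```

For these values the horizontal padding is about 1.86 pt on each side, while the top and bottom edges agree to within about 0.00001 pt. A highlight that spans several lines would carry one quadrilateral per line, so extracting text per quadrilateral may be more precise than using the single enclosing box.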

> 
> Thanks a lot
> Many thanks
> 
> Bruno
> 
> On 09/11/2014 16:48, Albert Astals Cid wrote:
> > On Sunday, 9 November 2014, at 10:38:08, bruno gallart wrote:
> >> Hello,
> >> 
> >> I read many PDF texts. I don't add popup annotations; I only highlight
> >> text in yellow. I want to extract this text (with Python) so that I can
> >> index it with Whoosh afterwards for my studies. I saw that when text is
> >> highlighted, the object created in the PDF file is:
> >> 
> >> 20 0 obj
> >> <<
> >> /C [1 1 0]
> >> /F 4
> >> /M (D:20141107203743+01'00')
> >> /P 7 0 R
> >> /T (bruno)
> >> /AP <<
> >> /N 31 0 R
> >> >>
> >> /NM (38048b89-6e9f-4434-9cae2b25dfc8c8a2)
> >> /Rect [112.707338 807.385499 164.672639 816.770264]
> >> /Subj (Surligner)
> >> /Subtype /Highlight
> >> /QuadPoints [114.570002 816.770274 162.809979 816.770274 114.570002
> >> 807.385508 162.809979 807.385508]
> >> /CreationDate (D:20141107203743+01'00')
> >> >>
> >> endobj
> >> 
> >> Unlike a classical annotation, there is no "/Contents" key here, and
> >> that is my problem. I have tried pdfMiner, pyPDF, PyPDF2 and now
> >> pyPoppler, but I am not very good and cannot find a way to
> >> extract the line I want.
> >> 
> >> My question:
> >> 
> >> Can the /QuadPoints key give me a link to the highlighted text? Or can
> >> the /Rect key do this?
> > 
> > They are both "the same"; it seems that in this case Rect has a bit more
> > "padding", but they depict the same area.
> > 
> > Yes, you should be able to use that rect to get the text in there.
> > 
> > Cheers,
> > 
> >    Albert
> >> 
> >> If somebody can give me some advice, I will be happy.
> >> 
> >> Thanks for your patience
> >> 
> >> Bruno
> > 
> > _______________________________________________
> > poppler mailing list
> > poppler at lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/poppler
> 


