[poppler] Python - PDF - highlighted text (not annotation popup) - How to extract its text

Albert Astals Cid aacid at kde.org
Sun Nov 9 07:48:01 PST 2014


On Sunday, 9 November 2014, at 10:38:08, bruno gallart wrote:
> Hello,
> 
> I read a lot of PDF texts. I don't add popup annotations; I only highlight
> text in yellow. I want to extract this highlighted text (with Python) so that
> I can index it with Whoosh afterwards for my studies. I saw that when text is
> highlighted, the object created in the PDF file is:
> 
> 20 0 obj
> <<
> /C [1 1 0]
> /F 4
> /M (D:20141107203743+01'00')
> /P 7 0 R
> /T (bruno)
> /AP <<
> /N 31 0 R
> >>
> /NM (38048b89-6e9f-4434-9cae2b25dfc8c8a2)
> /Rect [112.707338 807.385499 164.672639 816.770264]
> /Subj (Surligner)
> /Subtype /Highlight
> /QuadPoints [114.570002 816.770274 162.809979 816.770274 114.570002
> 807.385508 162.809979 807.385508]
> /CreationDate (D:20141107203743+01'00')
> >>
> endobj
> 
> Unlike a classic annotation, this one has no "/Contents" key, and that is
> my problem. I have tried pdfMiner, pyPDF, PyPDF2 and now pyPoppler, but
> I am not very good at this and cannot find a way to extract the lines I
> want.
> 
> My question:
> 
> Can the /QuadPoints key lead me to the highlighted text? Or can the
> /Rect key do this?

They both describe "the same" thing. In this case /Rect seems to have a bit
more "padding" than /QuadPoints, but they depict the same area.
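
To make the comparison concrete, here is a small sketch that computes the
bounding box of the /QuadPoints values from the quoted object and measures how
far /Rect extends beyond it on each side. The helper name quad_bounding_boxes
is made up for illustration, and the code assumes each highlighted region is
given as one group of eight numbers (four x,y corner pairs).

```python
# Values copied from the annotation object quoted above.
rect = [112.707338, 807.385499, 164.672639, 816.770264]
quad_points = [114.570002, 816.770274, 162.809979, 816.770274,
               114.570002, 807.385508, 162.809979, 807.385508]


def quad_bounding_boxes(points):
    """Return one (x_min, y_min, x_max, y_max) box per group of 8 numbers."""
    boxes = []
    for i in range(0, len(points), 8):
        xs = points[i:i + 8:2]       # x coordinates of the four corners
        ys = points[i + 1:i + 8:2]   # y coordinates of the four corners
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


boxes = quad_bounding_boxes(quad_points)
box = boxes[0]

# How far /Rect extends beyond the quad on each side: left, bottom, right, top.
padding = (box[0] - rect[0], box[1] - rect[1],
           rect[2] - box[2], rect[3] - box[3])
print(box)
print(padding)
```

For this object the extra width is roughly 1.86 points on the left and on the
right, while the top and bottom edges agree to within rounding.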

Yes, you should be able to use that rect to get the text inside it.
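
A rough sketch of how that could look from Python, assuming the poppler-glib
bindings are available through PyGObject (gi.repository.Poppler) and are recent
enough to provide get_text_for_area. The function names to_text_area and
highlighted_text are made up for illustration. The vertical flip is also an
assumption worth checking against your installed version: as far as I know the
annotation area is reported with the origin at the bottom-left of the page,
while the text extraction calls measure from the top-left.

```python
def to_text_area(area, page_height):
    """Flip an (x1, y1, x2, y2) area from bottom-left-origin coordinates
    to top-left-origin coordinates for a page of the given height."""
    x1, y1, x2, y2 = area
    return (x1, page_height - y2, x2, page_height - y1)


def highlighted_text(path):
    """Yield (page_number, text) for each highlight annotation in the PDF."""
    # Imported here so to_text_area can be used without Poppler installed.
    import pathlib
    import gi
    gi.require_version("Poppler", "0.18")
    from gi.repository import Poppler

    # Poppler expects a file:// URI rather than a plain path.
    uri = pathlib.Path(path).resolve().as_uri()
    doc = Poppler.Document.new_from_file(uri, None)
    for number in range(doc.get_n_pages()):
        page = doc.get_page(number)
        _, height = page.get_size()
        for mapping in page.get_annot_mapping():
            # Skip anything that is not a highlight (links, notes, ...).
            if mapping.annot.get_annot_type() != Poppler.AnnotType.HIGHLIGHT:
                continue
            a = mapping.area
            rect = Poppler.Rectangle()
            rect.x1, rect.y1, rect.x2, rect.y2 = to_text_area(
                (a.x1, a.y1, a.x2, a.y2), height)
            yield number + 1, page.get_text_for_area(rect)
```

Usage would be something like: for page_number, text in
highlighted_text("notes.pdf"): print(page_number, text), where notes.pdf is a
placeholder file name. Since the rect is a bit wider than the highlighted
words, the result may include a stray neighbouring character or two.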

Cheers,
  Albert

> 
> If somebody can give me some advice, I would be happy.
> 
> Thanks for your patience
> 
> Bruno


