[poppler] Python - PDF - highlighted text (not annotation popup) - How to extract its text

bruno gallart bruno.gallart at orange.fr
Sun Nov 9 01:38:08 PST 2014


Hello,

I read many PDF documents. I do not add popup annotations; I only highlight
text in yellow. I want to extract this highlighted text with Python so that
I can index it with Whoosh for my studies. I saw that when text is
highlighted, the object created in the PDF file is:

20 0 obj
<<
/C [1 1 0]
/F 4
/M (D:20141107203743+01'00')
/P 7 0 R
/T (bruno)
/AP <<
/N 31 0 R
>>
/NM (38048b89-6e9f-4434-9cae2b25dfc8c8a2)
/Rect [112.707338 807.385499 164.672639 816.770264]
/Subj (Surligner)
/Subtype /Highlight
/QuadPoints [114.570002 816.770274 162.809979 816.770274 114.570002 807.385508 162.809979 807.385508]
/CreationDate (D:20141107203743+01'00')
>>
endobj

Unlike a classical annotation, this object has no /Contents key, and that is
my problem. I have tried pdfMiner, pyPDF, PyPDF2 and now pyPoppler, but
I am not very experienced and cannot find a way to extract the text I
want.

My question:

Can the /QuadPoints key lead me to the highlighted text? Or can the
/Rect key do this?
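
As a rough sketch of how /QuadPoints might be used: each group of eight
numbers describes one highlighted quadrilateral in PDF page coordinates, so
the highlighted text could be recovered by collecting the words whose
bounding boxes overlap those quadrilaterals. The helper names below
(quads_to_boxes, overlaps, highlighted_words) and the sample word list are
hypothetical. The (text, box) pairs are assumed to come from whatever
text-extraction library is used, and to be expressed in the same
bottom-left-origin coordinate system as the annotation.

```python
from typing import List, Sequence, Tuple

# A box is (x_min, y_min, x_max, y_max) in PDF user space (origin bottom-left).
Box = Tuple[float, float, float, float]


def quads_to_boxes(quad_points: Sequence[float]) -> List[Box]:
    """Turn a flat /QuadPoints array into one bounding box per quadrilateral."""
    if len(quad_points) % 8 != 0:
        raise ValueError("/QuadPoints length must be a multiple of 8")
    boxes = []
    for i in range(0, len(quad_points), 8):
        quad = quad_points[i:i + 8]
        xs = quad[0::2]  # x1, x2, x3, x4
        ys = quad[1::2]  # y1, y2, y3, y4
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def overlaps(a: Box, b: Box) -> bool:
    """Return True when two axis-aligned boxes share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def highlighted_words(quad_points: Sequence[float],
                      words: Sequence[Tuple[str, Box]]) -> List[str]:
    """Keep the words whose box overlaps at least one highlight quadrilateral."""
    boxes = quads_to_boxes(quad_points)
    return [text for text, box in words
            if any(overlaps(box, quad_box) for quad_box in boxes)]


# /QuadPoints values copied from the object above.
QUAD_POINTS = [114.570002, 816.770274, 162.809979, 816.770274,
               114.570002, 807.385508, 162.809979, 807.385508]

# Hypothetical word boxes, standing in for a real extraction library's output.
WORDS = [
    ("Example", (115.0, 808.0, 160.0, 816.0)),  # inside the highlight
    ("other", (170.0, 808.0, 200.0, 816.0)),    # to the right of it
]

print(highlighted_words(QUAD_POINTS, WORDS))
```

If the extraction library reports word boxes with a top-left origin, the y
values would need to be flipped using the page height before comparing.
/Rect alone is coarser: it is a single rectangle enclosing the whole
annotation, so for a highlight spanning several lines it may cover text that
was not marked.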

If somebody can give me some advice, I would be grateful.

Thanks for your patience,

Bruno
