[poppler] Python - PDF - highlighted text (not annotation popup) - How to extract its text
bruno gallart
bruno.gallart at orange.fr
Sun Nov 9 01:38:08 PST 2014
Hello,
I read many PDF texts. I don't add popup annotations; I only highlight
text in yellow. I want to extract this highlighted text with Python so
that I can index it with Whoosh afterwards for my studies. I saw that
when text is highlighted, the object created in the PDF file is:
20 0 obj
<<
/C [1 1 0]
/F 4
/M (D:20141107203743+01'00')
/P 7 0 R
/T (bruno)
/AP <<
/N 31 0 R
>>
/NM (38048b89-6e9f-4434-9cae2b25dfc8c8a2)
/Rect [112.707338 807.385499 164.672639 816.770264]
/Subj (Surligner)
/Subtype /Highlight
/QuadPoints [114.570002 816.770274 162.809979 816.770274 114.570002
807.385508 162.809979 807.385508]
/CreationDate (D:20141107203743+01'00')
>>
endobj
Unlike a classical annotation, there is no /Contents key here, and that
is my problem. I have tried pdfMiner, pyPDF, PyPDF2 and now pyPoppler,
but I am not very experienced and cannot find a way to extract the line
I want.
My question:
Can the /QuadPoints key lead me to the highlighted text? Or can the
/Rect key do this?
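To make the question concrete, here is a minimal sketch of what I imagine
doing with /QuadPoints. It assumes each group of eight numbers is one
quadrilateral (four x, y corners) in page coordinates, as the PDF
specification describes. The helper names and the word boxes are
hypothetical: the word boxes stand in for whatever a text-extraction
library would return, and a library that puts the origin at the top-left
of the page would probably need the y values flipped first.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def quadpoints_to_boxes(quadpoints: Sequence[float]) -> List[Box]:
    """Split a flat /QuadPoints array into one bounding box per quadrilateral."""
    if len(quadpoints) % 8 != 0:
        raise ValueError("/QuadPoints must contain a multiple of 8 numbers")
    boxes = []
    for i in range(0, len(quadpoints), 8):
        xs = quadpoints[i:i + 8:2]       # the four x coordinates
        ys = quadpoints[i + 1:i + 8:2]   # the four y coordinates
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def words_in_box(words, box: Box) -> List[str]:
    """Keep the words whose centre point lies inside the given box."""
    x0, y0, x1, y1 = box
    hits = []
    for text, (wx0, wy0, wx1, wy1) in words:
        cx, cy = (wx0 + wx1) / 2, (wy0 + wy1) / 2
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            hits.append(text)
    return hits


# Values copied from the /QuadPoints entry of object 20 above.
quad = [114.570002, 816.770274, 162.809979, 816.770274,
        114.570002, 807.385508, 162.809979, 807.385508]
boxes = quadpoints_to_boxes(quad)

# Hypothetical word boxes (text, (x_min, y_min, x_max, y_max)); made-up data,
# not taken from my PDF.
words = [("highlighted", (115.0, 808.0, 160.0, 816.0)),
         ("plain", (170.0, 808.0, 200.0, 816.0))]
highlighted = words_in_box(words, boxes[0])
```

With a highlight that spans several lines, I would expect /QuadPoints to
hold one quadrilateral per line, which is why the helper returns a list of
boxes rather than a single one.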
If somebody can give me some advice, I would be grateful.
Thanks for your patience
Bruno