[poppler] Python - PDF - highlighted text (not annotation popup) - How to extract its text
Albert Astals Cid
aacid at kde.org
Sun Nov 9 07:48:01 PST 2014
On Sunday, 9 November 2014, at 10:38:08, bruno gallart wrote:
> Hello,
>
> I read many PDF texts. I don't create popup annotations; I only highlight
> text in yellow. I want to extract this text (with Python) so that I can
> index it with Whoosh afterwards for my studies. I saw that when text is
> highlighted, the object created in the PDF file is:
>
> 20 0 obj
> <<
> /C [1 1 0]
> /F 4
> /M (D:20141107203743+01'00')
> /P 7 0 R
> /T (bruno)
> /AP <<
> /N 31 0 R
> >>
> /NM (38048b89-6e9f-4434-9cae2b25dfc8c8a2)
> /Rect [112.707338 807.385499 164.672639 816.770264]
> /Subj (Surligner)
> /Subtype /Highlight
> /QuadPoints [114.570002 816.770274 162.809979 816.770274 114.570002
> 807.385508 162.809979 807.385508]
> /CreationDate (D:20141107203743+01'00')
> >>
> endobj
>
> Unlike a classical annotation, there is no "/Contents" key here, and that
> is my problem. I have tried pdfMiner, pyPDF, PyPDF2 and now pyPoppler,
> but I am not very good at this and cannot find a way to extract the line I
> want.
>
> My question:
>
> Can the /QuadPoints key lead me to the highlighted text? Or can the
> /Rect key do this?
They describe essentially the same area; in this case Rect just has a bit
more "padding" around it.
Yes, you should be able to use that rect to get the text inside it.
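For anyone following this up in Python, here is a minimal sketch of the coordinate handling only; it is not code from the poppler project. It turns the /QuadPoints array from the object quoted above into one bounding box per highlighted quad, then flips the y axis, because PDF user space has its origin at the bottom-left while some area-based text calls measure y from the top of the page. The helper names quads_to_boxes and flip_y are hypothetical, and the 842-point page height (A4) is an assumption; read the real height from the page you are processing.

```python
# Values copied from the /Highlight object quoted above.
QUAD_POINTS = [114.570002, 816.770274, 162.809979, 816.770274,
               114.570002, 807.385508, 162.809979, 807.385508]


def quads_to_boxes(quad_points):
    """Split /QuadPoints into one (x_min, y_min, x_max, y_max) box per quad.

    Each quad is 8 numbers (4 corner points). Taking min/max avoids relying
    on the corner order, which differs between PDF producers.
    """
    if len(quad_points) % 8 != 0:
        raise ValueError("QuadPoints length must be a multiple of 8")
    boxes = []
    for i in range(0, len(quad_points), 8):
        quad = quad_points[i:i + 8]
        xs, ys = quad[0::2], quad[1::2]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def flip_y(box, page_height):
    """Convert a box from bottom-left origin (PDF) to top-left origin."""
    x_min, y_min, x_max, y_max = box
    return (x_min, page_height - y_max, x_max, page_height - y_min)


boxes = quads_to_boxes(QUAD_POINTS)
print(boxes)

# ASSUMPTION: 842.0 is the height of an A4 page in points; it is not stated
# in the original message.
print(flip_y(boxes[0], 842.0))
```

Each resulting box can then be handed to whatever area-based text extraction call your binding provides. For a highlight that spans several lines, /QuadPoints holds one quad per line, so it is tighter than /Rect, which covers the whole annotation.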
Cheers,
Albert
>
> If somebody can give me some advice, I would be grateful.
>
> Thanks for your patience
>
> Bruno