[poppler] Python - PDF - highlighted text (not annotation popup) - How to extract its text

bruno gallart bruno.gallart at orange.fr
Sun Nov 9 09:38:27 PST 2014





On 09/11/2014 18:24, Albert Astals Cid wrote:
> On Sunday, 9 November 2014, at 17:41:49, bruno gallart wrote:
>> Good day Albert, and thank you for your reply
>> (I am not Catalan but Languedocian, from Béziers, and my Catalan is very far off)
>>
>> Thanks for your response Albert,
>>
>> But I have read the Poppler API and I do not see the object and
>> the method for this (/Rect ---> extract text with x,y coordinates). My
>> question is quite boring, but do you know which object I must use to
>> do this extraction?

Poppler::Page::text(rect) with the rect's coordinates. I have read the API and I am going to try it.
I am going to have a very good evening of programming now with pyPoppler. Thanks, Albert.
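
For reference, a minimal sketch of how the /Rect values could be passed to Poppler::Page::text(rect) from Python. It assumes the python-poppler-qt4 bindings (module popplerqt4) together with PyQt4, and it assumes that the rectangle is expected in points with the origin at the top-left of the page, whereas /Rect in the PDF file uses a bottom-left origin, so y has to be flipped. The function names are hypothetical, and page rotation or a CropBox that does not start at (0, 0) are ignored.

```python
def pdf_rect_to_top_left(rect, page_height):
    """Convert a PDF /Rect [llx lly urx ury] (origin bottom-left) into
    (x, y, width, height) with the origin at the top-left of the page."""
    llx, lly, urx, ury = rect
    x0, x1 = min(llx, urx), max(llx, urx)
    y0, y1 = min(lly, ury), max(lly, ury)
    # The top edge of the box is page_height - y1 when measured from the top.
    return (x0, page_height - y1, x1 - x0, y1 - y0)


def text_in_pdf_rect(pdf_path, page_index, rect):
    """Return the text found inside `rect` on one page (untested sketch)."""
    # Imported here so that the pure helper above works without the bindings.
    import popplerqt4
    from PyQt4 import QtCore

    doc = popplerqt4.Poppler.Document.load(pdf_path)
    page = doc.page(page_index)
    page_height = page.pageSizeF().height()  # page size in points
    x, y, w, h = pdf_rect_to_top_left(rect, page_height)
    return page.text(QtCore.QRectF(x, y, w, h))
```

With the /Rect from the object quoted further down, the call would look like text_in_pdf_rect(path, page_index, [112.707338, 807.385499, 164.672639, 816.770264]), where path and page_index depend on the file. If the result comes back empty, the y flip is the first assumption to check.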

Cheers

Bruno


> Using the qt4 frontend I'd use Poppler::Page::text(rect)
>
> Cheers,
>    Albert
>
>> Thanks a lot
>>
>> Bruno
>>
>> On 09/11/2014 16:48, Albert Astals Cid wrote:
>>> On Sunday, 9 November 2014, at 10:38:08, bruno gallart wrote:
>>>> Hello,
>>>>
>>>> I read many PDF texts. I don't add popup annotations; I only highlight
>>>> text in yellow. I want to extract this text (with Python) to index it
>>>> with Whoosh afterwards, for my studies. I saw that when text is
>>>> highlighted, the object created in the PDF file is:
>>>>
>>>> 20 0 obj
>>>> <<
>>>> /C [1 1 0]
>>>> /F 4
>>>> /M (D:20141107203743+01'00')
>>>> /P 7 0 R
>>>> /T (bruno)
>>>> /AP <<
>>>> /N 31 0 R
>>>> >>
>>>> /NM (38048b89-6e9f-4434-9cae2b25dfc8c8a2)
>>>> /Rect [112.707338 807.385499 164.672639 816.770264]
>>>> /Subj (Surligner)
>>>> /Subtype /Highlight
>>>> /QuadPoints [114.570002 816.770274 162.809979 816.770274 114.570002
>>>> 807.385508 162.809979 807.385508]
>>>> /CreationDate (D:20141107203743+01'00')
>>>> >>
>>>> endobj
>>>>
>>>> Unlike a classical annotation, here there is no "/Contents" key, and
>>>> that is my problem. I have tried pdfMiner, pyPDF, PyPDF2 and now
>>>> pyPoppler, but I am not very good and can't find a way to
>>>> extract the text I want.
>>>>
>>>> My question:
>>>>
>>>> Can the /QuadPoints key give me a link to the highlighted text? Or
>>>> can the /Rect key do this?
>>> They are both "the same"; it seems that in this case Rect has a bit more
>>> "padding", but they depict the same area.
>>>
>>> Yes, you should be able to use that rect to get the text in there.
>>>
>>> Cheers,
>>>
>>>     Albert
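
The remark about "padding" can be checked with a few lines of plain Python. This is only a sketch: the helper name quad_bbox is hypothetical, and the numbers are copied from the object quoted in this thread. It takes the bounding box of the /QuadPoints values and compares it with /Rect.

```python
def quad_bbox(quad_points):
    """Bounding box (xmin, ymin, xmax, ymax) of a flat /QuadPoints list,
    which holds 8 numbers (4 x,y pairs) per highlighted quadrilateral."""
    if len(quad_points) == 0 or len(quad_points) % 8 != 0:
        raise ValueError("expected 8 numbers per quadrilateral")
    xs = quad_points[0::2]  # every even index is an x coordinate
    ys = quad_points[1::2]  # every odd index is a y coordinate
    return (min(xs), min(ys), max(xs), max(ys))


rect = [112.707338, 807.385499, 164.672639, 816.770264]
quads = [114.570002, 816.770274, 162.809979, 816.770274,
         114.570002, 807.385508, 162.809979, 807.385508]

xmin, ymin, xmax, ymax = quad_bbox(quads)
# How far /Rect extends beyond the quad on each side, in points.
padding = (xmin - rect[0], ymin - rect[1], rect[2] - xmax, rect[3] - ymax)
print("bbox:", (xmin, ymin, xmax, ymax))
print("padding (left, bottom, right, top):", padding)
```

For this annotation the padding is roughly 1.86 points on the left and on the right and practically zero vertically, so either key describes the same line of text. A highlight that spans several lines would carry several quadrilaterals, and in that case the individual quads are likely the better input than the single enclosing /Rect.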
>>>> If somebody can give me some advice I will be happy.
>>>>
>>>> Thanks for your patience
>>>>
>>>> Bruno
>>> _______________________________________________
>>> poppler mailing list
>>> poppler at lists.freedesktop.org
>>> http://lists.freedesktop.org/mailman/listinfo/poppler


