[poppler] RFC: whole-page search in the qt4 frontend

Adam Reichold adamreichold at myopera.com
Thu Jun 28 14:34:49 PDT 2012


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 28.06.2012 22:55, Albert Astals Cid wrote:
> On Thursday, 28 June 2012, at 20:38:45, Adam Reichold wrote:
> On 28.06.2012 20:35, Albert Astals Cid wrote:
>>>> On Thursday, 28 June 2012, at 20:16:30, Adam Reichold wrote:
>>>> Hello,
>>>> 
>>>> On 28.06.2012 19:36, Albert Astals Cid wrote:
>>>>>>> On Thursday, 28 June 2012, at 18:54:40, Ihar `Philips` Filipau wrote:
>>>>>>>> On 6/28/12, Adam Reichold <adamreichold at myopera.com> wrote:
>>>>>>>>> If I remember correctly, some time ago someone
>>>>>>>>> proposed caching the TextOutputDev/TextPage used in
>>>>>>>>> Poppler::Page::search to improve performance.
>>>>>>>>> Instead, I would propose to add another search
>>>>>>>>> method to Poppler::Page which searches the whole
>>>>>>>>> page at once and returns a list of all occurrences.
>>>>>>>>> [>>snip<<] Testing this with some sample files
>>>>>>>>> shows large improvements (above 100% as measured by
>>>>>>>>> runtime) for searching the whole document and
>>>>>>>>> especially for short phrases that occur often.
>>>>>>>>> 
>>>>>>>>> Thanks for any comments and advice. Best regards,
>>>>>>>>> Adam.
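
To make the difference between the two search styles concrete, here is a
minimal sketch. It assumes that each single-hit call repeats the text
extraction, which is what the caching proposal suggests. FakePage,
extractText, searchNext and searchAll are hypothetical names for this
illustration only; none of them is Poppler API, and plain std::string
matching stands in for the real text search.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of a page; NOT the Poppler API.
struct FakePage {
    std::string content;
    mutable int extractions = 0;  // counts how often text extraction ran

    // Stands in for the expensive text-extraction step.
    const std::string& extractText() const {
        ++extractions;
        return content;
    }
};

// Single-hit search: extracts the text on every call, then looks for the
// next occurrence at or after `from`. Returns npos if there is none.
std::size_t searchNext(const FakePage& page, const std::string& needle,
                       std::size_t from) {
    assert(!needle.empty());
    return page.extractText().find(needle, from);
}

// Whole-page search: extracts once and collects every non-overlapping hit.
std::vector<std::size_t> searchAll(const FakePage& page,
                                   const std::string& needle) {
    assert(!needle.empty());
    const std::string& text = page.extractText();
    std::vector<std::size_t> hits;
    for (std::size_t pos = text.find(needle); pos != std::string::npos;
         pos = text.find(needle, pos + needle.size())) {
        hits.push_back(pos);
    }
    return hits;
}
```

In this model, a page with n hits costs n + 1 extractions when walked with
searchNext, but a single extraction with searchAll, which would fit the
observation that short, frequent phrases benefit most.
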
>>>>>>>> 
>>>>>>>> That was me. Use case: I was checking the results of
>>>>>>>> converting a large PDF into an e-book.
>>>>>>>> 
>>>>>>>> The PDF was a 600+ page book: 325K words in total,
>>>>>>>> 20K unique.(*) The problem was (and is) that there is no
>>>>>>>> way to point at a piece of text in the PDF - search was
>>>>>>>> (and is) the only option. The conversion produced around
>>>>>>>> 200 warnings - and I had to check them all. Meaning:
>>>>>>>> 200 times searching for a group of words in a 600+ page
>>>>>>>> document. IIRC it was taking 6-7 seconds per search
>>>>>>>> in Okular (up-to-date version from Debian Sid).
>>>>>>>> (Other PDF viewers haven't fared better. But the
>>>>>>>> multi-word search is unique to Okular and was the 
>>>>>>>> reason why I used it exclusively.)
>>>>>>>> 
>>>>>>>> Any speed up would have been extremely helpful. :)
>>>>>>> 
>>>>>>> This won't help Okular at all.
>>>>>>> 
>>>>>>> Cheers, Albert
>>>> 
>>>> I see. Would you consider including it (if deemed technically
>>>> fit) anyway?
>>>> 
>>>>> Including what? Your patch? in poppler or in okular?
> 
> The patch adding the whole-page search function to the qt4 frontend
> of poppler.
> 
>> I'll have a look at that asap (which might not be very soon) but
>> I don't see why not if the code is fine.
> 
>> Cheers, Albert
> 

Thanks for this and for all the hard work on poppler.

> 
> Best regards, Adam.
> 
>>>>> Albert
>>>> 
>>>> Best regards, Adam.
>>>> 
>>>>>>>> Though the most annoying part was not the waiting
>>>>>>>> time - manually checking 200+ warnings is never going to
>>>>>>>> be fast - it was that my CPU fan started spinning up
>>>>>>>> loudly: those 6-7 seconds were seconds when Okular
>>>>>>>> was taking 100% CPU.
>>>>>>>> 
>>>>>>>> (*) I have the params noted, since I was actually
>>>>>>>> imagining more of a per-word search index for a PDF.
>>>>>>>> Now looking at your patch, I can even calculate the memory
>>>>>>>> requirements. Global word index: 325K words, say 32
>>>>>>>> wchar_t each + int page + sizeof(rectf), is about
>>>>>>>> 32MB - not much by modern standards. Per unique
>>>>>>>> word it is even less: 20K unique words, about 20 hits
>>>>>>>> per word on average -> wchar_t word[32]; { int page;
>>>>>>>> rectf rect } x 20 -> 32*sizeof(wchar_t) + 20*( 4 +
>>>>>>>> 4*sizeof(double)) -> 784 bytes. That multiplied by
>>>>>>>> 20K words: about 16MB. (Plus of course the memory
>>>>>>>> allocation overhead. With these types of structures, it
>>>>>>>> can already bite.)
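
The estimate above can be re-run as a short calculation. This is only a
sketch under stated assumptions: the word length, record layout and counts
are the figures quoted above, and the 784-byte result only comes out with a
2-byte wchar_t (with the 4-byte wchar_t common on Linux it would be 848).
All constant and function names are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Figures quoted in the estimate above.
constexpr std::uint64_t kWordChars   = 32;         // characters stored per word
constexpr std::uint64_t kHitBytes    = 4 + 4 * 8;  // int page + rect of 4 doubles
constexpr std::uint64_t kTotalWords  = 325000;     // words in the document
constexpr std::uint64_t kUniqueWords = 20000;      // distinct words
constexpr std::uint64_t kHitsPerWord = 20;         // rough average per word

// Bytes for one unique word: the word itself plus its list of hits.
// `charBytes` is the assumed size of one wide character.
std::uint64_t perWordBytes(std::uint64_t charBytes, std::uint64_t hits) {
    assert(charBytes > 0);
    return kWordChars * charBytes + hits * kHitBytes;
}

// Index keyed by unique word, each carrying kHitsPerWord hits.
std::uint64_t uniqueIndexBytes(std::uint64_t charBytes) {
    return kUniqueWords * perWordBytes(charBytes, kHitsPerWord);
}

// Flat index: every word occurrence stores its own text and one hit.
std::uint64_t flatIndexBytes(std::uint64_t charBytes) {
    return kTotalWords * perWordBytes(charBytes, 1);
}
```

With 2-byte characters, uniqueIndexBytes(2) gives 15,680,000 bytes (about
16MB) and flatIndexBytes(2) gives 32,500,000 bytes (about 32MB), matching
the figures above before any allocation overhead.
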
>>>>>>>> 
>>>>>>>> wbr.
>>>>>>>> _______________________________________________
>>>>>>>> poppler mailing list
>>>>>>>> poppler at lists.freedesktop.org
>>>>>>>> http://lists.freedesktop.org/mailman/listinfo/poppler

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJP7M35AAoJEPSSjE3STU34NxcH/385EYQj6TcmuFkbdODRINaE
sST4MGANcWiI+Kg/VTihXE1K+cpWZgvld833858Xk0JjT3plpzGSZTA2Ap4eg+WT
Gy0ZUR4weaKeJUfApXDwKxR7lnWlngg/ohdjI3UEi0sj2HMvwIPbki2HYC/tiVnr
jNPy2bffZgl+Y7Zwonc32x6HLDLoddwOH+i2ozgzDozO+Edbz9+2G29Y5nWL1hTC
PnGMGmVqzDnpvxDlHbAM6WXZgioxEk8l3SgnTmhDGX6fnvsQVjIcDDAkgEF7j49X
jYbQ9yXTsKrqINyL7yxfOAyVvgdfbkpvoes5hfRN5YgKVwC+VwTUoWUqcYKaXrM=
=lK1z
-----END PGP SIGNATURE-----

