[poppler] RFC: whole-page search in the qt4 frontend

Adam Reichold adamreichold at myopera.com
Thu Jun 28 11:38:45 PDT 2012


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 28.06.2012 20:35, Albert Astals Cid wrote:
> On Thursday, 28 June 2012, at 20:16:30, Adam Reichold wrote:
> Hello,
> 
> On 28.06.2012 19:36, Albert Astals Cid wrote:
>>>> On Thursday, 28 June 2012, at 18:54:40, Ihar `Philips` Filipau
>>>> wrote:
>>>>> On 6/28/12, Adam Reichold <adamreichold at myopera.com>
>>>>> wrote:
>>>>>> If I remember correctly, some time ago someone proposed
>>>>>> caching the TextOutputDev/TextPage used in
>>>>>> Poppler::Page::search to improve performance. Instead, I
>>>>>> would propose to add another search method to
>>>>>> Poppler::Page which searches the whole page at once and
>>>>>> returns a list of all occurrences. [>>snip<<] Testing 
>>>>>> this with some sample files shows large improvements
>>>>>> (above 100% as measured by runtime) for searching the
>>>>>> whole document and especially for short phrases that
>>>>>> occur often.
>>>>>> 
>>>>>> Thanks for any comments and advice. Best regards, Adam.
>>>>> 
>>>>> That was me. Use case: I was checking the results of converting
>>>>> a large PDF into an e-book.
>>>>> 
>>>>> The PDF was a 600+ page book: 325K words in total, 20K
>>>>> unique.(*) The problem was (and is) that there is no way to
>>>>> point at a piece of text in the PDF - search was (and is) the
>>>>> only option. The conversion produced around 200 warnings - and
>>>>> I had to check them all. Meaning: 200 times searching for a
>>>>> group of words in a 600+ page document. IIRC it was taking
>>>>> 6-7 seconds per search in Okular (up-to-date version
>>>>> from Debian Sid). (Other PDF viewers haven't fared better.
>>>>> But the multi-word search is unique to Okular and was the
>>>>> reason why I used it exclusively.)
>>>>> 
>>>>> Any speed up would have been extremely helpful. :)
>>>> 
>>>> This won't help Okular at all.
>>>> 
>>>> Cheers, Albert
> 
> I see. Would you consider including it (if deemed technically fit)
> anyway?
> 
>> Including what? Your patch? in poppler or in okular?

The patch adding the whole-page search function to the qt4 frontend of
poppler.
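
To illustrate the idea (this is not the patch itself, only a minimal
sketch with hypothetical names, and it works on plain strings instead
of a TextPage): the page text is produced once and a single pass
collects every occurrence, instead of one "find next" call per hit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the text extracted from one page.
// This is not a poppler type; it only models "text built once".
struct PageText {
    std::string text;
};

// Collects the start offset of every non-overlapping occurrence of
// `needle` in a single pass over the already extracted page text.
std::vector<std::size_t> findAll(const PageText &page,
                                 const std::string &needle)
{
    // An empty needle would match at every position, so reject it.
    assert(!needle.empty());

    std::vector<std::size_t> hits;
    std::size_t pos = page.text.find(needle);
    while (pos != std::string::npos) {
        hits.push_back(pos);
        // Continue right after the current match.
        pos = page.text.find(needle, pos + needle.size());
    }
    return hits;
}
```

A caller gets all hits of a page from one call and can iterate over the
returned list, so the cost of preparing the page text is paid once per
page rather than once per occurrence.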

Best regards, Adam.

>> Albert
> 
> 
> Best regards, Adam.
> 
>>>>> Though the most annoying part was not the waiting time -
>>>>> manually checking 200+ warnings is never going to be fast - it
>>>>> was that my CPU fan started spinning up loudly: those 6-7
>>>>> seconds were seconds when Okular was taking 100% CPU.
>>>>> 
>>>>> (*) I have the params noted, since I was actually imagining
>>>>> more of a per-word search index for a PDF. Now looking at
>>>>> your patch, I can even calculate the memory requirements.
>>>>> Global word index, 325K words, say 32 wchar_t each + int page
>>>>> + sizeof(rectf), is about 32MB - not much by modern
>>>>> standards. Per unique word it is even less: 20K unique
>>>>> words, about 20 hits per word on average -> wchar_t word[32];
>>>>> { int page; rectf rect } x 20 -> 32*sizeof(wchar_t) + 20*(
>>>>> 4 + 4*sizeof(double)) -> 784 bytes (with a 2-byte wchar_t).
>>>>> That multiplied by 20K words: about 16MB. (Plus of course the
>>>>> memory allocation overhead. With these types of structures, it
>>>>> can already bite.)
>>>>> 
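The arithmetic in the footnote above can be reproduced with a small
sketch. All names are hypothetical, and the sizes are assumptions
read off the estimate itself: a 2-byte wide character (the value that
gives 784 bytes), a 4-byte int page and a rectangle of four doubles.

```cpp
#include <cstddef>

// Assumed sizes, taken from the estimate above and not from poppler.
constexpr std::size_t kWideCharSize = 2;      // 2-byte wide character
constexpr std::size_t kWordLength   = 32;     // "say 32 wchar_t each"
constexpr std::size_t kPageSize     = 4;      // int page
constexpr std::size_t kRectSize     = 4 * 8;  // rectangle of four doubles

// Flat index: word text + page + rectangle for every word occurrence.
constexpr std::size_t globalIndexBytes(std::size_t totalWords)
{
    return totalWords * (kWordLength * kWideCharSize + kPageSize + kRectSize);
}

// One entry per unique word: the word text once, then one
// (page, rectangle) pair for each hit.
constexpr std::size_t perWordEntryBytes(std::size_t hitsPerWord)
{
    return kWordLength * kWideCharSize + hitsPerWord * (kPageSize + kRectSize);
}

// Index keyed by unique word, with the same number of hits for each word.
constexpr std::size_t uniqueIndexBytes(std::size_t uniqueWords,
                                       std::size_t hitsPerWord)
{
    return uniqueWords * perWordEntryBytes(hitsPerWord);
}
```

With 325K words the flat index comes to 32,500,000 bytes; with 20K
unique words at 20 hits each, one entry is 784 bytes and the whole
index 15,680,000 bytes, both before any allocator overhead.
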
>>>>> wbr.
>>>>> _______________________________________________
>>>>> poppler mailing list
>>>>> poppler at lists.freedesktop.org
>>>>> http://lists.freedesktop.org/mailman/listinfo/poppler

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJP7KS1AAoJEPSSjE3STU34bJ8H/2EO8kujLKhlKLcfJ3Cf6G1G
Dfcpyut1OJ2ZZbTegXK0H6jqHHoakYrstCuENfwEgXiFVU9I9e4G7mb7YbO12f/A
sXQ4Az68/3AJoVAqGX2KBD8DDOnDGhi5Ug4kXLjRtnoi7tLdiYMsCZQJJwPVaQij
9GhvwydCs0ZZyp1UH0UFGTz9Y3eL5ildPZpVcqx+ifG69FxJYGCE6/kWE6Pp928Z
Gx8mHvq2HamwIPKWtu728iOayHOiG/cNoA/PfIHhYs9BBLjoVYIzZ6/7MCYM6rXj
WZg2ZW27P9nbU5EqHCS9iWrSlXyQ3UwByT4y/1bTdzNx/8De2eWAgtFOj0/1ck0=
=Oxcd
-----END PGP SIGNATURE-----

