[poppler] RFC: whole-page search in the qt4 frontend
Adam Reichold
adamreichold at myopera.com
Thu Jun 28 11:38:45 PDT 2012
On 28.06.2012 20:35, Albert Astals Cid wrote:
> On Thursday, 28 June 2012, at 20:16:30, Adam Reichold wrote:
> Hello,
>
> On 28.06.2012 19:36, Albert Astals Cid wrote:
>>>> On Thursday, 28 June 2012, at 18:54:40, Ihar `Philips`
>>>> Filipau wrote:
>>>>> On 6/28/12, Adam Reichold <adamreichold at myopera.com>
>>>>> wrote:
>>>>>> If I remember correctly, some time ago someone proposed
>>>>>> caching the TextOutputDev/TextPage used in
>>>>>> Poppler::Page::search to improve performance. Instead, I
>>>>>> would propose to add another search method to
>>>>>> Poppler::Page which searches the whole page at once and
>>>>>> returns a list of all occurrences. [>>snip<<] Testing
>>>>>> this with some sample files shows large improvements
>>>>>> (above 100% as measured by runtime) for searching the
>>>>>> whole document and especially for short phrases that
>>>>>> occur often.
>>>>>>
>>>>>> Thanks for any comments and advice. Best regards, Adam.
>>>>>
>>>>> That was me. Use case: I was checking the results of
>>>>> converting a large PDF into an e-book.
>>>>>
>>>>> The PDF was a 600+ page book: 325K words in total, 20K
>>>>> unique.(*) The problem was (and is) that there is no way to
>>>>> point at a piece of text in the PDF - search was (and is)
>>>>> the only option. The conversion produced around 200 warnings
>>>>> - and I had to check them all. Meaning: 200 times searching
>>>>> for a group of words in a 600+ page document. IIRC it was
>>>>> taking 6-7 seconds per search in Okular (up-to-date version
>>>>> from Debian Sid). (Other PDF viewers haven't fared better.
>>>>> But the multi-word search is unique to Okular and was the
>>>>> reason why I used it exclusively.)
>>>>>
>>>>> Any speed up would have been extremely helpful. :)
>>>>
>>>> This won't help Okular at all.
>>>>
>>>> Cheers, Albert
>
> I see. Would you consider including it (if deemed technically fit)
> anyway?
>
>> Including what? Your patch? In poppler or in Okular?
The patch adding the whole-page search function to the qt4 frontend of
poppler.
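
To illustrate the difference between per-hit and whole-page search,
here is a minimal, self-contained C++ sketch. It is not the patch and
does not use the poppler API: `Page`, `extractText`, `searchNext` and
`searchAll` are hypothetical names, and the assumption that every
per-hit call rebuilds the page text is taken from the caching proposal
quoted in this thread, not measured.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a page; NOT a poppler type.
struct Page {
    std::string raw;
    mutable int extractions;  // counts how often the text is rebuilt

    explicit Page(std::string r) : raw(std::move(r)), extractions(0) {}

    // Models the (assumed) expensive step of building the page text.
    std::string extractText() const {
        ++extractions;
        return raw;
    }
};

// Per-hit style: every call rebuilds the text, then returns the offset
// of the next occurrence at or after `from` (npos if there is none).
std::size_t searchNext(const Page &page, const std::string &needle,
                       std::size_t from) {
    if (needle.empty()) return std::string::npos;
    const std::string text = page.extractText();
    return text.find(needle, from);
}

// Whole-page style: the text is built once and all occurrences
// (including overlapping ones) are returned in a single list.
std::vector<std::size_t> searchAll(const Page &page,
                                   const std::string &needle) {
    std::vector<std::size_t> hits;
    if (needle.empty()) return hits;
    const std::string text = page.extractText();
    for (std::size_t pos = text.find(needle); pos != std::string::npos;
         pos = text.find(needle, pos + 1)) {
        hits.push_back(pos);
    }
    return hits;
}
```

In this model, finding three hits on a page costs four text
extractions with `searchNext` (three hits plus the final miss) but
only one with `searchAll`.
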
Best regards, Adam.
>> Albert
>
>
> Best regards, Adam.
>
>>>>> Though the most annoying part was not the waiting time -
>>>>> manually checking 200+ warnings is never going to be fast -
>>>>> it was that my CPU fan started spinning up loudly: for those
>>>>> 6-7 seconds Okular was taking 100% CPU.
>>>>>
>>>>> (*) I have the params noted, since I was actually imagining
>>>>> more of a per-word search index for a PDF. Now, looking at
>>>>> your patch, I can even calculate the memory requirements. A
>>>>> global word index, 325K words, say 32 wchar_t each + int
>>>>> page + sizeof(rectf), is about 32MB - not much by modern
>>>>> standards. Per unique word it is even less: 20K unique
>>>>> words, about 20 hits per word on average -> wchar_t
>>>>> word[32]; { int page; rectf rect } x 20 ->
>>>>> 32*sizeof(wchar_t) + 20*(4 + 4*sizeof(double)) -> 784
>>>>> bytes. That multiplied by 20K words: about 16MB. (Plus, of
>>>>> course, the memory allocation overhead. With these types of
>>>>> structures, it can already bite.)
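
The arithmetic in the footnote above can be checked with a small C++
sketch. The word counts, the 32 characters per word and the 20 hits
per word come from the message; the function names, the 4-byte `int`,
the 8-byte `double` and the rectangle of four doubles are assumptions
made here for illustration. The size of `wchar_t` is passed in as a
parameter because it differs between platforms.

```cpp
#include <cstddef>

// Figures quoted in the message.
constexpr std::size_t kTotalWords = 325000;   // 325K words in total
constexpr std::size_t kUniqueWords = 20000;   // 20K unique words
constexpr std::size_t kCharsPerWord = 32;     // "say 32 wchar_t each"
constexpr std::size_t kHitsPerWord = 20;      // "about 20 hits per word"

// One hit: int page (assumed 4 bytes) + rectangle of four doubles
// (assumed 8 bytes each).
constexpr std::size_t kHitBytes = 4 + 4 * 8;

// Global index: one (word, hit) entry per word occurrence.
constexpr std::size_t globalIndexBytes(std::size_t wcharBytes) {
    return kTotalWords * (kCharsPerWord * wcharBytes + kHitBytes);
}

// Per-unique-word entry: the word once, followed by its hits.
constexpr std::size_t perWordEntryBytes(std::size_t wcharBytes) {
    return kCharsPerWord * wcharBytes + kHitsPerWord * kHitBytes;
}

// Per-unique-word index: one entry for each unique word.
constexpr std::size_t perWordIndexBytes(std::size_t wcharBytes) {
    return kUniqueWords * perWordEntryBytes(wcharBytes);
}
```

With a 2-byte `wchar_t` this reproduces the quoted figures: 32,500,000
bytes for the global index, 784 bytes per entry and 15,680,000 bytes
for the per-word index. With a 4-byte `wchar_t`, as on Linux with GCC,
the same formulas give 848 bytes per entry and correspondingly larger
totals.
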
>>>>>
>>>>> wbr. _______________________________________________
>>>>> poppler mailing list poppler at lists.freedesktop.org
>>>>> http://lists.freedesktop.org/mailman/listinfo/poppler