[poppler] RFC: whole-page search in the qt4 frontend

Adam Reichold adamreichold at myopera.com
Thu Jun 28 11:16:30 PDT 2012



Hello,

On 28.06.2012 19:36, Albert Astals Cid wrote:
> On Thursday, 28 June 2012, at 18:54:40, Ihar `Philips`
> Filipau wrote:
>> On 6/28/12, Adam Reichold <adamreichold at myopera.com> wrote:
>>> If I remember correctly, some time ago someone proposed caching
>>> the TextOutputDev/TextPage used in Poppler::Page::search to
>>> improve performance. Instead, I would propose adding another
>>> search method to Poppler::Page which searches the whole page at
>>> once and returns a list of all occurrences. [>>snip<<] Testing
>>> this with some sample files shows large improvements (above
>>> 100% as measured by runtime) for searching a whole document,
>>> especially for short phrases that occur often.
>>> 
>>> Thanks for any comments and advice. Best regards, Adam.
>> 
>> That was me. Use case: I was checking the results of converting a
>> large PDF into an e-book.
>> 
>> The PDF was a 600+ page book: 325K words in total, 20K of them
>> unique. (*) The problem was (and is) that there is no way to point
>> at a piece of text in a PDF - search was (and is) the only option.
>> The conversion produced around 200 warnings, and I had to check
>> them all. Meaning: 200 searches for a group of words in a 600+
>> page document. IIRC each search took 6-7 seconds in Okular
>> (up-to-date version from Debian Sid). (Other PDF viewers didn't
>> fare better, but multi-word search is unique to Okular and was the
>> reason why I used it exclusively.)
>> 
>> Any speed up would have been extremely helpful. :)
> 
> This won't help Okular at all.
> 
> Cheers, Albert

I see. Would you consider including it (if deemed technically fit) anyway?

Best regards, Adam.

>> 
>> Though the most annoying part was not the waiting time -
>> manually checking 200+ warnings is never going to be fast - it was
>> that my CPU fan started spinning up loudly: during those 6-7
>> seconds Okular was using 100% CPU.
>> 
>> (*) I noted the parameters down, since I was actually imagining
>> more of a per-word search index for a PDF. Now, looking at your
>> patch, I can even calculate the memory requirements. Global word
>> index, 325K words, say 32 wchar_t each + int page + sizeof(rectf),
>> is about 32MB - not much by modern standards. Per unique word it
>> is even less: 20K unique words, about 20 hits per word on average
>> -> char word[32]; { int page; rectf rect } x 20 ->
>> 32*sizeof(wchar_t) + 20*( 4 + 4*sizeof(double)) -> 784 bytes.
>> Multiplied by 20K words: about 16MB. (Plus, of course, the memory
>> allocation overhead. With these kinds of structures, it can
>> already bite.)
>> 
>> wbr.
>> _______________________________________________
>> poppler mailing list
>> poppler at lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/poppler
