[poppler] RFC: whole-page search in the qt4 frontend

Thu Jun 28 10:26:30 PDT 2012

Hello.

On 28.06.2012 18:54, Ihar `Philips` Filipau wrote:
> On 6/28/12, Adam Reichold <adamreichold at myopera.com> wrote:
>> 
>> If I remember correctly, some time ago someone proposed caching
>> the TextOutputDev/TextPage used in Poppler::Page::search to
>> improve performance. Instead, I would propose to add another
>> search method to Poppler::Page which searches the whole page at
>> once and returns a list of all occurrences. [>>snip<<] Testing
>> this with some sample files shows large improvements (above 100%
>> as measured by runtime) for searching the whole document and 
>> especially for short phrases that occur often.
>> 
>> Thanks for any comments and advice. Best regards, Adam.
> 
> That was me. Use-case: I was checking the results of converting a
> large PDF into an e-book.
> 
> 
> The PDF was a 600+ page book: 325K words in total, 20K unique.(*)
> The problem was (and is) that there is no way to point at a piece
> of text in the PDF - search was (and is) the only option. The
> conversion produced around 200 warnings - and I had to check them
> all. Meaning: 200 times searching for a group of words in a 600+
> page document. IIRC it was taking 6-7 seconds per search in Okular
> (up-to-date version from Debian Sid). (Other PDF viewers haven't
> fared better. But multi-word search is unique to Okular and was
> the reason why I used it exclusively.)
> 
> Any speed up would have been extremely helpful. :)

I think that whether this yields any speed improvement depends on how
applications actually implement search. (I think Albert Astals Cid
said that Okular does not use Poppler::Page::search at all, didn't
he?) So I am unsure about how much this would have helped in that
particular situation.
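
To illustrate why the calling pattern matters, here is a minimal
sketch in plain C++ that does not use Poppler at all. It assumes that
every call to a find-next style search rebuilds the page text, which
is what the caching proposal quoted above suggests. Page, extractText,
findNext, findAll and findAllViaNext are hypothetical names made up
for this illustration; extractions merely counts how often the
expensive step would run.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a page; not a Poppler type.
struct Page {
    std::string content;
    mutable int extractions = 0;  // how often the text was (re)built

    explicit Page(std::string c) : content(std::move(c)) {}

    // Models the expensive step of building the page text.
    const std::string& extractText() const {
        ++extractions;
        return content;
    }
};

// "Find next" style: one extraction per call.
// Returns the offset of the next hit at or after `from`, or npos.
std::size_t findNext(const Page& page, const std::string& needle,
                     std::size_t from) {
    assert(!needle.empty());
    return page.extractText().find(needle, from);
}

// "Whole page" style: one extraction, all hit offsets returned.
std::vector<std::size_t> findAll(const Page& page,
                                 const std::string& needle) {
    std::vector<std::size_t> hits;
    if (needle.empty()) return hits;
    const std::string& text = page.extractText();
    for (std::size_t pos = text.find(needle); pos != std::string::npos;
         pos = text.find(needle, pos + 1)) {
        hits.push_back(pos);
    }
    return hits;
}

// Collects all hits by calling findNext until it reports a miss.
std::vector<std::size_t> findAllViaNext(const Page& page,
                                        const std::string& needle) {
    std::vector<std::size_t> hits;
    if (needle.empty()) return hits;
    std::size_t pos = findNext(page, needle, 0);
    while (pos != std::string::npos) {
        hits.push_back(pos);
        pos = findNext(page, needle, pos + 1);
    }
    return hits;
}
```

In this model, a page with three hits costs four extractions with the
find-next loop (one per hit plus the final miss) and one extraction
with the whole-page variant. An application that keeps its own text
representation would see no such difference.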

In qpdfview, which I maintain, the search always runs through the
whole document in the background, caching all results so that
navigating them is practically free in terms of runtime. So in some
sense, this method would have exactly the right granularity for that
approach. But I have no idea how helpful this would be for Okular,
especially considering its more advanced features.
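
The search-everything-then-navigate pattern can be sketched roughly
as follows. This is not qpdfview's actual code: pages are plain
strings, searchPage is a hypothetical stand-in for a whole-page
search, Hit and ResultCache are made-up names, and the background
thread is left out to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Hit {
    std::size_t page;    // zero-based page index
    std::size_t offset;  // character offset within the page text
};

// Hypothetical whole-page search: all offsets of needle in one pass.
std::vector<std::size_t> searchPage(const std::string& text,
                                    const std::string& needle) {
    std::vector<std::size_t> offsets;
    if (needle.empty()) return offsets;
    for (std::size_t pos = text.find(needle); pos != std::string::npos;
         pos = text.find(needle, pos + 1)) {
        offsets.push_back(pos);
    }
    return offsets;
}

class ResultCache {
public:
    // Searches every page once and stores the hits in document order.
    void search(const std::vector<std::string>& pages,
                const std::string& needle) {
        hits_.clear();
        current_ = 0;
        for (std::size_t page = 0; page < pages.size(); ++page) {
            for (std::size_t offset : searchPage(pages[page], needle)) {
                hits_.push_back(Hit{page, offset});
            }
        }
    }

    std::size_t count() const { return hits_.size(); }

    const Hit& current() const {
        assert(!hits_.empty());
        return hits_[current_];
    }

    // Navigation only moves an index, so it never touches a page again.
    const Hit& next() {
        assert(!hits_.empty());
        current_ = (current_ + 1) % hits_.size();
        return hits_[current_];
    }

    const Hit& previous() {
        assert(!hits_.empty());
        current_ = (current_ + hits_.size() - 1) % hits_.size();
        return hits_[current_];
    }

private:
    std::vector<Hit> hits_;
    std::size_t current_ = 0;
};
```

With this layout each page is searched exactly once per query, which
is why a method returning all occurrences of a page fits it well.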

> Though the most annoying part was not the waiting time - manually
> checking 200+ warnings is never going to be fast - it was that my
> CPU fan started spinning up loudly: those 6-7 seconds were seconds
> when Okular was taking 100% CPU.
> 
> (*) I have the params noted, since I was actually imagining more of
> a per-word search index for a PDF. Now looking at your patch, I can
> even calc the memory requirements. Global word index, 325K words,
> say 32 wchar_t each + int page + sizeof(rectf), is about 32MB - not
> much by modern standards. Per unique word it is even less: 20K
> unique words, about 20 hits per word on average -> wchar_t word[32];
> { int page; rectf rect } x 20 -> 32*sizeof(wchar_t) + 20*( 4 +
> 4*sizeof(double)) -> 784 bytes. That multiplied by 20K words: about
> 16MB. (Plus of course the memory allocation overhead. With these
> types of structures, it can already bite.)

Someone on the qpdfview mailing list also proposed creating full-text
indices for documents when they are opened. But even though that
isn't much memory by modern standards, I am quite reluctant to add
such persistent memory pressure, especially since a user could open a
lot of documents or several instances of the program.
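
For reference, the quoted estimate can be reproduced with a few
constants. The sizes below are my assumptions: the 784 bytes only come
out with 2-byte characters, a 4-byte page number and a rectangle of
four 8-byte doubles. All names are made up for this calculation.

```cpp
#include <cstddef>

// Assumed sizes (not taken from any real data structure).
constexpr std::size_t kCharBytes = 2;      // 2-byte characters
constexpr std::size_t kWordChars = 32;     // fixed-size word buffer
constexpr std::size_t kPageBytes = 4;      // int page
constexpr std::size_t kRectBytes = 4 * 8;  // four doubles

// Global index: every word occurrence stores its text and position.
constexpr std::size_t globalIndexBytes(std::size_t totalWords) {
    return totalWords * (kWordChars * kCharBytes + kPageBytes + kRectBytes);
}

// Per-unique-word entry: the text once, plus one position per hit.
constexpr std::size_t perWordEntryBytes(std::size_t hitsPerWord) {
    return kWordChars * kCharBytes + hitsPerWord * (kPageBytes + kRectBytes);
}

// Index keyed by unique word.
constexpr std::size_t uniqueIndexBytes(std::size_t uniqueWords,
                                       std::size_t hitsPerWord) {
    return uniqueWords * perWordEntryBytes(hitsPerWord);
}

static_assert(perWordEntryBytes(20) == 784, "matches the quoted figure");
```

Under these assumptions the global index for 325K words comes to
32.5 MB and the per-word index for 20K words with 20 hits each to
about 15.7 MB. With 16 hits per word (325K / 20K, rounded down) it
would be 12.8 MB, and with a 4-byte wchar_t, as is usual on Linux,
each per-word entry grows from 784 to 848 bytes. Allocation overhead
is not included in any of these numbers.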

> wbr.
> 

Best regards, Adam.