[poppler] RFC: whole-page search in the qt4 frontend

Ihar `Philips` Filipau thephilips at gmail.com
Thu Jun 28 09:54:40 PDT 2012


On 6/28/12, Adam Reichold <adamreichold at myopera.com> wrote:
>
> If I remember correctly, some time ago someone proposed caching the
> TextOutputDev/TextPage used in Poppler::Page::search to improve
> performance. Instead, I would propose to add another search method to
> Poppler::Page which searches the whole page at once and returns a list
> of all occurrences.
> [>>snip<<]
> Testing this with some sample files shows large improvements (above
> 100% as measured by runtime) for searching the whole document and
> especially for short phrases that occur often.
>
> Thanks for any comments and advice. Best regards, Adam.
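
(For the record, my reading of the new method - just a guess from your
description, the actual name and parameters are whatever the patch
says:

    // Hypothetical signature: build the TextPage once, scan the
    // whole page, and return the bounding boxes of every match.
    QList<QRectF> Poppler::Page::search(const QString &text,
                                        SearchMode caseSensitive) const;

No SearchDirection parameter needed, since the caller gets all hits at
once.)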

That caching proposal was me. The use-case: I was checking the results
of converting a large PDF into an e-book.

The PDF was a 600+ page book: 325K words in total, 20K unique.(*)
The problem was (and is) that there is no way to point at a piece of
text in a PDF - search was (and is) the only option. The conversion
produced around 200 warnings - and I had to check them all. Meaning:
200 searches for a group of words in a 600+ page document. IIRC it was
taking 6-7 seconds per search in Okular (an up-to-date version from
Debian Sid). (Other PDF viewers fared no better, but multi-word search
is unique to Okular and is the reason I used it exclusively.)

Any speed up would have been extremely helpful. :)

Though the most annoying part was not the waiting time - manually
checking 200+ warnings is never going to be fast - it was that my CPU
fan started spinning up loudly: during those 6-7 seconds Okular was
taking 100% CPU.

(*) I have the parameters noted down, since I was actually imagining
more of a per-word search index for a PDF.
Now, looking at your patch, I can even calculate the memory
requirements.
A global word index - 325K words, say 32 wchar_t each (at 2 bytes per
wchar_t) + int page + sizeof(rectf) - comes to 325K * (64 + 4 + 32)
bytes, about 32MB - not much by modern standards.
Per unique word it is even less: 20K unique words, about 20 hits per
word on average -> wchar_t word[32]; { int page; rectf rect } x 20 ->
32*sizeof(wchar_t) + 20*(4 + 4*sizeof(double)) -> 784 bytes. That,
multiplied by 20K unique words, is about 16MB.
(Plus, of course, the memory allocation overhead. With structures of
this shape, it can already bite.)
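
To make that concrete, a minimal sketch of the index I am imagining
(Qt containers just as an illustration - none of this is existing
Poppler API):

    // Hypothetical per-word index - illustration only.
    #include <QHash>
    #include <QList>
    #include <QRectF>
    #include <QString>

    struct Occurrence
    {
        int page;    // page the word occurs on
        QRectF rect; // bounding box of the hit on that page
    };

    // One entry per unique word; for my book, ~20K keys with
    // ~20 occurrences each on average.
    typedef QHash<QString, QList<Occurrence> > WordIndex;

    // Payload per unique word, matching the numbers above:
    //   key:  32 QChars (2 bytes each)       =  64 bytes
    //   hits: 20 * (4 + 4 * sizeof(double))  = 720 bytes
    //   total                                = 784 bytes
    // 784 bytes * 20K unique words ~= 16MB, before struct padding
    // and the per-node overhead of QHash/QList.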

wbr.

