[poppler] RFC: whole-page search in the qt4 frontend
Albert Astals Cid
aacid at kde.org
Thu Jun 28 10:36:53 PDT 2012
On Thursday, 28 June 2012 at 18:54:40, Ihar `Philips` Filipau wrote:
> On 6/28/12, Adam Reichold <adamreichold at myopera.com> wrote:
> > If I remember correctly, some time ago someone proposed caching the
> > TextOutputDev/TextPage used in Poppler::Page::search to improve
> > performance. Instead, I would propose to add another search method to
> > Poppler::Page which searches the whole page at once and returns a list
> > of all occurrences.
> > [>>snip<<]
> > Testing this with some sample files shows large improvements (above
> > 100% as measured by runtime) for searching the whole document and
> > especially for short phrases that occur often.
> >
> > Thanks for any comments and advice. Best regards, Adam.
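As a rough sketch, a whole-document search built on top of the proposed
per-page call could look like the following. The searchAll() name and the
QList<QRectF> return type are placeholders for illustration, not necessarily
what the patch adds; Document::numPages(), Document::page() and
Poppler::Page are the existing qt4 frontend API, everything else is assumed.

    // Sketch only: whole-document search using a hypothetical per-page
    // "return all occurrences" method (here called searchAll()).
    #include <poppler-qt4.h>
    #include <QList>
    #include <QPair>
    #include <QRectF>
    #include <QString>

    QList<QPair<int, QRectF> > searchDocument(Poppler::Document *doc,
                                              const QString &needle)
    {
        QList<QPair<int, QRectF> > hits;
        for (int i = 0; i < doc->numPages(); ++i) {
            Poppler::Page *page = doc->page(i);
            if (!page)
                continue;
            // The page text is extracted once and all occurrences are
            // returned together, instead of rebuilding the text layout for
            // every successive one-hit-at-a-time search() call.
            foreach (const QRectF &r, page->searchAll(needle)) // hypothetical
                hits.append(qMakePair(i, r));
            delete page;
        }
        return hits;
    }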
>
> That was me. Use case: I was checking the results of converting a large
> PDF into an e-book.
>
> The PDF was a 600+ page book: 325K words in total, 20K unique. (*)
> The problem was (and is) that there is no way to point at a piece of text
> in a PDF - search was (and is) the only option. The conversion produced
> around 200 warnings, and I had to check them all. That meant 200
> searches for a group of words in a 600+ page document. IIRC it was
> taking 6-7 seconds per search in Okular (up-to-date version from
> Debian Sid). (Other PDF viewers fared no better, but multi-word search
> is unique to Okular and was the reason I used it exclusively.)
>
> Any speed-up would have been extremely helpful. :)
This won't help Okular at all.
Cheers,
Albert
>
> Though the most annoying part was not the waiting time - manually
> checking 200+ warnings was never going to be fast - it was that my CPU
> fan started spinning up loudly: those 6-7 seconds were seconds when
> Okular was using 100% CPU.
>
> (*) I have the parameters noted down, since I was actually imagining more
> of a per-word search index for a PDF.
> Now, looking at your patch, I can even calculate the memory requirements.
> A global word index - 325K words, say 32 wchar_t each + int page +
> sizeof(rectf) - is about 32MB: not much by modern standards.
> Per unique word it is even less: 20K unique words, about 20 hits per
> word on average -> char word[32]; { int page; rectf rect } x 20 ->
> 32*sizeof(wchar_t) + 20*( 4 + 4*sizeof(double)) -> 784 bytes. That,
> multiplied by 20K words, is about 16MB.
> (Plus, of course, the memory allocation overhead; with this kind of
> structure it can already bite.)
>
> wbr.
> _______________________________________________
> poppler mailing list
> poppler at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/poppler