[poppler] RFC: whole-page search in the qt4 frontend
Albert Astals Cid
aacid at kde.org
Thu Jun 28 11:35:27 PDT 2012
On Thursday, 28 June 2012, at 20:16:30, Adam Reichold wrote:
> Hello,
>
> On 28.06.2012 19:36, Albert Astals Cid wrote:
> > On Thursday, 28 June 2012, at 18:54:40, Ihar `Philips` Filipau wrote:
> >> On 6/28/12, Adam Reichold <adamreichold at myopera.com> wrote:
> >>> If I remember correctly, some time ago someone proposed caching
> >>> the TextOutputDev/TextPage used in Poppler::Page::search to
> >>> improve performance. Instead, I would propose to add another
> >>> search method to Poppler::Page which searches the whole page at
> >>> once and returns a list of all occurrences. [>>snip<<] Testing
> >>> this with some sample files shows large improvements (above
> >>> 100% as measured by runtime) for searching the whole document
> >>> and especially for short phrases that occur often.
> >>>
> >>> Thanks for any comments and advice. Best regards, Adam.
> >>
> >> That was me. Use-case: I was checking the results of converting a
> >> large PDF into an e-book.
> >>
> >> The PDF was a 600+ page book: 325K words in total, 20K
> >> unique.(*) The problem was (and is) that there is no way to point
> >> at a piece of text in the PDF - search was (and is) the only
> >> option. The conversion produced around 200 warnings - and I had to
> >> check them all. Meaning: 200 times searching for a group of words
> >> in a 600+ page document. IIRC it was taking 6-7 seconds per search
> >> in Okular (up-to-date version from Debian Sid). (Other PDF viewers
> >> haven't fared better. But the multi-word search is unique to
> >> Okular and was the reason why I used it exclusively.)
> >>
> >> Any speed up would have been extremely helpful. :)
> >
> > This won't help Okular at all.
> >
> > Cheers, Albert
>
> I see. Would you consider including it (if deemed technically fit) anyway?
Including what? Your patch? In poppler or in Okular?
Albert
>
> Best regards, Adam.
>
> >> Though the most annoying part was not the waiting time -
> >> manually checking 200+ warnings is never going to be fast - it was
> >> that my CPU fan started spinning up loudly: those 6-7 seconds were
> >> seconds when Okular was taking 100% CPU.
> >>
> >> (*) I have the params noted, since I was actually imagining more
> >> of a per-word search index for a PDF. Now looking at your patch, I
> >> can even calculate the memory requirements. Global word index: 325K
> >> words, say 32 wchar_t each + int page + sizeof(rectf), is about
> >> 32MB - not much by modern standards. Per unique word it is
> >> even less: 20K unique words, about 20 hits per word on average ->
> >> wchar_t word[32]; { int page; rectf rect } x 20 ->
> >> 32*sizeof(wchar_t) + 20*( 4 + 4*sizeof(double)) -> 784 bytes.
> >> That multiplied by 20K words: about 16MB. (Plus of course the
> >> memory allocation overhead. With these types of structures, it can
> >> already bite.)
> >>
> >> wbr.
> >> _______________________________________________
> >> poppler mailing list
> >> poppler at lists.freedesktop.org
> >> http://lists.freedesktop.org/mailman/listinfo/poppler
> >
>
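For reference, the idea in Adam's proposal (search a page once and return every occurrence, instead of calling a find-next style search repeatedly) can be sketched roughly as follows. This is only an illustration under stated assumptions: it works on a plain string standing in for the extracted page text, returns character offsets rather than rectangles, and the names find_all and search_document are hypothetical, not part of Poppler's API.

```python
def find_all(page_text, phrase):
    """Return the start offsets of every non-overlapping, case-sensitive
    occurrence of phrase in page_text, scanning the text once."""
    if not phrase:
        return []
    hits = []
    pos = page_text.find(phrase)
    while pos != -1:
        hits.append(pos)
        # Continue after the current match so matches do not overlap.
        pos = page_text.find(phrase, pos + len(phrase))
    return hits


def search_document(pages, phrase):
    """Map page index -> list of offsets, skipping pages without hits.

    `pages` is assumed to be a list of already-extracted page texts, so
    each page's text is produced once and reused for all of its hits.
    """
    results = {}
    for index, text in enumerate(pages):
        hits = find_all(text, phrase)
        if hits:
            results[index] = hits
    return results
```

A caller would collect all hits for a phrase with a single call such as search_document(pages, "the"), rather than restarting the search on the same page for each successive hit.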
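The memory estimate in the quoted footnote can be checked with a short calculation. The sketch below uses the counts from the message (325K words, 20K unique, 20 hits per word, 32 characters per word). The type sizes are assumptions: the quoted figures of about 32MB and 784 bytes come out when a wide character is taken as 2 bytes, an int as 4 bytes, and a rectangle as four 8-byte doubles. With 4-byte wide characters, as on most Linux builds, the totals would be larger.

```python
# Sizes in bytes; all are assumptions chosen to reproduce the quoted figures.
WCHAR_BYTES = 2                  # assumed 2-byte wide character
INT_BYTES = 4                    # page number stored as a 4-byte int
DOUBLE_BYTES = 8
RECT_BYTES = 4 * DOUBLE_BYTES    # rectangle stored as four doubles
WORD_CHARS = 32                  # fixed-size word buffer, as in the message

# Counts taken from the message.
TOTAL_WORDS = 325_000
UNIQUE_WORDS = 20_000
HITS_PER_WORD = 20


def global_index_bytes():
    """One entry per word occurrence: word buffer + page + rectangle."""
    per_entry = WORD_CHARS * WCHAR_BYTES + INT_BYTES + RECT_BYTES
    return TOTAL_WORDS * per_entry


def per_unique_entry_bytes():
    """One entry per unique word: word buffer + a list of (page, rect) hits."""
    return WORD_CHARS * WCHAR_BYTES + HITS_PER_WORD * (INT_BYTES + RECT_BYTES)


def unique_index_bytes():
    """Total size of the per-unique-word index, ignoring allocator overhead."""
    return UNIQUE_WORDS * per_unique_entry_bytes()
```

Under these assumptions global_index_bytes() gives 32,500,000 bytes, per_unique_entry_bytes() gives 784 bytes, and unique_index_bytes() gives 15,680,000 bytes, in line with the "about 32MB" and "about 16MB" figures in the message. Allocation overhead is not modelled.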