[poppler] RFC: whole-page search in the qt4 frontend
Albert Astals Cid
aacid at kde.org
Thu Jun 28 13:55:12 PDT 2012
On Thursday, 28 June 2012, at 20:38:45, Adam Reichold wrote:
>
> On 28.06.2012 20:35, Albert Astals Cid wrote:
> > On Thursday, 28 June 2012, at 20:16:30, Adam Reichold wrote:
> > Hello,
> >
> > On 28.06.2012 19:36, Albert Astals Cid wrote:
> >>>> On Thursday, 28 June 2012, at 18:54:40, Ihar `Philips` Filipau wrote:
> >>>>> On 6/28/12, Adam Reichold <adamreichold at myopera.com> wrote:
> >>>>>> If I remember correctly, some time ago someone proposed
> >>>>>> caching the TextOutputDev/TextPage used in
> >>>>>> Poppler::Page::search to improve performance. Instead, I
> >>>>>> would propose to add another search method to
> >>>>>> Poppler::Page which searches the whole page at once and
> >>>>>> returns a list of all occurrences. [>>snip<<] Testing
> >>>>>> this with some sample files shows large improvements
> >>>>>> (above 100% as measured by runtime) for searching the
> >>>>>> whole document and especially for short phrases that
> >>>>>> occur often.
> >>>>>>
> >>>>>> Thanks for any comments and advice. Best regards, Adam.
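
The idea in the proposal above, collecting every occurrence on a page in one call instead of calling a single-hit search repeatedly, can be sketched as follows. This is a minimal illustration on a plain std::string, assuming the expensive step is preparing the page text once per call; it is not Poppler code, and findAll is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of Poppler: returns the start offset of
// every non-overlapping occurrence of `needle` in `text`, in one pass.
std::vector<std::size_t> findAll(const std::string& text,
                                 const std::string& needle)
{
    assert(!needle.empty());  // an empty needle would match at every offset
    std::vector<std::size_t> hits;
    std::size_t pos = text.find(needle);
    while (pos != std::string::npos) {
        hits.push_back(pos);
        // Resume right after the current hit instead of restarting the scan.
        pos = text.find(needle, pos + needle.size());
    }
    return hits;
}
```

A caller would obtain the page text once and read all hits from the returned list, so the setup cost is paid once per page rather than once per occurrence.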
> >>>>>
> >>>>> That was me. Use case: I was checking the results of converting
> >>>>> a large PDF into an e-book.
> >>>>>
> >>>>> The PDF was a 600+ page book: 325K words in total, 20K
> >>>>> unique.(*) The problem was (and is) that there is no way to
> >>>>> point at a piece of text in the PDF - search was (and is) the
> >>>>> only option. The conversion produced around 200 warnings - and
> >>>>> I had to check them all. Meaning: 200 times searching for a
> >>>>> group of words in a 600+ page document. IIRC it was taking
> >>>>> 6-7 seconds per search in Okular (up-to-date version
> >>>>> from Debian Sid). (Other PDF viewers didn't fare any better.
> >>>>> But multi-word search is unique to Okular and was the
> >>>>> reason why I used it exclusively.)
> >>>>>
> >>>>> Any speed up would have been extremely helpful. :)
> >>>>
> >>>> This won't help Okular at all.
> >>>>
> >>>> Cheers, Albert
> >
> > I see. Would you consider including it (if deemed technically fit)
> > anyway?
> >
> >> Including what? Your patch? in poppler or in okular?
>
> The patch adding the whole-page search function to the qt4 frontend of
> poppler.
I'll have a look at that asap (which might not be very soon), but I don't see
why not if the code is fine.
Cheers,
Albert
>
> Best regards, Adam.
>
> >> Albert
> >
> > Best regards, Adam.
> >
> >>>>> Though the most annoying part was not the waiting time -
> >>>>> manually checking 200+ warnings is never going to be fast - it
> >>>>> was that my CPU fan started spinning up loudly: those 6-7
> >>>>> seconds were seconds when Okular was taking 100% CPU.
> >>>>>
> >>>>> (*) I have the params noted, since I was actually imagining
> >>>>> more of a per-word search index for a PDF. Now, looking at
> >>>>> your patch, I can even calculate the memory requirements. Global
> >>>>> word index: 325K words, say 32 wchar_t each + int page +
> >>>>> sizeof(rectf), is about 32MB - not much by modern
> >>>>> standards. Per unique word it is even less: 20K unique
> >>>>> words, about 20 hits per word on average -> wchar_t word[32];
> >>>>> { int page; rectf rect } x 20 -> 32*sizeof(wchar_t) + 20*(
> >>>>> 4 + 4*sizeof(double)) -> 784 bytes. That multiplied by 20K
> >>>>> words: about 16MB. (Plus of course the memory allocation
> >>>>> overhead. With these types of structures, it can already
> >>>>> bite.)
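
The memory estimate above can be reproduced with a few constants. The sketch below assumes what the quoted figures imply: 2-byte wide characters, a 4-byte int, and a rectangle stored as four 8-byte doubles. All names are hypothetical and only serve the arithmetic.

```cpp
#include <cstddef>

// Assumptions implied by the figures above, not measured on any platform.
constexpr std::size_t kWideCharBytes = 2;      // size of one wide character
constexpr std::size_t kIntBytes      = 4;      // page number
constexpr std::size_t kRectBytes     = 4 * 8;  // four doubles per rectangle
constexpr std::size_t kWordLen       = 32;     // characters reserved per word

// Global index: one entry (word text, page, rectangle) per word occurrence.
constexpr std::size_t globalIndexBytes(std::size_t totalWords)
{
    return totalWords * (kWordLen * kWideCharBytes + kIntBytes + kRectBytes);
}

// Per-unique-word entry: the word text once, plus a (page, rect) pair per hit.
constexpr std::size_t perWordEntryBytes(std::size_t hitsPerWord)
{
    return kWordLen * kWideCharBytes + hitsPerWord * (kIntBytes + kRectBytes);
}

// Whole index keyed by unique word.
constexpr std::size_t uniqueIndexBytes(std::size_t uniqueWords,
                                       std::size_t hitsPerWord)
{
    return uniqueWords * perWordEntryBytes(hitsPerWord);
}
```

With these assumptions, globalIndexBytes(325000) is 32,500,000 bytes, perWordEntryBytes(20) is 784 bytes, and uniqueIndexBytes(20000, 20) is 15,680,000 bytes, before any allocator overhead. Where wchar_t is 4 bytes, as is typical on Linux, the word buffer alone grows to 128 bytes per entry.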
> >>>>>
> >>>>> wbr. _______________________________________________
> >>>>> poppler mailing list poppler at lists.freedesktop.org
> >>>>> http://lists.freedesktop.org/mailman/listinfo/poppler
> >>>>
>