[Poppler-bugs] [Bug 28053] poppler is too slow when searching this file

Fri Apr 22 03:07:28 PDT 2011

https://bugs.freedesktop.org/show_bug.cgi?id=28053

--- Comment #8 from Sebastian Kums <quirks1 at web.de> 2011-04-22 03:07:28 PDT ---
I had a look at this recently, because I am heavily affected by this bug: I
search a lot in big PDF documents (1000+ pages), and a search takes 2-3
minutes. As you probably know, the reason for this is that poppler needs to
render every page (at least the text portion) in order to extract its text.
After the search, the rendered page is discarded in order to conserve memory (I
guess). I wrote a little proof-of-concept enhancement, which reduces the search
duration to less than 5 seconds in a 1000 page document for subsequent (!)
searches. Please mind that this is really just a proof of concept! I used
pretty dirty programming techniques, because (a) I have hardly any time and (b)
I am not a good C programmer. But all I want is to propose an idea. Here is how
it works:

I introduced a static variable named text_cache in poppler_page_find_text()
(glib/poppler-page.cc). The contents of this variable are never freed, since it
is static. (Again, there are probably more elegant methods to achieve this, but
this is only a demo.) The program flow is as follows:

1. evince calls poppler_page_find_text() to search for text in a page
   given as an argument.
2. poppler_page_find_text() now checks whether the page to be searched is
   already in the cache (i.e., the variable text_cache).
2.a. The first time that the given page is searched, this will not be the
     case, because the page has never been rendered before. Therefore,
     the program flow is as usual:
2.a.1. The page is rendered to text_dev.
2.a.2. Then, text_dev->findText() is called to search for the text.
2.a.3. I made a little addition at this point: while the rendered page is 
       usually discarded after this operation, I write the plain text to
       text_cache by calling text_dev->cacheText() - a function I added.
       Only after this, the rendered page is discarded. Please
       note that text_cache only contains the plain text, which is a lot
       smaller in memory than the whole rendition of the page (a few hundred 
       bytes vs. several hundred kB per page).
2.b. The next time the user performs a search on this very page, the plain
     text of the page will be found in the text_cache.
     poppler_page_find_text() therefore calls a function I added named
     poppler_page_scan_text_cache(), which searches the text_cache for
     the given keyword. poppler_page_scan_text_cache() only tells IF the
     keyword is contained in the page. It cannot tell WHERE it is
     located. However, this result is returned very fast, because the
     page does not need to be rendered.
2.b.a. If the text was not found on the page, then poppler_page_find_text()
       aborts immediately - there is no need to render the page, because
       it can be assumed that the text is not contained in the page.
2.b.b. If the text was found in the cached text page, then the page is
       rendered as usual and text_dev->findText is called to determine
       the exact locations of the text on the page.
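
The flow above can be sketched roughly as follows. This is not the attached
patch, only a minimal stand-alone illustration of the idea. CachedPageSearcher
and RenderFn are hypothetical names; the callback stands in for rendering a
page to text_dev, and the returned character offsets stand in for the
locations that text_dev->findText() would report. Matching is case-sensitive
here purely for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the expensive step (rendering a page to text_dev).
// It returns the plain text of the requested page.
using RenderFn = std::function<std::string(int page_index)>;

class CachedPageSearcher {
public:
    explicit CachedPageSearcher(RenderFn render) : render_(std::move(render)) {}

    // Returns the offsets of every occurrence of `needle` on the page.
    std::vector<std::size_t> find(int page_index, const std::string& needle) {
        assert(page_index >= 0);

        // Step 2.b: the page is cached, so scan the plain text first.
        auto it = text_cache_.find(page_index);
        if (it != text_cache_.end() &&
            it->second.find(needle) == std::string::npos) {
            return {};  // Step 2.b.a: no match in the cache, skip rendering.
        }

        // Steps 2.a and 2.b.b: render the page and locate the occurrences.
        std::string text = render_(page_index);
        ++render_count_;
        std::vector<std::size_t> hits;
        for (std::size_t pos = text.find(needle); pos != std::string::npos;
             pos = text.find(needle, pos + 1)) {
            hits.push_back(pos);
        }

        // Step 2.a.3: keep only the plain text; the rendition is dropped.
        text_cache_[page_index] = std::move(text);
        return hits;
    }

    // Number of times the expensive render callback has been invoked.
    int render_count() const { return render_count_; }

private:
    RenderFn render_;
    std::unordered_map<int, std::string> text_cache_;
    int render_count_ = 0;
};
```

In this sketch the first search over a document calls the render callback
once per page, while a later search calls it only for pages whose cached text
contains the search string.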

The big performance improvement is made in 2.b.a. poppler saves the effort of
rendering a whole lot of pages, because before rendering, it checks if the text
is contained in the page at all (which is very fast). Pages that do not contain
the text are skipped.

Of course, the first time that the document is opened, the search will be slow
as usual, because the text cache is empty. But every subsequent search is way
faster, because only pages containing the search string are rendered. I believe
that acroread even goes as far as saving its text cache to disk so that it can
be restored in the future without the need to render the whole document again.
I did not implement this, though.

To sum it up:
Pros:
- Dramatically improved search speed for consecutive searches (obviously).
Cons:
- Minimal memory overhead to cache the text of the document (negligible, IMO).
- The first search is still slow (could be improved by saving text_cache to
disk and restoring it the next time the document is opened).
- If the search string is found on all pages, the search is still slow, because
the text_cache produces only hits and all pages are rendered. This is a
different story, though. Ideally, evince should not render any page but the
currently displayed one. It would be sufficient if poppler returned only the
page numbers that contain the text in question and not the exact location of
all occurrences. A page should be rendered only when the user goes to a page
with a hit. But then again, why would I search a document for a string that is
found on every page anyway?
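
The "page numbers only" idea from the last point could be sketched like this.
Neither poppler nor the patch offers such a function; pages_with_hits() and
the TextCache type are hypothetical, and the sketch assumes the cache is
already filled for every page of the document.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical cache type: page index -> plain text of that page.
using TextCache = std::map<int, std::string>;

// Returns the indices of all cached pages whose text contains `needle`, in
// ascending page order. No page is rendered; a caller would render a page
// only when the user actually navigates to it.
std::vector<int> pages_with_hits(const TextCache& text_cache,
                                 const std::string& needle) {
    std::vector<int> pages;
    for (const auto& entry : text_cache) {
        if (entry.second.find(needle) != std::string::npos) {
            pages.push_back(entry.first);
        }
    }
    return pages;
}
```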

I will attach my adapted source. The source is for poppler 0.12.4. It is old -
I know - but this is current on my distro (Ubuntu 10.04). The patch can be
easily adapted to the most recent version of poppler, though.

Files:
- poppler/TextOutputDev.h: added "struct TextCachePage" and
"TextOutputDev::cacheText()"
- poppler/TextOutputDev.cc: added "TextOutputDev::cacheText()"
- glib/poppler-page.cc: added "poppler_page_scan_text_cache()" and modified
"poppler_page_find_text()"

P.S. I can totally understand if the poppler developers say this is crap. I
just want to present an idea that others might benefit from, too. If you
don't like it, feel free to toss it. I am using it happily.
Also, sorry for the lengthy post.
