[poppler] a plan to extend poppler-glib to access the raw text

mpsuzuki at hiroshima-u.ac.jp
Mon Sep 6 23:42:31 PDT 2010


Hi,

I want to ask some questions about the internal design of
poppler-glib.

----------------------------------------------------------

Recently Albert accepted my proposal to extend the interface
of TextOutputDev to access the raw text (the layout/position
info is not considered). At present, only poppler-qt4 can
use the extended API, but I don't want to restrict it to
poppler-qt4. I'm trying to extend poppler-glib (and poppler-cpp
next) to use the extended API.

Looking at the internal code that extracts the text from a PDF,
I found a difference between poppler-qt4 and poppler-glib.
Because of it, adding a few new APIs to enable/disable raw-order
mode is insufficient for poppler-glib to access the raw text.

poppler-qt4
-----------
To get the text content from a page object, Poppler::Page::text()
is invoked.

In Poppler::Page::text(), a TextOutputDev is created,
TextOutputDev::displayPageSlice() is invoked with the selection area,
TextOutputDev::getText() is invoked, and a GooString is obtained.
Finally, the GooString is converted to a QString object and returned
to the client.

poppler-glib
------------
To get the text content from a page object,
TextOutputDev::getSelectionText() is used.

It dumps the strings collected by a TextSelectionVisitor
object. TextSelectionVisitor defines 3 methods to consume the text:
visitBlock(), visitLine() and visitWord(). But only the visitLine()
method is implemented. Because a "line" is defined by the
analysis of the text layout, there are no lines in raw order.
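
To illustrate the problem, here is a minimal self-contained model.
It is not poppler's actual code: Word, Line, PageText, Visitor,
LineOnlyCollector, walk() and collectLineOnly() are hypothetical
stand-ins for the real text objects, TextSelectionVisitor and the
selection traversal. It assumes that raw-order mode yields words
that belong to no line.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for poppler's internal text objects.
struct Word { std::string text; };
struct Line { std::vector<Word> words; };

// Text extracted from one page in this model.
struct PageText {
    std::vector<Line> lines;       // filled by layout analysis; empty in raw order (assumption)
    std::vector<Word> looseWords;  // words that belong to no line (raw order)
};

// Hypothetical visitor with two of the hooks named above.
struct Visitor {
    virtual ~Visitor() = default;
    virtual void visitLine(const Line &) {}
    virtual void visitWord(const Word &) {}
};

// Mirrors the current situation: only visitLine() does any work.
struct LineOnlyCollector : Visitor {
    std::string out;
    void visitLine(const Line &line) override {
        bool first = true;
        for (const Word &w : line.words) {
            if (!first) out += ' ';
            out += w.text;
            first = false;
        }
        out += '\n';
    }
};

// Simplified traversal: each line, then its words, then the loose words.
void walk(const PageText &page, Visitor &v) {
    for (const Line &line : page.lines) {
        v.visitLine(line);
        for (const Word &w : line.words) v.visitWord(w);
    }
    for (const Word &w : page.looseWords) v.visitWord(w);
}

std::string collectLineOnly(const PageText &page) {
    LineOnlyCollector c;
    walk(page, c);
    return c.out;
}
```

In this model, a page with lines yields its text, while a raw-order
page (only looseWords filled) yields an empty string, because
visitLine() is never called.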

---------------------------------------------------------------

In-depth modification is required to keep the procedure
similar between poppler-glib's physical-layout mode and
its raw-order mode. Because no lines can be
defined in raw-order mode, using visitLine() for raw-order
mode won't be a good idea. Adding an implementation of
visitWord() would be better. Are there any features bound
to the properties obtained by visitLine()? I don't want to
plant a mine that blows up applications assuming all text
is collected by visitLine().
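
One possible shape of such a change, as a rough self-contained
sketch: the collector keeps using visitLine() in physical-layout
mode, so existing line-based behaviour is untouched, and uses
visitWord() only in raw-order mode. All names here (Word, Line,
PageText, Visitor, ModeAwareCollector, walk(), collect()) are
hypothetical and do not match poppler's real classes. The sketch
again assumes that raw-order words belong to no line.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for poppler's internal text objects.
struct Word { std::string text; };
struct Line { std::vector<Word> words; };

struct PageText {
    std::vector<Line> lines;       // empty in raw-order mode (assumption)
    std::vector<Word> looseWords;  // raw-order words, not attached to any line
};

struct Visitor {
    virtual ~Visitor() = default;
    virtual void visitLine(const Line &) {}
    virtual void visitWord(const Word &) {}
};

// Collects by line in physical-layout mode and by word in raw-order mode.
struct ModeAwareCollector : Visitor {
    explicit ModeAwareCollector(bool rawOrderA) : rawOrder(rawOrderA) {}
    bool rawOrder;
    std::string out;

    void visitLine(const Line &line) override {
        if (rawOrder) return;  // no lines are expected in raw order
        bool first = true;
        for (const Word &w : line.words) {
            if (!first) out += ' ';
            out += w.text;
            first = false;
        }
        out += '\n';
    }

    void visitWord(const Word &w) override {
        if (!rawOrder) return;  // avoid collecting line text twice
        if (!out.empty()) out += ' ';
        out += w.text;
    }
};

// Simplified traversal: each line, then its words, then the loose words.
void walk(const PageText &page, Visitor &v) {
    for (const Line &line : page.lines) {
        v.visitLine(line);
        for (const Word &w : line.words) v.visitWord(w);
    }
    for (const Word &w : page.looseWords) v.visitWord(w);
}

std::string collect(const PageText &page, bool rawOrder) {
    ModeAwareCollector c(rawOrder);
    walk(page, c);
    return c.out;
}
```

Whether a real client depends on per-line properties (for example
line breaks in the returned string) is exactly the open question;
in this sketch the raw-order output has no line breaks at all.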

Regards,
mpsuzuki

