[poppler] a plan to extend poppler-glib to access the raw text
carlosgc
carlosgc at gnome.org
Tue Sep 7 00:04:13 PDT 2010
Excerpts from mpsuzuki's message of Tue Sep 07 08:42:31 +0200 2010:
> Hi,
>
> I want to ask some questions about the internal design of
> poppler-glib.
>
> ----------------------------------------------------------
>
> Recently Albert accepted my proposal to extend the interface
> of TextOutputDev to access the raw text (the layout/position
> info is not considered). At present, only poppler-qt4 can
> use the extended API, but I don't want to restrict it to
> poppler-qt4. I'm trying to extend poppler-glib (and poppler-cpp
> next) to use the extended API.
>
> Looking at how the internal code extracts text from a PDF,
> I found a difference between poppler-qt4 and poppler-glib.
> Adding a few new APIs to enable/disable raw-order mode is
> not enough for poppler-glib to access the raw text.
>
> poppler-qt4
> -----------
> To get the text content from a page object, Poppler::Page::text()
> is invoked.
>
> In Poppler::Page::text(), a TextOutputDev is created,
> TextOutputDev::displayPageSlice() is invoked with the selection area,
> and TextOutputDev::getText() is invoked to obtain a GooString.
> Finally, the GooString is converted to a QString object and returned
> to the client.
>
> poppler-glib
> ------------
> To get the text content from a page object,
> TextOutputDev::getSelectionText() is used.
>
> It dumps the strings collected by a TextSelectionVisitor
> object. TextSelectionVisitor defines 3 methods to consume the text,
> visitBlock(), visitLine() and visitWord(), but only visitLine()
> is implemented. Because a "line" is defined by the analysis of
> the text layout, there are no lines in raw order.
>
Why not simply use TextOutputDev::getText() like the qt4 frontend does?
TextOutputDev::getSelectionText() is meant for selections, but you
don't want text in raw order for selections. I would just add a new
method:

gchar *poppler_page_get_raw_text (PopplerPage *page);

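
To illustrate the difference between the two paths, here is a toy model. It is only a sketch and does not use poppler itself: ToyTextPage, buildPage(), selectionText() and plainText() are hypothetical stand-ins, and the "layout analysis" is reduced to putting every word on one line. It shows why a collector that only works through lines gets nothing in raw order, while a getText()-style walk over the words still returns the text.

```cpp
#include <string>
#include <vector>

// Toy model only: these types are hypothetical stand-ins, not poppler classes.
struct Word { std::string text; };
struct Line { std::vector<Word> words; };

struct ToyTextPage {
    std::vector<Word> rawWords;  // words in the order they were drawn
    std::vector<Line> lines;     // filled only when layout analysis runs
};

// Build a page. In raw-order mode the layout analysis is skipped,
// so the page ends up with words but no lines.
ToyTextPage buildPage(const std::vector<std::string>& words, bool rawOrder) {
    ToyTextPage page;
    for (const auto& w : words) {
        page.rawWords.push_back(Word{w});
    }
    if (!rawOrder && !page.rawWords.empty()) {
        // Stand-in for real layout analysis: put every word on a single line.
        Line line;
        line.words = page.rawWords;
        page.lines.push_back(line);
    }
    return page;
}

// Join words with single spaces.
static std::string joinWords(const std::vector<Word>& words) {
    std::string out;
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (i > 0) out += ' ';
        out += words[i].text;
    }
    return out;
}

// Selection-style path: text is collected only line by line,
// mirroring a visitor that implements nothing but visitLine().
std::string selectionText(const ToyTextPage& page) {
    std::string out;
    for (const auto& line : page.lines) {
        out += joinWords(line.words);
        out += '\n';
    }
    return out;
}

// getText()-style path: walks the words directly, so it does not
// depend on lines having been built.
std::string plainText(const ToyTextPage& page) {
    return joinWords(page.rawWords);
}
```

In this model, selectionText() returns an empty string for a page built with rawOrder set to true, while plainText() still returns the words. That is the reason a separate raw-text entry point looks simpler than teaching the selection code about raw order.
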
Regards,
--
Carlos Garcia Campos
PGP key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x523E6462