[poppler] poppler_page_get_selected_raw_text() for poppler-glib
Carlos Garcia Campos
carlosgc at gnome.org
Sat Jan 8 01:26:43 PST 2011
Excerpts from Daniel Garcia's message of Wed Jan 05 14:11:05 +0100 2011:
> On Wed, Sep 22, 2010 at 02:11:31PM +0200, carlosgc wrote:
> > Excerpts from suzuki toshiya's message of Wed Sep 15 12:16:22 +0200 2010:
> > > Hi,
> >
> > Hi,
> >
> > > Attached patches are the introduction of new API to access raw text.
> > > I wish some maintainer of poppler-glib can review it.
> >
> > Yes, sorry for the delay.
> >
> > > poppler-0.15.0_glib-lib.diff
> > > patch to declare new function and its implementation
> > >
> >
> > I prefer poppler_page_get_raw_text(), rather than
> > poppler_page_get_selected_raw_text(), and always return the text of
> > the whole page. I don't see why you might want the selected text in
> > raw order.
>
> I've made that function. Here's the patch.
Thanks for the patch! Comments inline below.
> From 389d49e3413ce09601b574308bd6bbd46044e6b3 Mon Sep 17 00:00:00 2001
> From: danigm <danigm at wadobo.com>
> Date: Wed, 5 Jan 2011 14:07:59 +0100
> Subject: [PATCH] [glib] Added poppler_page_get_raw_text function
>
> ---
> glib/poppler-page.cc | 54 +++++++++++++++++++++++++++++++++++++++++++++++++-
> glib/poppler-page.h | 1 +
> 2 files changed, 54 insertions(+), 1 deletions(-)
>
> diff --git a/glib/poppler-page.cc b/glib/poppler-page.cc
> index a8e6b2d..8966f7e 100644
> --- a/glib/poppler-page.cc
> +++ b/glib/poppler-page.cc
> @@ -2117,7 +2117,7 @@ poppler_page_get_crop_box (PopplerPage *page, PopplerRectangle *rect)
> * This array must be freed with g_free () when done.
> *
> * The position in the array represents an offset in the text returned by
> - * poppler_page_get_text()
> + * poppler_page_get_raw_text()
Why? The offsets can only be compatible with both functions if they
return the same text, and I guess get_text_layout() wants the text in
reading order.
> * Return value: %TRUE if the page contains text, %FALSE otherwise
> *
> @@ -2200,3 +2200,55 @@ poppler_page_get_text_layout (PopplerPage *page,
>
> return TRUE;
> }
> +
> +/**
> + * poppler_page_get_raw_text:
> + * @page: A #PopplerPage
You should explain here what raw_text() exactly is, and why it is
different from get_text().
> + * Return value: a pointer to the text page in raw order
> + * as a string
This is new API, add Since: 0.18 here and remember to add the symbol
to glib/reference/poppler-sections.txt
> + **/
> +char *
> +poppler_page_get_raw_text (PopplerPage *page)
> +{
> + TextPage *text;
> + TextWordList *wordlist;
> + TextWord *word, *nextword;
> + char *craw_text;
> + GooString *raw_text;
> + int i = 0;
> +
> + raw_text = new GooString();
> +
> + g_return_val_if_fail (POPPLER_IS_PAGE (page), FALSE);
s/FALSE/NULL/
> + text = poppler_page_get_text_page (page);
> + wordlist = text->makeWordList (gFalse);
> +
> + if (wordlist->getLength () <= 0)
> + return NULL;
You are leaking wordlist and raw_text in this early return. Delete the
wordlist when length <= 0 and create raw_text after the if.
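The reordering can be sketched roughly as follows. This is only an illustration with stand-in types, not the poppler code: FakeWordList, live_lists and build_raw_text are hypothetical names, and std::string stands in for the output buffer. The point is just the order of operations: check the length first, free the list on the early return, and allocate the output only once it is known to be needed.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Counts live FakeWordList objects so a leak on any path is observable.
static int live_lists = 0;

// Hypothetical stand-in for poppler's TextWordList.
struct FakeWordList {
  std::vector<std::string> words;
  explicit FakeWordList (std::vector<std::string> w) : words (std::move (w)) { live_lists++; }
  ~FakeWordList () { live_lists--; }
  int getLength () const { return (int) words.size (); }
};

// Returns a heap-allocated string owned by the caller, or nullptr when
// the page has no words.
std::string *
build_raw_text (std::vector<std::string> page_words)
{
  FakeWordList *wordlist = new FakeWordList (std::move (page_words));

  if (wordlist->getLength () <= 0) {
    delete wordlist;  // release the list before the early return
    return nullptr;
  }

  // The output buffer is created only after the early-return check.
  std::string *raw_text = new std::string ();
  for (int i = 0; i < wordlist->getLength (); i++) {
    raw_text->append (wordlist->words[i]);
    raw_text->append (i + 1 < wordlist->getLength () ? " " : "\n");
  }

  delete wordlist;
  return raw_text;
}
```

With this ordering neither the empty-page path nor the normal path leaves a word list behind, and the only allocation that survives the call is the string handed back to the caller.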
> + for (i = 0; i < wordlist->getLength (); i++)
> + {
> + word = wordlist->get (i);
> + raw_text->append (word->getText ());
word->getText() returns a newly allocated GooString and
GooString::append() copies the given string, so you are leaking the
GooString here.
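A small sketch of the ownership pattern, again with stand-in types rather than the real classes: FakeString and FakeWord are hypothetical and only mimic the behaviour described above (getText() allocates, append() copies). Under that assumption the temporary has to be kept in a variable and deleted after the append.

```cpp
#include <cassert>
#include <string>

// Counts live FakeString objects so a leaked temporary is observable.
static int live_strings = 0;

// Hypothetical stand-in for GooString.
struct FakeString {
  std::string data;
  explicit FakeString (const std::string &s) : data (s) { live_strings++; }
  ~FakeString () { live_strings--; }
  // Copies the contents of other; does not take ownership of it.
  void append (const FakeString *other) { data += other->data; }
};

// Hypothetical stand-in for TextWord.
struct FakeWord {
  std::string text;
  // Returns a newly allocated string that the caller must delete.
  FakeString *getText () const { return new FakeString (text); }
};

void
append_word (FakeString *raw_text, const FakeWord &word)
{
  FakeString *word_text = word.getText ();  // caller owns this temporary
  raw_text->append (word_text);             // contents are copied
  delete word_text;                         // so the temporary can be freed
}
```

Calling raw_text->append (word.getText ()) directly would compile just as well, which is why this kind of leak is easy to miss.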
> + nextword = word->getNext ();
> + if (nextword)
> + {
> + raw_text->append (' ');
> + }
> + else
> + {
> + raw_text->append ('\n');
> + }
Don't use braces for single-line clauses. Here you could use something
like
raw_text->append (nextword ? ' ' : '\n');
> + }
> +
> + craw_text = g_strdup (raw_text->getCString ());
We can avoid this g_strdup() by using a GString instead of a
GooString.
GString *raw_text = g_string_new (NULL);
raw_text = g_string_append_len (raw_text, wordText->getCString(), wordText->getLength());
raw_text = g_string_append_c (raw_text, nextword ? ' ' : '\n');
craw_text = g_string_free (raw_text, FALSE);
> + delete wordlist;
> + delete raw_text;
> +
> + return craw_text;
> +}
> diff --git a/glib/poppler-page.h b/glib/poppler-page.h
> index d40c0ee..333cb23 100644
> --- a/glib/poppler-page.h
> +++ b/glib/poppler-page.h
> @@ -128,6 +128,7 @@ void poppler_page_get_crop_box (PopplerPage *page,
> gboolean poppler_page_get_text_layout (PopplerPage *page,
> PopplerRectangle **rectangles,
> guint *n_rectangles);
> +char *poppler_page_get_raw_text (PopplerPage *page);
>
> /* A rectangle on a page, with coordinates in PDF points. */
> #define POPPLER_TYPE_RECTANGLE (poppler_rectangle_get_type ())
--
Carlos Garcia Campos
PGP key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x523E6462