[poppler] poppler_page_get_selected_raw_text() for poppler-glib

Carlos Garcia Campos carlosgc at gnome.org
Sat Jan 8 01:26:43 PST 2011


Excerpts from Daniel Garcia's message of Wed Jan 05 14:11:05 +0100 2011:
> On Wed, Sep 22, 2010 at 02:11:31PM +0200, carlosgc wrote:
> > Excerpts from suzuki toshiya's message of Wed Sep 15 12:16:22 +0200 2010:
> > > Hi,
> > 
> > Hi, 
> > 
> > > Attached patches are the introduction of new API to access raw text.
> > > I wish some maintainer of poppler-glib can review it.
> > 
> > Yes, sorry for the delay. 
> > 
> > > poppler-0.15.0_glib-lib.diff
> > > patch to declare new function and its implementation
> > > 
> > 
> > I prefer poppler_page_get_raw_text(), rather than
> > poppler_page_get_selected_raw_text(), and always return the text of
> > the whole page. I don't see why you might want the selected text in
> > raw order.
> 
> I've made that function. Here's the patch.

Thanks for the patch! Comments inline below.

> From 389d49e3413ce09601b574308bd6bbd46044e6b3 Mon Sep 17 00:00:00 2001
> From: danigm <danigm at wadobo.com>
> Date: Wed, 5 Jan 2011 14:07:59 +0100
> Subject: [PATCH] [glib] Added poppler_page_get_raw_text function
>
> ---
>  glib/poppler-page.cc |   54 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  glib/poppler-page.h  |    1 +
>  2 files changed, 54 insertions(+), 1 deletions(-)
> 
> diff --git a/glib/poppler-page.cc b/glib/poppler-page.cc
> index a8e6b2d..8966f7e 100644
> --- a/glib/poppler-page.cc
> +++ b/glib/poppler-page.cc
> @@ -2117,7 +2117,7 @@ poppler_page_get_crop_box (PopplerPage *page, PopplerRectangle *rect)
>   * This array must be freed with g_free () when done.
>   *
>   * The position in the array represents an offset in the text returned by
> - * poppler_page_get_text()
> + * poppler_page_get_raw_text()

Why? If they are compatible, it is because they return the same text.
I guess get_text_layout() wants the text in reading order.

>   * Return value: %TRUE if the page contains text, %FALSE otherwise
>   *
> @@ -2200,3 +2200,55 @@ poppler_page_get_text_layout (PopplerPage       *page,
>  
>    return TRUE;
>  }
> +
> +/**
> + * poppler_page_get_raw_text:
> + * @page: A #PopplerPage

You should explain here exactly what get_raw_text() returns, and why it
is different from get_text().

> + * Return value: a pointer to the text page in raw order
> + *               as a string

This is new API: add Since: 0.18 here, and remember to add the symbol
to glib/reference/poppler-sections.txt.

> + **/
> +char *
> +poppler_page_get_raw_text (PopplerPage *page)
> +{
> +  TextPage *text;
> +  TextWordList *wordlist;
> +  TextWord *word, *nextword;
> +  char *craw_text;
> +  GooString *raw_text;
> +  int i = 0;
> +
> +  raw_text = new GooString();
> +
> +  g_return_val_if_fail (POPPLER_IS_PAGE (page), FALSE);

s/FALSE/NULL/

> +  text = poppler_page_get_text_page (page);
> +  wordlist = text->makeWordList (gFalse);
> +
> +  if (wordlist->getLength () <= 0)
> +    return NULL;

You are leaking wordlist and raw_text in this early return. Delete the
wordlist when length <= 0 and create raw_text after the if.
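
As a rough sketch of that reordering: the block below is plain C++ with a
hypothetical stand-in (FakeWordList) instead of poppler's real TextWordList,
and std::string instead of GooString, so the names are assumptions for
illustration only. The point is just the order of operations: check for the
empty case first, free the list before returning, and allocate the output
buffer only afterwards.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for poppler's TextWordList. It counts live
// instances so the example can check that nothing is leaked.
struct FakeWordList {
    static int live;
    std::vector<std::string> words;
    explicit FakeWordList(std::vector<std::string> w) : words(std::move(w)) { ++live; }
    ~FakeWordList() { --live; }
    int getLength() const { return static_cast<int>(words.size()); }
};
int FakeWordList::live = 0;

// Takes ownership of `wordlist`. Returns a heap-allocated string that the
// caller must delete, or nullptr when the list is empty.
std::string *raw_text_or_null(FakeWordList *wordlist)
{
    if (wordlist->getLength() <= 0) {
        delete wordlist;   // free the list before the early return
        return nullptr;    // no output buffer exists yet, so nothing else leaks
    }

    // The output buffer is created only once the early return is behind us.
    std::string *raw_text = new std::string();
    for (const std::string &w : wordlist->words)
        raw_text->append(w).append(" ");

    delete wordlist;
    return raw_text;
}
```

With this layout every exit path frees the list, and the empty case never
allocates a buffer at all.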

> +  for (i = 0; i < wordlist->getLength (); i++)
> +  {
> +    word = wordlist->get (i);
> +    raw_text->append (word->getText ());

word->getText() returns a newly allocated GooString, and
GooString::append() copies the given string, so you are leaking the
GooString here.
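
To illustrate the ownership issue, here is a minimal sketch in plain C++.
OwnedWord and release() are hypothetical stand-ins, not poppler API; they
only model a getText() that returns a newly allocated string owned by the
caller, plus a counter to show that each temporary is freed after its
contents have been copied.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a word whose getText() hands back a newly
// allocated string that the caller owns.
struct OwnedWord {
    static int outstanding;   // strings handed out and not yet released
    std::string text;
    std::string *getText() const { ++outstanding; return new std::string(text); }
};
int OwnedWord::outstanding = 0;

// Frees a string obtained from OwnedWord::getText() and updates the counter.
void release(std::string *s)
{
    --OwnedWord::outstanding;
    delete s;
}

void append_word(std::string &out, const OwnedWord &word)
{
    std::string *word_text = word.getText();  // we own this temporary
    out.append(*word_text);                   // append() copies the bytes...
    release(word_text);                       // ...so the temporary must be freed
}
```

Keeping the returned pointer in a local variable is what makes the later
free possible; passing the call result straight to append() loses it.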

> +    nextword = word->getNext ();
> +    if (nextword)
> +    {
> +      raw_text->append (' ');
> +    }
> +    else
> +    {
> +      raw_text->append ('\n');
> +    }

Don't use braces for single line clauses. Here you could use something
like

raw_text->append (nextword ? ' ' : '\n');
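
In context, the separator logic could be sketched like this. LineWord is a
hypothetical stand-in for TextWord (only getNext() mirrors the call used in
the patch), and std::string replaces GooString, so this shows the shape of
the loop rather than the actual poppler code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a word that knows its successor on the same line.
struct LineWord {
    std::string text;
    const LineWord *next;   // next word on the same line, or nullptr at line end
    const LineWord *getNext() const { return next; }
};

std::string join_words(const std::vector<const LineWord *> &words)
{
    std::string raw_text;
    for (const LineWord *word : words) {
        raw_text.append(word->text);
        // A space between words on the same line, a newline after the last one.
        raw_text.push_back(word->getNext() ? ' ' : '\n');
    }
    return raw_text;
}
```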

> +  }
> +
> +  craw_text = g_strdup (raw_text->getCString ());

We can avoid this g_strdup() by using a GString instead of a
GooString.

GString *raw_text = g_string_new (NULL);

raw_text = g_string_append_len (raw_text, wordText->getCString(), wordText->getLength());
raw_text = g_string_append_c (raw_text, nextword ? ' ' : '\n');

craw_text = g_string_free (raw_text, FALSE);

> +  delete wordlist;
> +  delete raw_text;
> +
> +  return craw_text;
> +}
> diff --git a/glib/poppler-page.h b/glib/poppler-page.h
> index d40c0ee..333cb23 100644
> --- a/glib/poppler-page.h
> +++ b/glib/poppler-page.h
> @@ -128,6 +128,7 @@ void                       poppler_page_get_crop_box          (PopplerPage        *page,
>  gboolean               poppler_page_get_text_layout      (PopplerPage        *page,
>                                                            PopplerRectangle  **rectangles,
>                                                            guint              *n_rectangles);
> +char                  *poppler_page_get_raw_text         (PopplerPage        *page);
>  
>  /* A rectangle on a page, with coordinates in PDF points. */
>  #define POPPLER_TYPE_RECTANGLE             (poppler_rectangle_get_type ())
-- 
Carlos Garcia Campos
PGP key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x523E6462