[poppler] poppler_page_get_selected_raw_text() for poppler-glib
Daniel Garcia
danigm at wadobo.com
Sat Jan 8 06:05:19 PST 2011
On Sat, Jan 08, 2011 at 10:26:43AM +0100, Carlos Garcia Campos wrote:
> Thanks for the patch!. Comments inline below
>
> > From 389d49e3413ce09601b574308bd6bbd46044e6b3 Mon Sep 17 00:00:00 2001
> > From: danigm <danigm at wadobo.com>
> > Date: Wed, 5 Jan 2011 14:07:59 +0100
> > Subject: [PATCH] [glib] Added poppler_page_get_raw_text function
> >
> > ---
> > glib/poppler-page.cc | 54 +++++++++++++++++++++++++++++++++++++++++++++++++-
> > glib/poppler-page.h | 1 +
> > 2 files changed, 54 insertions(+), 1 deletions(-)
> >
> > diff --git a/glib/poppler-page.cc b/glib/poppler-page.cc
> > index a8e6b2d..8966f7e 100644
> > --- a/glib/poppler-page.cc
> > +++ b/glib/poppler-page.cc
> > @@ -2117,7 +2117,7 @@ poppler_page_get_crop_box (PopplerPage *page, PopplerRectangle *rect)
> > * This array must be freed with g_free () when done.
> > *
> > * The position in the array represents an offset in the text returned by
> > - * poppler_page_get_text()
> > + * poppler_page_get_raw_text()
>
> Why? if they are compatible is because they return the same, I guess
> get_text_layout() wants the text in reading order.
The thing is get_text_layout now uses wordlist to get the text layout
so it represents the offset returned by get_raw_text, no get_text.
Maybe we should change the name and create a new one with read order
text layout. So we will have get_text_layout and get_raw_text_layout.
>
> > * Return value: %TRUE if the page contains text, %FALSE otherwise
> > *
> > @@ -2200,3 +2200,55 @@ poppler_page_get_text_layout (PopplerPage *page,
> >
> > return TRUE;
> > }
> > +
> > +/**
> > + * poppler_page_get_raw_text:
> > + * @page: A #PopplerPage
>
> You should explain here what raw_text() exactly is, and why it is
> different from get_text().
>
> > + * Return value: a pointer to the text page in raw order
> > + * as a string
>
> This is new API, add Since: 0.18 here and remember to add the symbol
> to glib/reference/poppler-sections.txt
>
> > + **/
> > +char *
> > +poppler_page_get_raw_text (PopplerPage *page)
> > +{
> > + TextPage *text;
> > + TextWordList *wordlist;
> > + TextWord *word, *nextword;
> > + char *craw_text;
> > + GooString *raw_text;
> > + int i = 0;
> > +
> > + raw_text = new GooString();
> > +
> > + g_return_val_if_fail (POPPLER_IS_PAGE (page), FALSE);
>
> s/FALSE/NULL/
>
> > + text = poppler_page_get_text_page (page);
> > + wordlist = text->makeWordList (gFalse);
> > +
> > + if (wordlist->getLength () <= 0)
> > + return NULL;
>
> You are leaking wordlist and raw_text in this early return. Delete the
> wordlist when length <= 0 and create raw_text after the if.
>
> > + for (i = 0; i < wordlist->getLength (); i++)
> > + {
> > + word = wordlist->get (i);
> > + raw_text->append (word->getText ());
>
> word->getText() returns a new allocated GooString and
> GooString::append() copies the given string, so you are leaking the
> GooString here.
>
> > + nextword = word->getNext ();
> > + if (nextword)
> > + {
> > + raw_text->append (' ');
> > + }
> > + else
> > + {
> > + raw_text->append ('\n');
> > + }
>
> Don't use braces for single line clauses. Here you could use something
> like
>
> raw_text->append (nextword ? ' ' : '\n');
>
> > + }
> > +
> > + craw_text = g_strdup (raw_text->getCString ());
>
> We can avoid this g_strdup() by using a GString instead of a
> GooString.
>
> GString *raw_text = g_string_new (NULL);
>
> raw_text = g_string_append_len (raw_text, wordText->getCString(), wordText->getLength());
> raw_text = g_string_append_c (raw_text, nextword ? ' ' : '\n');
>
> craw_text = g_string_free (raw_text, FALSE);
>
> > + delete wordlist;
> > + delete raw_text;
> > +
> > + return craw_text;
> > +}
> > diff --git a/glib/poppler-page.h b/glib/poppler-page.h
> > index d40c0ee..333cb23 100644
> > --- a/glib/poppler-page.h
> > +++ b/glib/poppler-page.h
> > @@ -128,6 +128,7 @@ void poppler_page_get_crop_box (PopplerPage *page,
> > gboolean poppler_page_get_text_layout (PopplerPage *page,
> > PopplerRectangle **rectangles,
> > guint *n_rectangles);
> > +char *poppler_page_get_raw_text (PopplerPage *page);
> >
> > /* A rectangle on a page, with coordinates in PDF points. */
> > #define POPPLER_TYPE_RECTANGLE (poppler_rectangle_get_type ())
I've made all the previous changes. Here's the new patch.
-------------- next part --------------
From bc4cc4cab9ef566643d6b39f15955a75413fd3b8 Mon Sep 17 00:00:00 2001
From: danigm <danigm at wadobo.com>
Date: Wed, 5 Jan 2011 14:07:59 +0100
Subject: [PATCH] [glib] Added poppler_page_get_raw_text function
---
glib/poppler-page.cc | 56 ++++++++++++++++++++++++++++++++++-
glib/poppler-page.h | 1 +
glib/reference/poppler-sections.txt | 1 +
3 files changed, 57 insertions(+), 1 deletions(-)
diff --git a/glib/poppler-page.cc b/glib/poppler-page.cc
index a8e6b2d..cca807c 100644
--- a/glib/poppler-page.cc
+++ b/glib/poppler-page.cc
@@ -2117,7 +2117,7 @@ poppler_page_get_crop_box (PopplerPage *page, PopplerRectangle *rect)
* This array must be freed with g_free () when done.
*
* The position in the array represents an offset in the text returned by
- * poppler_page_get_text()
+ * poppler_page_get_raw_text()
*
* Return value: %TRUE if the page contains text, %FALSE otherwise
*
@@ -2200,3 +2200,57 @@ poppler_page_get_text_layout (PopplerPage *page,
return TRUE;
}
+
+/**
+ * poppler_page_get_raw_text:
+ * @page: A #PopplerPage
+ *
+ * Raw text is the page text, order by its internal position inside
+ * the pdf document.
+ *
+ * Return value: a pointer to the text page in raw order
+ * as a string
+ *
+ * Since: 0.18
+ **/
+char *
+poppler_page_get_raw_text (PopplerPage *page)
+{
+ TextPage *text;
+ TextWordList *wordlist;
+ TextWord *word, *nextword;
+ char *craw_text;
+ GooString *wordText;
+ GString *raw_text;
+ int i = 0;
+
+ g_return_val_if_fail (POPPLER_IS_PAGE (page), NULL);
+
+ text = poppler_page_get_text_page (page);
+ wordlist = text->makeWordList (gFalse);
+
+ if (wordlist->getLength () <= 0)
+ {
+ delete wordlist;
+ return NULL;
+ }
+
+ raw_text = g_string_new (NULL);
+
+ for (i = 0; i < wordlist->getLength (); i++)
+ {
+ word = wordlist->get (i);
+ wordText = word->getText ();
+ raw_text = g_string_append_len (raw_text, wordText->getCString(), wordText->getLength());
+ delete wordText;
+
+ nextword = word->getNext ();
+ raw_text = g_string_append_c (raw_text, nextword ? ' ' : '\n');
+ }
+
+ craw_text = g_string_free (raw_text, FALSE);
+
+ delete wordlist;
+
+ return craw_text;
+}
diff --git a/glib/poppler-page.h b/glib/poppler-page.h
index d40c0ee..333cb23 100644
--- a/glib/poppler-page.h
+++ b/glib/poppler-page.h
@@ -128,6 +128,7 @@ void poppler_page_get_crop_box (PopplerPage *page,
gboolean poppler_page_get_text_layout (PopplerPage *page,
PopplerRectangle **rectangles,
guint *n_rectangles);
+char *poppler_page_get_raw_text (PopplerPage *page);
/* A rectangle on a page, with coordinates in PDF points. */
#define POPPLER_TYPE_RECTANGLE (poppler_rectangle_get_type ())
diff --git a/glib/reference/poppler-sections.txt b/glib/reference/poppler-sections.txt
index 6ad9ee2..bea4dfe 100644
--- a/glib/reference/poppler-sections.txt
+++ b/glib/reference/poppler-sections.txt
@@ -38,6 +38,7 @@ poppler_page_get_selected_text
poppler_page_find_text
poppler_page_get_text
poppler_page_get_text_layout
+poppler_page_get_raw_text
poppler_page_get_link_mapping
poppler_page_free_link_mapping
poppler_page_get_image_mapping
--
1.7.3.4.742.g987cd
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20110108/f55400c9/attachment.pgp>
More information about the poppler
mailing list