[poppler] poppler_page_get_selected_raw_text() for poppler-glib

Daniel Garcia danigm at wadobo.com
Wed Jan 5 05:11:05 PST 2011


On Wed, Sep 22, 2010 at 02:11:31PM +0200, carlosgc wrote:
> Excerpts from suzuki toshiya's message of Wed Sep 15 12:16:22 +0200 2010:
> > Hi,
> 
> Hi, 
> 
> > Attached patches are the introduction of new API to access raw text.
> > I wish some maintainer of poppler-glib can review it.
> 
> Yes, sorry for the delay. 
> 
> > poppler-0.15.0_glib-lib.diff
> > patch to declare new function and its implementation
> > 
> 
> I prefer poppler_page_get_raw_text(), rather than
> poppler_page_get_selected_raw_text(), and always return the text of
> the whole page. I don't see why you might want the selected text in
> raw order.

I've written that function as poppler_page_get_raw_text(); it always
returns the text of the whole page. Here's the patch.
-------------- next part --------------
From 389d49e3413ce09601b574308bd6bbd46044e6b3 Mon Sep 17 00:00:00 2001
From: danigm <danigm at wadobo.com>
Date: Wed, 5 Jan 2011 14:07:59 +0100
Subject: [PATCH] [glib] Added poppler_page_get_raw_text function

---
 glib/poppler-page.cc |   54 +++++++++++++++++++++++++++++++++++++++++++++++++-
 glib/poppler-page.h  |    1 +
 2 files changed, 54 insertions(+), 1 deletions(-)

diff --git a/glib/poppler-page.cc b/glib/poppler-page.cc
index a8e6b2d..8966f7e 100644
--- a/glib/poppler-page.cc
+++ b/glib/poppler-page.cc
@@ -2117,7 +2117,7 @@ poppler_page_get_crop_box (PopplerPage *page, PopplerRectangle *rect)
  * This array must be freed with g_free () when done.
  *
  * The position in the array represents an offset in the text returned by
- * poppler_page_get_text()
+ * poppler_page_get_raw_text()
  *
  * Return value: %TRUE if the page contains text, %FALSE otherwise
  *
@@ -2200,3 +2200,55 @@ poppler_page_get_text_layout (PopplerPage       *page,
 
   return TRUE;
 }
+
+/**
+ * poppler_page_get_raw_text:
+ * @page: A #PopplerPage
+ *
+ * Return value: a newly allocated string with the text of the whole
+ *               page in raw order, or %NULL if the page has no text.
+ *               Free it with g_free () when done.
+ **/
+char *
+poppler_page_get_raw_text (PopplerPage *page)
+{
+  TextPage *text;
+  TextWordList *wordlist;
+  TextWord *word, *nextword;
+  char *craw_text;
+  GooString *raw_text;
+  int i = 0;
+
+  g_return_val_if_fail (POPPLER_IS_PAGE (page), NULL);
+  text = poppler_page_get_text_page (page);
+  wordlist = text->makeWordList (gFalse);
+  if (wordlist->getLength () <= 0)
+  {
+    delete wordlist;
+    return NULL;
+  }
+  raw_text = new GooString ();
+
+  for (i = 0; i < wordlist->getLength (); i++)
+  {
+    word = wordlist->get (i);
+    raw_text->append (word->getText ());
+
+    nextword = word->getNext ();
+    if (nextword)
+    {
+      raw_text->append (' ');
+    }
+    else
+    {
+      raw_text->append ('\n');
+    }
+  }
+
+  craw_text = g_strdup (raw_text->getCString ());
+
+  delete wordlist;
+  delete raw_text;
+
+  return craw_text;
+}
diff --git a/glib/poppler-page.h b/glib/poppler-page.h
index d40c0ee..333cb23 100644
--- a/glib/poppler-page.h
+++ b/glib/poppler-page.h
@@ -128,6 +128,7 @@ void 		      poppler_page_get_crop_box 	 (PopplerPage        *page,
 gboolean               poppler_page_get_text_layout      (PopplerPage        *page,
                                                           PopplerRectangle  **rectangles,
                                                           guint              *n_rectangles);
+char                  *poppler_page_get_raw_text         (PopplerPage        *page);
 
 /* A rectangle on a page, with coordinates in PDF points. */
 #define POPPLER_TYPE_RECTANGLE             (poppler_rectangle_get_type ())
-- 
1.7.3.4.742.g987cd
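
For anyone who wants to check the joining rule without building poppler,
here is a rough standalone sketch of the loop in the patch. It is not
poppler code: Word and join_raw_words are hypothetical names, and the
has_next flag only stands in for "word->getNext () returned non-NULL".
The rule is the one in the patch: a space after a word that has a next
word, a newline otherwise.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for poppler's TextWord: the word's text plus a flag
// that models "getNext () != NULL" from the patch.
struct Word {
  std::string text;
  bool has_next;
};

// Append every word in list order, followed by ' ' when the word has a
// successor and by '\n' when it does not (same rule as the patch's loop).
std::string join_raw_words(const std::vector<Word> &words) {
  std::string out;
  for (const Word &w : words) {
    out += w.text;
    out += w.has_next ? ' ' : '\n';
  }
  return out;
}
```

With this rule every chain of linked words ends in a newline, including
the last one, so the returned string always ends with '\n' when the page
has any words.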
