<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">Hi, <div><br></div><div>We’ve been using Poppler from a Python extension module to convert a PDF to text and extract information about each token. We construct a TextOutputDev, build a TextWordList from it, and return the font and other attributes for each word. </div><div><br></div><div>One thing we’d like to add is the line each token appears on and, optionally, its index within that line. Is there an easy way to get this from a TextOutputDev? </div><div><br></div><div>-kim </div><div><br></div><div><br></div><div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(209, 47, 27);"><span style="color: #78492a">#include </span>"poppler.h"</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(209, 47, 27);"><span style="color: #78492a">#include </span>"TextOutputDev.h"</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(120, 73, 42);">#include <span style="color: #d12f1b">&lt;sstream&gt;</span></div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(120, 73, 42);">#include <span style="color: #d12f1b">&lt;cstring&gt;</span></div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(209, 47, 27);"><span style="color: #78492a">#include </span>"PDFDocFactory.h"</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span style="color: #bb2ca2">const</span> <span style="color: #bb2ca2">double</span> PopplerParser::resolution = <span style="color: #272ad8">72.0</span>;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; 
font-family: Menlo;">PopplerParser::PopplerParser (<span style="color: #bb2ca2">const</span> std::string inputFilename) {</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>GooString *ownerPW, *userPW;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>ownerPW = <span style="color: #bb2ca2">NULL</span>;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(0, 132, 0);"><span style="color: #000000"><span class="Apple-tab-span" style="white-space:pre"> </span>userPW = </span><span style="color: #bb2ca2">NULL</span><span style="color: #000000">; </span>//assume no user and owner passwords</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span><span style="color: #bb2ca2">char</span> st[inputFilename.length()+<span style="color: #272ad8">1</span>];</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>strcpy(st,inputFilename.c_str());</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>GooString* fileName;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>fileName = <span style="color: #bb2ca2">new</span> GooString(st);</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(0, 132, 0);"><span style="color: #000000"><span class="Apple-tab-span" style="white-space:pre"> </span></span>//create the document</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(0, 132, 0);"><span style="color: #000000"><span class="Apple-tab-span" style="white-space:pre"> </span></span>//assumes no owner or userpassword</div><div style="margin: 0px; font-size: 11px; 
font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>PopplerParser::doc = PDFDocFactory().createPDFDoc(*fileName, ownerPW, userPW);</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>PopplerParser::numPages = PopplerParser::doc->getNumPages();</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span><span style="color: #bb2ca2">delete</span> fileName;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>}</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span style="color: #bb2ca2">int</span> PopplerParser::getPages() {</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span><span style="color: #bb2ca2">return</span> PopplerParser::numPages; </div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">}</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">PopplerParser::~PopplerParser() {</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(0, 132, 0);"><span style="color: #000000"><span class="Apple-tab-span" style="white-space:pre"> </span></span>//delete PopplerParser::numPages;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span><span style="color: #bb2ca2">delete</span> PopplerParser::doc;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">}</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; 
min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">std::string PopplerParser::Parse() {</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>GBool physLayout = gTrue;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>GBool fixedPitch = gFalse;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>GBool rawOrder = gFalse;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(0, 132, 0);"><span style="color: #000000"><span class="Apple-tab-span" style="white-space:pre"> </span>GBool htmlMeta = gTrue; </span>// required to get the bounding box information</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span><span style="color: #bb2ca2">int</span> firstPage = <span style="color: #272ad8">1</span>;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span><span style="color: #bb2ca2">int</span> lastPage = PopplerParser::doc->getNumPages();</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>TextOutputDev *textOut;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>std::string page_text;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>std::string pages_text_data;</div><div style="margin: 0px; font-size: 11px; font-family: 
Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>std::stringstream ss;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(0, 132, 0);"><span style="color: #000000"><span class="Apple-tab-span" style="white-space:pre"> </span></span>//Word Features</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span><span style="color: #bb2ca2">double</span> xMinA, yMinA, xMaxA, yMaxA, r, g, b, fontSize;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>TextWord *word;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>GooString* fontName;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>GBool underLined;</div><p style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><span class="Apple-tab-span" style="white-space:pre"> </span><br class="webkit-block-placeholder"></p><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>TextFontInfo *fontInfo; </div><div style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>GBool fixedWidth = gFalse; </div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>GBool serif = gFalse;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> 
</span>GBool symbolic = gFalse;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>GBool italic = gFalse;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>GBool bold = gFalse;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(0, 132, 0);"><span style="color: #000000"><span class="Apple-tab-span" style="white-space:pre"> </span></span>//create our page</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(0, 132, 0);"><span style="color: #000000"><span class="Apple-tab-span" style="white-space:pre"> </span> </span>// read the config file; this is required </div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>globalParams = <span style="color: #bb2ca2">new</span> GlobalParams();</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(0, 132, 0);"><span style="color: #000000"><span class="Apple-tab-span" style="white-space:pre"> </span></span>//create a textOut</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>textOut = <span style="color: #bb2ca2">new</span> TextOutputDev(<span style="color: #bb2ca2">NULL</span>, physLayout, fixedPitch, rawOrder, htmlMeta);</div><p style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><span class="Apple-tab-span" style="white-space:pre"> </span><br class="webkit-block-placeholder"></p><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(0, 132, 
0);"><span style="color: #000000"><span class="Apple-tab-span" style="white-space:pre"> </span></span>//walk over the pages</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span><span style="color: #bb2ca2">for</span> (<span style="color: #bb2ca2">int</span> page = firstPage; page <= lastPage; ++page) {</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>PopplerParser::doc->displayPage(textOut, page, resolution, resolution, <span style="color: #272ad8">0</span>, gTrue, gFalse, gFalse);</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>TextWordList *wordlist = textOut->makeWordList();</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span><span style="color: #bb2ca2">const</span> <span style="color: #bb2ca2">int</span> word_length = wordlist != <span style="color: #bb2ca2">NULL</span> ? 
wordlist->getLength() : <span style="color: #272ad8">0</span>;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span><span style="color: #bb2ca2">if</span> (word_length > <span style="color: #272ad8">0</span>) {</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(0, 132, 0);"><span style="color: #000000"><span class="Apple-tab-span" style="white-space:pre"> </span></span>//words on the page</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span><span style="color: #bb2ca2">for</span> (<span style="color: #bb2ca2">int</span> i = <span style="color: #272ad8">0</span>; i < word_length; ++i) {</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>word = wordlist->get(i);</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(0, 132, 0);"><span style="color: #000000"> <span class="Apple-tab-span" style="white-space:pre"> </span></span>//Word Features</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>word->getColor(&r , &g, &b);</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>underLined = word->isUnderlined();</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>fontSize = word->getFontSize();</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>word->getBBox(&xMinA, &yMinA, &xMaxA, &yMaxA);</div><div 
style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>fontName = word->getFontName(<span style="color: #272ad8">0</span>);</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span><span style="color: #bb2ca2">const</span> std::string wordString = word->getText()->getCString();</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span><span style="color: #008400">//fontInfo</span></div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>fontInfo = word->getFontInfo(<span style="color: #272ad8">0</span>); <span style="color: #008400">//do this for the first char in the word</span></div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>fontName = fontInfo->getFontName();</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>fixedWidth = fontInfo->isFixedWidth();</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>serif = fontInfo->isSerif();</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>symbolic = fontInfo->isSymbolic();</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>italic = fontInfo->isItalic();</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>bold = fontInfo->isBold();</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: 
Menlo; color: rgb(0, 132, 0);"><span style="color: #000000"> </span>// escape quotes in string</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> std::stringstream newStr;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span style="color: #bb2ca2">for</span> (<span style="color: #bb2ca2">int</span> i = <span style="color: #272ad8">0</span>; i < wordString.length(); ++i) {</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span style="color: #bb2ca2">if</span> (wordString[i] == <span style="color: #272ad8">'"'</span> || wordString[i] == <span style="color: #272ad8">'\\'</span>) {</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> newStr << <span style="color: #d12f1b">"\\"</span>;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> } </div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> newStr << wordString[i]; </div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> }</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(0, 132, 0);"><span style="color: #000000"> <span class="Apple-tab-span" style="white-space:pre"> </span></span>//construct our string output</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> ss << <span style="color: #d12f1b">"{"</span></div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span><< <span style="color: #d12f1b">"\"xMin\":\""</span> << xMinA << <span style="color: #d12f1b">"\",\"yMin\":\""</span> << yMinA << <span style="color: #d12f1b">"\",\"xMax\":\""</span> << xMaxA << <span style="color: #d12f1b">"\",\"yMax\":\""</span> << yMaxA </div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(209, 47, 27);"><span style="color: #000000"> <span class="Apple-tab-span" style="white-space:pre"> 
</span><< </span>"\",\"red\":\""<span style="color: #000000"> << r << </span>"\",\"green\":\""<span style="color: #000000"> << g << </span>"\",\"blue\":\""<span style="color: #000000"><< b </span></div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span><< <span style="color: #d12f1b">"\",\"fontSize\":\""</span> << fontSize </div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span><< <span style="color: #d12f1b">"\",\"italic\":\""</span> << italic </div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span><< <span style="color: #d12f1b">"\",\"serif\":\""</span> << serif</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span><< <span style="color: #d12f1b">"\",\"symbolic\":\""</span> << symbolic</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span><< <span style="color: #d12f1b">"\",\"fixedWidth\":\""</span> << fixedWidth</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span><< <span style="color: #d12f1b">"\",\"bold\":\""</span> << bold</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span><< <span style="color: #d12f1b">"\",\"fontName\":\""</span> << fontName->getCString()</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span><< <span style="color: #d12f1b">"\",\"word\":\""</span> << newStr.str() << <span style="color: #d12f1b">"\",\"page\":\""</span><< page </div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> << <span style="color: 
#d12f1b">"\"}"</span></div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> << std::endl;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(0, 132, 0);"><span style="color: #000000"> <span class="Apple-tab-span" style="white-space:pre"> </span></span>//std::cout << ss.str() << std::endl;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> <span class="Apple-tab-span" style="white-space:pre"> </span>}</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>}</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>}</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; min-height: 13px;"><br></div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span><span style="color: #bb2ca2">delete</span> textOut;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span><span style="color: #bb2ca2">delete</span> globalParams;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo; color: rgb(0, 132, 0);"><span style="color: #000000"><span class="Apple-tab-span" style="white-space:pre"> </span></span>//delete wordlist;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span>pages_text_data = ss.str();</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space:pre"> </span><span style="color: #bb2ca2">return</span> pages_text_data;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">}</div></div></body></html>