<html> <head> <base href="https://bugs.freedesktop.org/" /> </head> <body><table border="1" cellspacing="0" cellpadding="8"> <tr> <th>Priority</th> <td>medium </td> </tr> <tr> <th>Bug ID</th> <td><a class="bz_bug_link bz_status_NEW " title="NEW --- - poppler_page_get_text() ordering does not agree with poppler_page_get_text_layout() as docs say it should" href="https://bugs.freedesktop.org/show_bug.cgi?id=69608">69608</a> </td> </tr> <tr> <th>Assignee</th> <td>poppler-bugs@lists.freedesktop.org </td> </tr> <tr> <th>Summary</th> <td>poppler_page_get_text() ordering does not agree with poppler_page_get_text_layout() as docs say it should </td> </tr> <tr> <th>Severity</th> <td>normal </td> </tr> <tr> <th>Classification</th> <td>Unclassified </td> </tr> <tr> <th>OS</th> <td>Linux (All) </td> </tr> <tr> <th>Reporter</th> <td>peter.waller@gmail.com </td> </tr> <tr> <th>Hardware</th> <td>x86-64 (AMD64) </td> </tr> <tr> <th>Status</th> <td>NEW </td> </tr> <tr> <th>Version</th> <td>unspecified </td> </tr> <tr> <th>Component</th> <td>glib frontend </td> </tr> <tr> <th>Product</th> <td>poppler </td> </tr></table> <p> <div> <pre>Whilst trying to extract textual information from PDFs, it seems that either the documentation for poppler_page_get_text_layout() is incorrect or there is a bug in poppler_page_get_text(). The documentation for poppler_page_get_text_layout() says: "The position in the array represents an offset in the text returned by poppler_page_get_text()". (Note that the documentation says the same for poppler_page_get_text_attributes().) However, this doesn't seem to be the case. The problem is described succinctly here, complete with a short piece of code that reproduces it: <a href="http://www.mail-archive.com/poppler@lists.freedesktop.org/msg06238.html">http://www.mail-archive.com/poppler@lists.freedesktop.org/msg06238.html</a> The linked PDF [1] gives 1541 glyphs from poppler_page_get_text() and 1477 glyphs from poppler_page_get_text_layout(). 
The mismatch does not appear to be related to Unicode encoding. In addition to the glyph counts not agreeing, the order doesn't seem to match up either, from what I can tell. [1] <a href="http://ww1.microchip.com/downloads/en/DeviceDoc/22197B.pdf">http://ww1.microchip.com/downloads/en/DeviceDoc/22197B.pdf</a></pre> </div> </p> <hr> <span>You are receiving this mail because:</span> <ul> <li>You are the assignee for the bug.</li> </ul> </body> </html>