[poppler] Confusion about poppler_page_get_text and poppler_page_get_text_layout

Jonathan Kew jfkthame at googlemail.com
Tue Dec 20 03:32:16 PST 2011


On 20 Dec 2011, at 00:56, Rupert Swarbrick wrote:

> I've spent a little more time with the question, and made a C-based
> tester to avoid any influence from lispyness. I've attached the code
> below.
> 
> On a PDF of the source code for the tester, which I've also attached, I
> get the expected output (I think). Both lengths are 1544, and the
> numbers increase in an obvious manner.
> 
> However, the Microchip datasheet on which I was testing my code still
> fails weirdly. You can get it from [1] (not sure about whether I should
> be posting it to a mailing list) and the first few lines I get look
> like:
> 
> String length: 1541
> Rectangles:    1477

Just a guess, since I haven't looked at the code or the API involved: is the "string length" here a count of UTF-8 _bytes_, while the "rectangles" are one per _character_? If so, you'd see a discrepancy as soon as the text contains non-ASCII characters (bullets, curly quotes, em-dashes, accented letters, etc.), because each of those takes more than one byte in UTF-8.
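
To make the byte/character distinction concrete, here is a minimal sketch in plain C. The helper name utf8_char_count is made up for illustration and is not part of poppler; it assumes the input is valid, NUL-terminated UTF-8. If GLib is available, g_utf8_strlen() should give the same kind of count.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a poppler function): count the characters
 * (code points) in a NUL-terminated string, assuming it is valid UTF-8.
 * Continuation bytes have the bit pattern 10xxxxxx, so every byte that
 * does NOT match that pattern starts a new character. */
static size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For pure ASCII text, strlen() and utf8_char_count() agree. For a string containing a bullet (3 bytes in UTF-8) or an accented letter (2 bytes), strlen() comes out larger, which is the kind of gap that could explain 1541 versus 1477. Whether that is what actually happens here would need checking against the tester code.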

JK


