[poppler] Confusion about poppler_page_get_text and poppler_page_get_text_layout
Jonathan Kew
jfkthame at googlemail.com
Tue Dec 20 03:32:16 PST 2011
On 20 Dec 2011, at 00:56, Rupert Swarbrick wrote:
> I've spent a little more time with the question, and made a C-based
> tester to avoid any influence from lispyness. I've attached the code
> below.
>
> On a PDF of the source code for the tester, which I've also attached, I
> get the expected output (I think). Both lengths are 1544, and the
> numbers increase in an obvious manner.
>
> However, the Microchip datasheet on which I was testing my code still
> fails weirdly. You can get it from [1] (not sure about whether I should
> be posting it to a mailing list) and the first few lines I get look
> like:
>
> String length: 1541
> Rectangles: 1477
Guessing, without having looked at the code or the API involved: is the "string length" here a count of UTF-8 _bytes_, while the "rectangles" are one per _character_? If so, you'd get a discrepancy as soon as the text contains any non-ASCII characters (bullets, curly quotes, em-dashes, accented letters, etc.), since each of those takes more than one byte in UTF-8.
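
To check this guess, one could compare the byte length with the character count of the returned string. Below is a minimal sketch in plain C, assuming the string is valid, NUL-terminated UTF-8. The helper name utf8_char_count is made up for illustration and is not part of poppler; GLib's g_utf8_strlen() should do the same job if you would rather not roll your own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of poppler): count the Unicode characters
 * (code points) in a NUL-terminated string, assuming it is valid UTF-8.
 * Continuation bytes have the bit pattern 10xxxxxx, so every byte that is
 * NOT a continuation byte starts a new character. */
static size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For example, for the string "a\xE2\x80\xA2" "b" (an "a", a bullet U+2022, and a "b"), strlen() gives 5 while utf8_char_count() gives 3. If the guess is right, running this on the text from the datasheet page should give a character count that matches the number of rectangles. A gap of 64 between the two figures would be consistent with, say, 32 three-byte characters, though other mixes are equally possible.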
JK