[poppler] Confusion about poppler_page_get_text and poppler_page_get_text_layout

Rupert Swarbrick rswarbrick at gmail.com
Tue Dec 20 03:40:01 PST 2011


Jonathan Kew <jfkthame at googlemail.com> writes:
> On 20 Dec 2011, at 00:56, Rupert Swarbrick wrote:
>> However, the Microchip datasheet on which I was testing my code still
>> fails weirdly. You can get it from [1] (not sure about whether I should
>> be posting it to a mailing list) and the first few lines I get look
>> like:
>> 
>> String length: 1541
>> Rectangles:    1477
>
> Guessing, without having looked at the code or API involved... is the
> "string length" here a count of UTF-8 _bytes_, but the "rectangles"
> are one per _character_? If so, you'd get a discrepancy as soon as
> non-ASCII characters (such as bullets, curly quotes, em-dashes,
> accented letters, etc, etc) are present.
>
> JK

That makes sense, but unfortunately I think I'm already counting
characters rather than bytes:

    /* g_utf8_strlen returns a glong, so print it with %ld, not %u. */
    printf ("String length: %ld\nRectangles:    %u\n\n",
            g_utf8_strlen (text, -1), n_rects);

(The Lisp code got that right the first time. I changed the strlen call
in my C code to g_utf8_strlen when the two gave different answers!)
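
For anyone following along, here is a minimal sketch of the byte versus
character distinction Jonathan describes. It deliberately avoids GLib so
that it compiles on its own: utf8_char_count and utf8_extra_bytes are
hypothetical helpers written for this illustration, not GLib or poppler
functions, and they assume the input is valid, NUL-terminated UTF-8. On
such input the count should agree with what g_utf8_strlen (text, -1)
reports.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of GLib or poppler): count the code
 * points in a valid, NUL-terminated UTF-8 string. Continuation bytes
 * always look like 10xxxxxx, so every other byte starts a character. */
static size_t
utf8_char_count (const char *s)
{
  size_t n = 0;

  assert (s != NULL);
  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xC0) != 0x80)
      n++;
  return n;
}

/* Hypothetical helper: how many more bytes strlen() reports than there
 * are characters. Zero for pure ASCII text. */
static size_t
utf8_extra_bytes (const char *s)
{
  return strlen (s) - utf8_char_count (s);
}
```

A bullet (U+2022) is three bytes in UTF-8, so for the string
"\xE2\x80\xA2 item" strlen gives 8 while utf8_char_count gives 6. That
is the kind of gap a strlen-based count would show against one rectangle
per character, but it does not explain a mismatch that remains after
switching to a character count.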

:-(

Rupert