[poppler] pdftotext converts all non-breaking spaces U+a0 and U+202f into U+20 (breakable)

Sun Sep 10 08:17:52 UTC 2017

Le 10/09/2017 à 02:17, Adrian Johnson a écrit :
>> On 10/09/17 03:33, Daniel Flipo wrote:
>>
>> Digging into the code (v. 0.59), I found the culprit: in file UTF.cc,
>> function UnicodeIsWhitespace lists all Unicode spaces on which to break
>> lines into words (used *only* in TextOutputDev.cc line 2610).
>>
>> UnicodeIsWhitespace includes both non-breaking spaces U+a0 and U+202f
>> (all others are fine). Is it intended to break lines into words on
>> /non-breaking/ spaces?
> 
> The bug that added that code is:
> 
> https://bugs.freedesktop.org/show_bug.cgi?id=97399
> 
> So at least for some PDFs, yes it is intentional. I tested with Adobe
> Reader and it is also converting non-breaking spaces to U+0020.

OK thanks, I understand the situation now.

> The solution is not as simple as removing U+00A0 from
> UnicodeIsWhitespace. That doesn't mean we can't do a better job of
> handling non-breaking space in PDFs. But it would require a non-trivial
> solution. Maybe check the ratio of non-breaking space characters to
> space characters on a page. If there is more non-breaking space than
> space, assume the PDF is broken and convert to space. If there is less
> non-breaking space than space, preserve the non-breaking space characters.
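
For reference, the ratio heuristic described above might be sketched
roughly as follows. This is only an illustration, not poppler code: the
Unicode alias is a stand-in for poppler's own typedef, and
isNoBreakSpace and shouldFoldNoBreakSpaces are hypothetical names. It
also assumes the page's characters are available as one flat sequence
of code points.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for poppler's Unicode code point type (assumption).
using Unicode = std::uint32_t;

// Hypothetical helper: true for the two no-break spaces discussed here.
static bool isNoBreakSpace(Unicode u) {
    return u == 0x00A0 || u == 0x202F;
}

// Hypothetical page-level test. Returns true when no-break spaces
// outnumber ordinary U+0020 spaces, i.e. when the PDF is assumed to be
// broken and no-break spaces should be converted to plain spaces.
bool shouldFoldNoBreakSpaces(const std::vector<Unicode> &page) {
    std::size_t nbsp = 0;
    std::size_t space = 0;
    for (Unicode u : page) {
        if (isNoBreakSpace(u)) {
            ++nbsp;
        } else if (u == 0x0020) {
            ++space;
        }
    }
    return nbsp > space;
}
```

A page with equal counts, or with no spaces at all, keeps its no-break
spaces under this sketch; where exactly to put the threshold is an open
choice.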

Instead of relying on a statistical test to decide whether U+00A0 is an
intentional nbsp or not, I suggest adding an option to pdftotext that
removes U+00A0 and U+202F from UnicodeIsWhitespace for users who want
nbsp to be honoured.
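
To make the idea concrete, here is a minimal sketch of an option-aware
whitespace test. Everything in it is hypothetical: WhitespaceOptions,
honourNoBreakSpaces and isWordBreakingWhitespace are names made up for
illustration, the Unicode alias stands in for poppler's typedef, and the
list of other spaces is a short illustrative subset rather than the
actual table in UTF.cc.

```cpp
#include <cstdint>

// Stand-in for poppler's Unicode code point type (assumption).
using Unicode = std::uint32_t;

// Hypothetical setting that a new pdftotext option could control.
// The default keeps the current behaviour: no-break spaces break words.
struct WhitespaceOptions {
    bool honourNoBreakSpaces = false;
};

// Hypothetical variant of the whitespace test used for word breaking.
bool isWordBreakingWhitespace(Unicode u, const WhitespaceOptions &opts) {
    // U+00A0 and U+202F only count as breaking whitespace when the
    // user has not asked for them to be honoured.
    if (u == 0x00A0 || u == 0x202F) {
        return !opts.honourNoBreakSpaces;
    }
    switch (u) {
    case 0x0009: // tab
    case 0x000A: // line feed
    case 0x000C: // form feed
    case 0x000D: // carriage return
    case 0x0020: // space
    case 0x3000: // ideographic space
        return true; // illustrative subset, not the full list
    default:
        return false;
    }
}
```

With the default options nothing changes for existing users; only those
who set the flag would get U+00A0 and U+202F passed through unchanged.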

What do you think?

-- 
Daniel Flipo