[poppler] pdftotext converts all non-breaking spaces U+a0 and U+202f into U+20 (breakable)

Adrian Johnson ajohnson at redneon.com
Sun Sep 10 00:17:58 UTC 2017


On 10/09/17 03:33, Daniel Flipo wrote:
> 
> On 02/09/2017 at 10:19, Daniel Flipo wrote:
> 
>> Even when pdftotext is run with option "-enc UTF-8", it converts all
>> non-breaking spaces U+a0 and U+202f into U+20 (breakable). I wonder
>> whether this feature is intended or not.
> 
> Digging into the code (v. 0.59), I found the culprit: in file UTF.cc,
> function UnicodeIsWhitespace lists all Unicode spaces on which to break
> lines into words (used *only* in TextOutputDev.cc line 2610).
> 
> UnicodeIsWhitespace includes both non-breaking spaces U+a0 and U+202f
> (all the others are fine). Is it intended to break lines into words on
> /non-breaking/ spaces?

The bug that added that code is:

https://bugs.freedesktop.org/show_bug.cgi?id=97399

So at least for some PDFs, yes, it is intentional. I tested with Adobe
Reader and it also converts non-breaking spaces to U+0020.

The solution is not as simple as removing U+00A0 from
UnicodeIsWhitespace, because that would break the PDFs that code was
added for. That doesn't mean we can't do a better job of handling
non-breaking spaces in PDFs, but it would require a non-trivial
solution. One possibility is to check the ratio of non-breaking space
characters to ordinary space characters on a page. If there are more
non-breaking spaces than spaces, assume the PDF is broken and convert
them to spaces. If there are fewer non-breaking spaces than spaces,
preserve the non-breaking space characters.
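
To make the ratio idea concrete, here is a rough sketch in Python. It is
not poppler code: the function name normalize_page_spaces and the
constants are made up for illustration, it works on already-extracted
page text rather than inside TextOutputDev, and leaving a tie unchanged
is my assumption, since the heuristic only covers "more" and "less".

```python
# Hypothetical sketch of the per-page ratio heuristic; not poppler's API.
NBSP_CHARS = {"\u00a0", "\u202f"}  # NO-BREAK SPACE, NARROW NO-BREAK SPACE
SPACE = "\u0020"


def normalize_page_spaces(page_text: str) -> str:
    """Convert non-breaking spaces to U+0020 only when they outnumber
    ordinary spaces on the page, which suggests a broken PDF."""
    nbsp_count = sum(1 for ch in page_text if ch in NBSP_CHARS)
    space_count = page_text.count(SPACE)

    if nbsp_count > space_count:
        # Page looks broken: treat every non-breaking space as a space.
        return "".join(SPACE if ch in NBSP_CHARS else ch for ch in page_text)

    # Fewer (or, by assumption, equally many) non-breaking spaces:
    # keep them as they are.
    return page_text
```

A real fix would have to make this decision while words are still being
built, so the sketch only shows the decision rule, not where it would
hook into poppler.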

I suggest creating a bug for this and attaching your test cases. Also
attach some real-world examples so we can see the ratios of space to
non-breaking space characters.

> 
> Deleting those two characters from UnicodeIsWhiteSpace and recompiling
> poppler built a binary pdftotext which works fine for me now… but I am
> not sure it doesn't break anything else in poppler.
> 
> Could the option of removing 0x00A0 and 0x202F from UnicodeIsWhitespace
> be investigated?
> 
> Thanks in advance, cheers,
> --
> Daniel Flipo