[poppler] pdftotext converts all non-breaking spaces U+a0 and U+202f into U+20 (breakable)

Sat Sep 9 18:03:10 UTC 2017

On 02/09/2017 at 10:19, Daniel Flipo wrote:

> Even when pdftotext is run with option "-enc UTF-8", it converts all
> non-breaking spaces U+a0 and U+202f into U+20 (breakable). I wonder
> whether this feature is intended or not.

Digging into the code (v. 0.59), I found the culprit: in file UTF.cc,
function UnicodeIsWhitespace lists all Unicode spaces on which to break
lines into words (used *only* in TextOutputDev.cc line 2610).

UnicodeIsWhitespace includes both non-breaking spaces U+a0 and U+202f
(all the others are fine). Is it intended to break lines into words on
/non-breaking/ spaces?

Deleting those two characters from UnicodeIsWhitespace and recompiling
poppler produced a pdftotext binary which works fine for me now… but I
am not sure it doesn't break anything else in poppler.
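
To make the effect concrete, here is a minimal, self-contained sketch.
It is not poppler's code: isWordBreakingSpace and splitWords are
hypothetical names, the list of code points is only a sample of what
UTF.cc contains, and the word splitting is a simplified model that
assumes the separator is dropped and replaced by U+20 on output.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a whitespace test; NOT poppler's actual code.
// The code points below are an illustrative sample, not the full list.
bool isWordBreakingSpace(char32_t c, bool treatNoBreakAsSpace) {
    switch (c) {
    case 0x0009: case 0x000A: case 0x000D: case 0x0020:  // tab, LF, CR, space
    case 0x2002: case 0x2003: case 0x2009:               // en, em, thin space
        return true;
    case 0x00A0:  // NO-BREAK SPACE
    case 0x202F:  // NARROW NO-BREAK SPACE
        return treatNoBreakAsSpace;
    default:
        return false;
    }
}

// Simplified model of word extraction: split on whitespace and drop the
// separator, so whatever separated two words is lost.
std::vector<std::u32string> splitWords(const std::u32string &text,
                                       bool treatNoBreakAsSpace) {
    std::vector<std::u32string> words;
    std::u32string current;
    for (char32_t c : text) {
        if (isWordBreakingSpace(c, treatNoBreakAsSpace)) {
            if (!current.empty()) {
                words.push_back(current);
                current.clear();
            }
        } else {
            current.push_back(c);
        }
    }
    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
}
```

With treatNoBreakAsSpace set to true, a string such as "10<U+a0>km"
is cut into two words and the U+a0 is gone; with false it stays one
word and the non-breaking space survives inside it.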

Could the option of removing 0x00A0 and 0x202F from UnicodeIsWhitespace
be investigated?

Thanks in advance, cheers,
-- 
Daniel Flipo