[poppler] pdftotext converts all non-breaking spaces U+a0 and U+202f into U+20 (breakable)
Daniel Flipo
daniel.flipo at free.fr
Sat Sep 9 18:03:10 UTC 2017
On 02/09/2017 at 10:19, Daniel Flipo wrote:
> Even when pdftotext is run with option "-enc UTF-8", it converts all
> non-breaking spaces U+a0 and U+202f into U+20 (breakable). I wonder
> whether this feature is intended or not.
Digging into the code (v. 0.59), I found the culprit: in file UTF.cc,
function UnicodeIsWhitespace lists all Unicode spaces on which to break
lines into words (used *only* in TextOutputDev.cc line 2610).
UnicodeIsWhitespace includes both non-breaking spaces U+a0 and U+202f
(all the others are fine). Is it intended to break lines into words on
/non-breaking/ spaces?
Deleting those two characters from UnicodeIsWhitespace and recompiling
poppler gave me a pdftotext binary which works fine for me now… but I am
not sure it doesn't break anything else in poppler.
Could the option of removing 0x00A0 and 0x202F from UnicodeIsWhitespace
be investigated?
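
To make the proposal concrete, here is a minimal sketch of the idea, not
poppler's actual code. The function names isUnicodeWhitespace,
isNoBreakSpace and isBreakableWhitespace are hypothetical, and the list of
code points follows the Unicode White_Space property; it is an assumption
that it matches the table in UTF.cc.

```cpp
// Sketch only: hypothetical helpers, not taken from poppler's UTF.cc.

// True for code points carrying the Unicode White_Space property
// (assumed here to approximate the list used by UnicodeIsWhitespace).
constexpr bool isUnicodeWhitespace(char32_t c)
{
    return (c >= 0x0009 && c <= 0x000D) || c == 0x0020 || c == 0x0085 ||
           c == 0x00A0 || c == 0x1680 || (c >= 0x2000 && c <= 0x200A) ||
           c == 0x2028 || c == 0x2029 || c == 0x202F || c == 0x205F ||
           c == 0x3000;
}

// The two non-breaking spaces discussed in this message.
constexpr bool isNoBreakSpace(char32_t c)
{
    return c == 0x00A0 || c == 0x202F;
}

// Whitespace on which a line may be split into words:
// every Unicode space except U+00A0 and U+202F.
constexpr bool isBreakableWhitespace(char32_t c)
{
    return isUnicodeWhitespace(c) && !isNoBreakSpace(c);
}
```

With such a predicate, U+0020 and U+2009 would still separate words, while
U+00A0 and U+202F would stay inside the word and reach the output unchanged.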
Thanks in advance, cheers,
--
Daniel Flipo