[poppler] pdftotext converts all non-breaking spaces U+a0 and U+202f into U+20 (breakable)

Sat Sep 2 08:19:20 UTC 2017

Hi all,

Even when pdftotext is run with option "-enc UTF-8", it converts all
non-breaking spaces U+a0 and U+202f into U+20 (breakable). I wonder
whether this behaviour is intended.

In French, high punctuation characters (:;!?) should be preceded by
non-breaking spaces; the Unicode characters U+a0 for `:' and U+202f
(thin) for the three others are perfect for this purpose.

French quote characters `«' and `»' also need U+a0 non-breaking space.
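
To make the rule above concrete, here is a minimal Python sketch that
inserts these spaces into text typed with plain U+20 spaces. The name
french_spaces is hypothetical, and the sketch only encodes the convention
as described in this message; it is not anything poppler provides.

```python
import re

NBSP = "\u00a0"   # NO-BREAK SPACE
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE (thin)


def french_spaces(text: str) -> str:
    """Replace U+20 with the non-breaking spaces described above.

    Hypothetical helper, for illustration only.
    """
    # U+a0 before a colon.
    text = re.sub(r" (?=:)", NBSP, text)
    # U+a0 after an opening and before a closing French quote.
    text = re.sub(r"(?<=«) ", NBSP, text)
    text = re.sub(r" (?=»)", NBSP, text)
    # U+202f before the three other high punctuation characters.
    text = re.sub(r" (?=[;!?])", NNBSP, text)
    return text


sample = french_spaces("a : b ; c ! d ? « x ».")
# List the code point of every space-like character in the result.
print([f"U+{ord(ch):04X}" for ch in sample if ch.isspace()])
```

Spaces that follow a punctuation character are left as ordinary U+20,
since a line break is acceptable there.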

When I run:

pdftotext -enc UTF-8 file.pdf file.txt

on a Unicode-encoded PDF file which holds such sequences, the output
file shows all high punctuation characters preceded by the same
breakable U+20 space, which looks wrong to me.
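
One way to check which space characters actually end up in the output is
to count them by code point. The following Python sketch does that for
the three characters discussed here; the function names are hypothetical
and "file.txt" is simply the output name used in the command above.

```python
from collections import Counter

# The three space characters discussed in this report.
SPACES = {
    "\u0020": "U+0020 SPACE",
    "\u00a0": "U+00A0 NO-BREAK SPACE",
    "\u202f": "U+202F NARROW NO-BREAK SPACE",
}


def count_spaces(text: str) -> dict:
    """Return how often each of the three space characters occurs."""
    counts = Counter(ch for ch in text if ch in SPACES)
    return {SPACES[ch]: n for ch, n in counts.items()}


def count_spaces_in_file(path: str) -> dict:
    """Same count for a UTF-8 text file such as pdftotext output."""
    with open(path, encoding="utf-8") as handle:
        return count_spaces(handle.read())
```

Calling count_spaces_in_file("file.txt") on the pdftotext output would
report only U+0020 if the behaviour described here occurs, whereas text
that kept its non-breaking spaces would also list U+00A0 and U+202F.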

I am using version 0.48 included in Debian Stretch.

I attach a simple test file "spaces.pdf" (fyi it was produced by LuaTeX)
and "spaces.txt", the output of "pdftotext -enc UTF-8".

Cheers,
-- 
Daniel Flipo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spaces.pdf
Type: application/pdf
Size: 3928 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20170902/88525379/attachment-0001.pdf>
-------------- next part --------------
a : b ; c ! d ! « x ».