[poppler] pdftotext converts all non-breaking spaces U+a0 and U+202f into U+20 (breakable)
Daniel Flipo
daniel.flipo at free.fr
Sat Sep 2 08:19:20 UTC 2017
Hi all,
Even when pdftotext is run with option "-enc UTF-8", it converts all
non-breaking spaces U+a0 and U+202f into U+20 (breakable). I wonder
whether this feature is intended or not.
In French, high punctuation characters (:;!?) should be preceded by
non-breaking spaces; the Unicode characters U+a0 for `:' and U+202f (thin)
for the three others are perfect for this purpose.
French quote characters `«' and `»' also need a U+a0 non-breaking space.
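The rule above can be sketched as a small lookup table. This is only an
illustration, assuming Python 3; the names NBSP, NNBSP, SPACE_BEFORE and
french_spacing are hypothetical and are not part of poppler or LuaTeX. It
also assumes the quote rule means a U+a0 after `«' and before `»'.

```python
import re

NBSP = "\u00a0"   # NO-BREAK SPACE
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE (thin)

# Space expected before each character, following the rule described above.
SPACE_BEFORE = {":": NBSP, ";": NNBSP, "!": NNBSP, "?": NNBSP, "»": NBSP}


def french_spacing(text):
    """Swap a plain U+20 before :;!?» and after « for its non-breaking variant."""
    text = re.sub(
        r" ([:;!?»])",
        lambda m: SPACE_BEFORE[m.group(1)] + m.group(1),
        text,
    )
    # Assumption: the opening quote is followed by U+a0.
    return text.replace("« ", "«" + NBSP)
```

Applied to a line typed with ordinary spaces, this yields the sequences one
would expect to survive a round trip through a Unicode-aware text extractor.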
When I run:
pdftotext -enc UTF-8 file.pdf file.txt
on a Unicode-encoded PDF file which holds such sequences, the output
file shows all high punctuation characters preceded by the same
breakable U+20 space, which looks wrong to me.
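One way to check which space characters actually ended up in the output is
to list their code points. The following is a minimal sketch, assuming
Python 3 and its standard unicodedata module; the names SPACES and
describe_spaces are hypothetical helpers, not anything provided by poppler.

```python
import unicodedata

# The three space characters discussed in this report.
SPACES = {"\u0020", "\u00a0", "\u202f"}


def describe_spaces(text):
    """Return (code point, Unicode name) for each of these spaces found in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch in text
        if ch in SPACES
    ]
```

For instance, calling describe_spaces(open("file.txt", encoding="utf-8").read())
on the pdftotext output would report only U+0020 entries if the non-breaking
spaces were lost.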
I am using version 0.48 included in Debian Stretch.
I attach a simple test file "spaces.pdf" (FYI, it was produced by LuaTeX)
and "spaces.txt", the output of "pdftotext -enc UTF-8".
Cheers,
--
Daniel Flipo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spaces.pdf
Type: application/pdf
Size: 3928 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20170902/88525379/attachment-0001.pdf>
-------------- next part --------------
a : b ; c ! d ! « x ».