[Poppler-bugs] [Bug 102651] New: pdftotext converts all non-breaking spaces U+A0 and U+202F into U+20
bugzilla-daemon at freedesktop.org
Mon Sep 11 09:31:29 UTC 2017
https://bugs.freedesktop.org/show_bug.cgi?id=102651
Bug ID: 102651
Summary: pdftotext converts all non-breaking spaces U+A0 and
U+202F into U+20
Product: poppler
Version: unspecified
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: medium
Component: utils
Assignee: poppler-bugs at lists.freedesktop.org
Reporter: daniel.flipo at free.fr
Created attachment 134154
--> https://bugs.freedesktop.org/attachment.cgi?id=134154&action=edit
PDF file with non-breaking spaces to be preserved
The fix for bug #97399 added the non-breaking spaces U+A0 and U+202F to the
function UnicodeIsWhitespace, which holds the list of all spaces used to break
lines into words.
As a result, pdftotext converts these non-breaking spaces into breakable U+20
spaces. In some cases (ties like "Mr Bean", high punctuation in French, etc.)
these non-breaking spaces are added intentionally and should be preserved as
such in the text or HTML output.
An option for pdftotext that removes these two spaces from
UnicodeIsWhitespace would solve the issue.
I attach a small PDF file with those non-breaking spaces for testing.
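
To illustrate the behaviour described above, here is a minimal sketch in Python
rather than poppler's C++. It is not poppler's code: the names is_word_break,
extract_text and preserve_nbsp are hypothetical, and the list of breaking spaces
is deliberately partial. It only shows how treating U+A0 and U+202F as word
separators turns them into U+20 on output, and how an opt-in flag could keep
them.

```python
# Illustrative only: hypothetical names, not taken from poppler's source.
NO_BREAK_SPACES = {0x00A0, 0x202F}  # NO-BREAK SPACE, NARROW NO-BREAK SPACE
# Partial list of ordinary (breakable) whitespace, enough for the sketch.
BREAKING_SPACES = {0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020}


def is_word_break(cp: int, preserve_nbsp: bool) -> bool:
    """Return True if code point cp should separate two words."""
    if cp in BREAKING_SPACES:
        return True
    # Reported behaviour (preserve_nbsp=False): no-break spaces split words too.
    return cp in NO_BREAK_SPACES and not preserve_nbsp


def extract_text(chars: str, preserve_nbsp: bool = False) -> str:
    """Split chars into words, then re-join the words with U+20."""
    words = []
    current = []
    for ch in chars:
        if is_word_break(ord(ch), preserve_nbsp):
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return " ".join(words)
```

With the flag off, extract_text("Mr\u00a0Bean") yields "Mr Bean" with a plain
U+20; with preserve_nbsp=True the U+A0 stays inside the word and survives in the
output.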
--
You are receiving this mail because:
You are the assignee for the bug.