[Poppler-bugs] [Bug 102651] New: pdftotext converts all non-breaking spaces U+A0 and U+202F into U+20

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Mon Sep 11 09:31:29 UTC 2017


https://bugs.freedesktop.org/show_bug.cgi?id=102651

            Bug ID: 102651
           Summary: pdftotext converts all non-breaking spaces U+A0 and
                    U+202F into U+20
           Product: poppler
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: medium
         Component: utils
          Assignee: poppler-bugs at lists.freedesktop.org
          Reporter: daniel.flipo at free.fr

Created attachment 134154
  --> https://bugs.freedesktop.org/attachment.cgi?id=134154&action=edit
PDF file with non-breaking spaces to be preserved

The fix for bug #97399 added the non-breaking spaces U+A0 and U+202F to the
function UnicodeIsWhitespace, which holds the list of all spaces used to break
lines into words.

As a result, pdftotext converts these non-breaking spaces into breakable U+20
spaces. In some cases (ties such as "Mr Bean", high punctuation in French,
etc.) these non-breaking spaces are added intentionally and should be preserved
as such in the text or HTML output.

A pdftotext option that removes these two spaces from UnicodeIsWhitespace
would solve the issue.
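
To make the request concrete, here is a minimal Python sketch of the behaviour
described above. It is not poppler's code: the names is_whitespace,
normalize_spaces and keep_nbsp are hypothetical, and the set of breaking spaces
is an assumption chosen for illustration, not the actual contents of
UnicodeIsWhitespace.

```python
# Illustrative model only; poppler's real implementation is in C++ and
# its whitespace list may differ from the assumed sets below.
BREAKING_SPACES = frozenset({0x0009, 0x000A, 0x000C, 0x000D, 0x0020})
NON_BREAKING_SPACES = frozenset({0x00A0, 0x202F})


def is_whitespace(code_point, keep_nbsp=False):
    """Return True if the code point should act as a word separator.

    keep_nbsp is a hypothetical option: when set, U+00A0 and U+202F
    are no longer classified as whitespace.
    """
    if code_point in BREAKING_SPACES:
        return True
    return code_point in NON_BREAKING_SPACES and not keep_nbsp


def normalize_spaces(text, keep_nbsp=False):
    """Split text into words on whitespace and rejoin them with U+0020."""
    words = []
    current = []
    for ch in text:
        if is_whitespace(ord(ch), keep_nbsp):
            # A separator ends the current word, if there is one.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return " ".join(words)
```

With the default, "Mr\u00a0Bean" comes back with a plain U+20 between the two
words; with keep_nbsp=True the U+A0 stays inside a single word and survives.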

I attach a small PDF file containing those non-breaking spaces for testing.
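
One possible way to check the extracted text is to count which space code
points it contains. The helper below is only a sketch; count_spaces is a
hypothetical name, and it works on any string, for example the text that
pdftotext produced for the attached file.

```python
from collections import Counter

# The three space characters discussed in this report.
SPACES = {"\u0020": "U+0020", "\u00a0": "U+00A0", "\u202f": "U+202F"}


def count_spaces(text):
    """Count occurrences of U+0020, U+00A0 and U+202F in text."""
    counts = Counter(ch for ch in text if ch in SPACES)
    return {label: counts[ch] for ch, label in SPACES.items()}
```

If the non-breaking spaces were preserved, the U+00A0 and U+202F counts are
non-zero; if they were converted, only the U+0020 count grows.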

-- 
You are receiving this mail because:
You are the assignee for the bug.