[poppler] pdftotext converts all non-breaking spaces U+a0 and U+202f into U+20 (breakable)
Adrian Johnson
ajohnson at redneon.com
Sun Sep 10 00:17:58 UTC 2017
On 10/09/17 03:33, Daniel Flipo wrote:
>
> On 02/09/2017 at 10:19, Daniel Flipo wrote:
>
>> Even when pdftotext is run with option "-enc UTF-8", it converts all
>> non-breaking spaces U+a0 and U+202f into U+20 (breakable). I wonder
>> whether this feature is intended or not.
>
> Digging into the code (v. 0.59), I found the culprit: in file UTF.cc,
> function UnicodeIsWhitespace lists all Unicode spaces on which to break
> lines into words (used *only* in TextOutputDev.cc line 2610).
>
> UnicodeIsWhitespace includes both non-breaking spaces U+a0 and U+202f
> (all the others are fine). Is it intended to break lines into words on
> /non-breaking/ spaces?

The bug that added that code is:
https://bugs.freedesktop.org/show_bug.cgi?id=97399

So at least for some PDFs, yes it is intentional. I tested with Adobe
Reader and it is also converting non-breaking spaces to U+0020.

The solution is not as simple as removing U+00A0 from
UnicodeIsWhitespace. That doesn't mean we can't do a better job of
handling non-breaking space in PDFs. But it would require a non-trivial
solution. Maybe check the ratio of non-breaking space characters to
space characters on a page. If there is more non-breaking space than
space, assume the PDF is broken and convert to space. If there is less
non-breaking space than space, preserve the non-breaking space characters.

I suggest creating a bug for this and attaching your test cases. Also
attach some real world examples so we can see the ratios of space to
non-breaking space characters.
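
To make the ratio idea above concrete, here is a minimal sketch of the
per-page check. It is not poppler code: isNonBreakingSpace,
shouldPreserveNbsp and normalizePage are hypothetical names, the page is
assumed to be available as a sequence of UTF-32 code points, and treating
a tie as "convert" is an assumption, since the suggestion only covers the
"more" and "less" cases.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not poppler's UnicodeIsWhitespace): true only for
// the two non-breaking spaces discussed in this thread.
constexpr bool isNonBreakingSpace(char32_t c) {
    return c == 0x00A0 || c == 0x202F;
}

// Decide whether the non-breaking spaces on one page look deliberate.
// Rule of thumb from the message: more non-breaking spaces than ordinary
// spaces suggests a broken PDF. Assumption: a tie is treated as "broken".
constexpr bool shouldPreserveNbsp(std::u32string_view pageText) {
    std::size_t plain = 0;
    std::size_t nbsp = 0;
    for (char32_t c : pageText) {
        if (c == 0x0020) {
            ++plain;
        } else if (isNonBreakingSpace(c)) {
            ++nbsp;
        }
    }
    return nbsp < plain;
}

// Return a copy of the page text in which non-breaking spaces are kept
// when they look deliberate and replaced by U+0020 otherwise.
std::u32string normalizePage(std::u32string_view pageText) {
    std::u32string out(pageText);
    if (!shouldPreserveNbsp(pageText)) {
        for (char32_t &c : out) {
            if (isNonBreakingSpace(c)) {
                c = 0x0020;
            }
        }
    }
    return out;
}
```

A real implementation would have to hook into the word-splitting logic
in TextOutputDev.cc rather than post-process a string, and the threshold
would need tuning against the real-world examples requested above.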
>
> Deleting those two characters from UnicodeIsWhitespace and recompiling
> poppler built a binary pdftotext which works fine for me now… but I am
> not sure it doesn't break anything else in poppler.
>
> Could the option of removing 0x00A0 and 0x202F from UnicodeIsWhitespace
> be investigated?
>
> Thanks in advance, cheers,
> --
> Daniel Flipo
> _______________________________________________
> poppler mailing list
> poppler at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/poppler
>