[poppler] [PATCH] Fixup LaTeX composed characters

Tim Brody tdb2 at ecs.soton.ac.uk
Wed May 11 02:58:25 PDT 2011

On Tue, 10 May 2011 19:15:51 +0100, Albert Astals Cid <aacid at kde.org> wrote:
> On Tuesday, May 10, 2011, Tim Brody wrote:
>> > Sincerely i am quite hesitant to apply your patch since it "breaks"
>> > pdftotext usage in the console (since it seems most of the apps in the
>> > console are not able to understand the non-composed form)

>> Anyway, my patch is only a fix-up of overprinting characters that would
>> otherwise get mangled by pdftotext. It just makes it more apparent that
>> your tool-chain is broken because it's producing more non-ASCII7
>> code-points.
> By tool-chain you mean pdftotext?

I mean whatever you're piping to. I haven't encountered a problem with
decomposed Unicode in bash/less/vim.

>> I agree that pdftotext should by default output NFC but you need to
>> decide whether to implement an NFC against the out of date poppler
>> tables or link to icu.
> I don't think linking to icu (which last i checked is a huuuuuuuuuge
> monster way bigger than poppler itself in size), otoh why you say
> poppler tables are out of date? Nobody has complained about something
> not working :D

Normalisation relies on the canonical character compositions, which come
from the Unicode tables. The poppler .h files are dated 2008 and there have
been two new Unicode versions since 2008 (assuming the tables used then
were current). I'm not saying they're broken, only that the Unicode tables
have changed and will keep changing.
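
As a rough illustration of why the table version matters: canonical
composition is looked up in whatever Unicode data the normaliser ships with.
The sketch below uses Python's standard unicodedata module, not poppler's
tables; the helper names table_version and compose are made up for this
example.

```python
import unicodedata


def table_version() -> str:
    """Unicode version of the tables bundled with this Python runtime."""
    return unicodedata.unidata_version


def compose(text: str) -> str:
    """Apply canonical composition (NFC) using the bundled tables."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    # 'a' followed by U+0308 COMBINING DIAERESIS, the decomposed form.
    decomposed = "a\u0308"
    composed = compose(decomposed)
    print("Unicode tables:", table_version())
    print([hex(ord(c)) for c in decomposed], "->",
          [hex(ord(c)) for c in composed])
```

A normaliser built against older tables has no data for characters that
were assigned after those tables were generated.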

Regardless, I will normalise the output from pdftotext to NFKC anyway - I
just need it to not mangle TeX-generated PDFs. I don't see this as
dependent on fixing pdftotext's normalisation.
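
A minimal sketch of that kind of post-processing step, assuming the
extracted text is already available as a Python string. This is not part of
the patch; normalise_output is a hypothetical helper name, and the sample
strings are invented for illustration.

```python
import unicodedata


def normalise_output(text: str, form: str = "NFKC") -> str:
    """Normalise extracted text; NFKC composes and folds compatibility forms."""
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    # Decomposed o + U+0308, as overprinted TeX accents might be extracted.
    sample = "Schro\u0308dinger"
    fixed = normalise_output(sample)
    print(len(sample), "code points ->", len(fixed), "code points")

    # NFKC also replaces compatibility characters such as the U+FB01 ligature.
    print(normalise_output("\ufb01nal"))
```

Passing form="NFC" instead keeps compatibility characters such as ligatures
intact and only recomposes base characters with their combining marks.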

All the best,
