[poppler] [PATCH] Fixup LaTeX composed characters

Sun May 15 06:57:09 PDT 2011

Tim - be aware that the PDF standard (ISO 32000-1:2008) refers to a specific version of Unicode (v4).  Support for any newer version could potentially introduce compatibility issues.

For the next version of PDF (2.0, ISO 32000-2) we are evaluating updating that reference.

Leonard

-----Original Message-----
From: poppler-bounces+leonardr=adobe.com at lists.freedesktop.org [mailto:poppler-bounces+leonardr=adobe.com at lists.freedesktop.org] On Behalf Of Tim Brody
Sent: Wednesday, May 11, 2011 2:58 AM
To: poppler at lists.freedesktop.org
Subject: Re: [poppler] [PATCH] Fixup LaTeX composed characters

On Tue, 10 May 2011 19:15:51 +0100, Albert Astals Cid <aacid at kde.org>
wrote:
> A Tuesday, May 10, 2011, Tim Brody va escriure:
>> > Sincerely, I am quite hesitant to apply your patch since it "breaks"
>> > pdftotext usage in the console (since it seems most of the apps in the
>> > console are not able to understand the non-composed form).
>> 

>> Anyway, my patch is only a fix-up of overprinting characters that would
>> otherwise get mangled by pdftotext. It just makes it more apparent that
>> your tool-chain is broken because it's producing more non-ASCII7
>> code-points.
> 
> By tool-chain you mean pdftotext?

I mean whatever you're piping to. I haven't encountered a problem with
decomposed Unicode in bash/less/vim.

>> I agree that pdftotext should by default output NFC but you need to
>> decide whether to implement NFC against the out-of-date poppler tables
>> or link to ICU.
> 
> I don't think linking to ICU (which last I checked is a huuuuuuuuuge
> monster, way bigger than poppler itself in size) is an option; otoh, why
> do you say poppler tables are out of date? Nobody has complained about
> something not working :D

Normalisation relies on the canonical character compositions, which come
from the Unicode tables. The poppler .h files are dated 2008 and there have
been two new Unicode versions since 2008 (assuming the tables used then
were current). I'm not saying they're broken, but that the Unicode tables
have changed and will keep changing.
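
As a rough illustration of that dependency (a minimal Python sketch, not
poppler code; the helper name compose is made up for this example), NFC
can only merge a base letter and a combining mark if the tables it is
built against list that pair as a canonical composition:

```python
import unicodedata

def compose(text: str) -> str:
    """Apply NFC using the Unicode tables bundled with this Python build."""
    return unicodedata.normalize("NFC", text)

# "e" followed by COMBINING ACUTE ACCENT (U+0301): two code points.
decomposed = "e\u0301"

# NFC looks the pair up in the canonical composition data and, if an
# entry exists, replaces it with the precomposed character (U+00E9).
composed = compose(decomposed)

# The result depends on which Unicode version the tables were built from.
print("Unicode tables version:", unicodedata.unidata_version)
print([hex(ord(c)) for c in decomposed], "->", [hex(ord(c)) for c in composed])
```

A normaliser built on older tables would simply leave alone any sequence
whose composition was only defined in a later table revision.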

Regardless, I will normalise the output from pdftotext to NFKC anyway - I
just need it to not mangle TeX-generated PDFs. I don't see this as
dependent on fixing pdftotext's normalisation.
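
For what it's worth, the post-processing step could look roughly like
this (a sketch only, assuming the pdftotext output has already been read
into a Python string; normalise_extracted_text is a hypothetical name):

```python
import unicodedata

def normalise_extracted_text(text: str) -> str:
    """Normalise text extracted from a PDF to NFKC.

    NFKC first applies compatibility decomposition (e.g. splitting
    ligatures) and then canonical composition (merging a base letter
    with a following combining mark where a precomposed form exists).
    """
    return unicodedata.normalize("NFKC", text)

# Sample standing in for extracted text: "o" + COMBINING DIAERESIS
# (U+0308), a space, and the LATIN SMALL LIGATURE FI (U+FB01).
sample = "o\u0308 \ufb01"
result = normalise_extracted_text(sample)
print(repr(result))
```

This only helps if the extractor emits the base letter and the combining
mark next to each other, which is what the patch is meant to ensure.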

-- 
All the best,
Tim.