[poppler] [PATCH] Fixup LaTeX composed characters

Tue May 10 02:50:40 PDT 2011

On Mon, 9 May 2011 19:52:38 +0100, Albert Astals Cid <aacid at kde.org> wrote:
> On Monday, May 09, 2011, Tim Brody wrote:
>> On Sat, 2011-05-07 at 19:02 +0100, Jonathan Kew wrote:
>> > On 7 May 2011, at 17:43, Albert Astals Cid wrote:
>> > > On Friday, April 01, 2011, Albert Astals Cid wrote:
>> > >> On Friday, 1 April 2011, Tim Brody wrote:
>> > >>> On Thu, 31 Mar 2011 23:28:02 +0100, Albert Astals Cid
>> > >>> <aacid at kde.org>
>> > >>> 
>> > >>> wrote:
>> > >>>> On Wednesday, 30 March 2011, you wrote:
>> > >>>>> On Tue, 2011-03-29 at 22:45 +0100, Albert Astals Cid wrote:
>> > >>>>>>>> I still get
>> > >>>>>>>> 
>> > >>>>>>>> -R. L¨wen and B. Polster
>> > >>>>>>>> -o
>> > >>>>>>>> +R. Lowen and B. Polster
>> > >>>>>>>> 
>> > >>>>>>>> Maybe you sent an old version of the patch? Can anyone confirm
>> > >>>>>>>> if
>> > >>>> 
>> > >>>> My bad, somehow vi/diff/less are showing me o but if i open it in
>> > >>>> kate i see
>> > >>>> an ö
>> > >>> 
>> > >>> That will be because it's separate characters (X + combining char).
>> > >>> You could normalise with unicodeNormalizeNFKC but I thought it
>> > >>> probably better to leave text - as far as possible - unchanged from
>> > >>> the PDF source.
>> > >> 
>> > >> Hmmmmmm, since we are already changing the "real" representation of
>> > >> the text (i.e transforming it from broken to not broken), i think i
>> > >> prefer one that is easy to use (i.e. shows ö in most of the tools),
>> > >> what do others think?
>> > > 
>> > > Since the others are not there, please do what i want and output a
>> > > real ö
>> > 
>> > If you're going to apply a Unicode normalization process, please use
>> > NFC rather than NFKC. This will deal with creating precomposed
>> > letter+accent combinations, but avoids introducing "compatibility"
>> > changes that may lose significant distinctions in the text.
>> 
>> For reference:
>> NFC = pre-composed
>> NFKC = pre-composed plus compatibility mappings (e.g. the 'fi' ligature => 'f'+'i')
>> 
>> I agree but there isn't an NFC in poppler. It seems a waste of time to
>> be writing one from scratch in Poppler or is there really no Unicode
>> library that provides normalisations?
> 
> Couldn't you have said that (we have no code to compose stuff) when I asked
> the list if we wanted composed or not?

poppler has an NFKC implementation, which is used by the internal
word-search. I haven't found any indication of which Unicode version its
tables correspond to.
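
To make the NFC/NFKC distinction discussed above concrete, here is a
minimal sketch using Python's standard unicodedata module. It is only an
illustration, not poppler code: the sample strings are made up, and the
version printed at the end is that of Python's own Unicode tables, not
poppler's.

```python
import unicodedata

# Sample inputs (made up for illustration):
decomposed = "Lo\u0308wen"   # 'o' followed by U+0308 COMBINING DIAERESIS
ligature = "\ufb01nd"        # U+FB01 LATIN SMALL LIGATURE FI, then "nd"

# NFC composes base letter + combining mark into one precomposed code point.
nfc = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(nfc))      # 6 5
print(nfc == "L\u00f6wen")            # True

# NFC leaves the ligature alone; NFKC also applies compatibility mappings.
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
print(nfc_ligature == ligature)       # True
print(nfkc_ligature)                  # find

# Version of the Unicode tables bundled with this Python build.
print(unicodedata.unidata_version)
```

The ligature case is the kind of "compatibility" change Jonathan is
warning about: after NFKC the distinction between the ligature and the
plain letters is gone.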

> Sincerely i am quite hesitant to apply your patch since it "breaks" pdftotext
> usage in the console (since it seems most of the apps in the console are not
> able to understand the non-composed form)

Your initial response to fixing-up LaTeX generated PDFs was "fix LaTeX" but
now you're saying we should make poppler work around broken shell tools?
:-)

Anyway, my patch is only a fix-up of overprinting characters that would
otherwise get mangled by pdftotext. It just makes it more apparent that
your tool-chain is broken, because pdftotext now produces more non-ASCII
code points.
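
As a rough illustration of the kind of fix-up meant here (this is NOT the
patch, just a toy Python sketch), the function below replaces a spacing
accent followed by a letter with the letter plus the matching combining
mark, and optionally composes the result with NFC. The function name, the
accent table, and the assumption that the accent arrives immediately
before its base letter are all hypothetical.

```python
import unicodedata

# Hypothetical table: spacing accent -> equivalent combining mark.
SPACING_TO_COMBINING = {
    "\u00a8": "\u0308",  # DIAERESIS    -> COMBINING DIAERESIS
    "\u00b4": "\u0301",  # ACUTE ACCENT -> COMBINING ACUTE ACCENT
}

def fixup_overprinted(text, compose=False):
    """Turn 'accent + letter' into 'letter + combining mark'.

    Assumes (for this sketch only) that the spacing accent directly
    precedes the letter it was overprinted on.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in SPACING_TO_COMBINING and nxt.isalpha():
            # Emit the base letter first, then the combining mark.
            out.append(nxt + SPACING_TO_COMBINING[ch])
            i += 2
        else:
            out.append(ch)
            i += 1
    result = "".join(out)
    # Optionally produce the precomposed (NFC) form.
    return unicodedata.normalize("NFC", result) if compose else result

raw = "R. L\u00a8owen and B. Polster"
print(fixup_overprinted(raw))                # base letter + combining mark
print(fixup_overprinted(raw, compose=True))  # precomposed form
```

Without compose=True the output keeps the separate combining character,
which is the form some console tools display as a plain "o".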

I agree that pdftotext should output NFC by default, but you need to decide
whether to implement NFC against the out-of-date poppler tables or link
to ICU.

-- 
All the best,
Tim.