[poppler] [PATCH] Fixup LaTeX composed characters
Tim Brody
tdb2 at ecs.soton.ac.uk
Tue May 10 02:50:40 PDT 2011
On Mon, 9 May 2011 19:52:38 +0100, Albert Astals Cid <aacid at kde.org> wrote:
> On Monday, May 09, 2011, Tim Brody wrote:
>> On Sat, 2011-05-07 at 19:02 +0100, Jonathan Kew wrote:
>> > On 7 May 2011, at 17:43, Albert Astals Cid wrote:
>> > > On Friday, April 01, 2011, Albert Astals Cid wrote:
>> > >> On Friday, 1 April 2011, Tim Brody wrote:
>> > >>> On Thu, 31 Mar 2011 23:28:02 +0100, Albert Astals Cid
>> > >>> <aacid at kde.org> wrote:
>> > >>>> On Wednesday, 30 March 2011, you wrote:
>> > >>>>> On Tue, 2011-03-29 at 22:45 +0100, Albert Astals Cid wrote:
>> > >>>>>>>> I still get
>> > >>>>>>>>
>> > >>>>>>>> -R. L¨wen and B. Polster
>> > >>>>>>>> -o
>> > >>>>>>>> +R. Lowen and B. Polster
>> > >>>>>>>>
>> > >>>>>>>> Maybe you sent a old version of the patch? Can anyone confirm
>> > >>>>>>>> if
>> > >>>>
>> > >>>> My bad, somehow vi/diff/less are showing me o but if i open it in
>> > >>>> kate i see
>> > >>>> an ö
>> > >>>
>> > >>> That will be because it's separate characters (X + combining char).
>> > >>> You could normalise with unicodeNormalizeNFKC but I thought it
>> > >>> probably better to leave text - as far as possible - unchanged from
>> > >>> the PDF source.
>> > >>
>> > >> Hmmmmmm, since we are already changing the "real" representation of
>> > >> the text (i.e. transforming it from broken to not broken), i think i
>> > >> prefer one that is easy to use (i.e. shows ö in most of the tools),
>> > >> what do others think?
>> > >
>> > > Since the others are not there, please do what i want and output a
>> > > real ö
>> >
>> > If you're going to apply a Unicode normalization process, please use
>> > NFC rather than NFKC. This will deal with creating precomposed
>> > letter+accent combinations, but avoids introducing "compatibility"
>> > changes that may lose significant distinctions in the text.
>>
>> For reference:
>> NFC = pre-composed
>> NFKC = pre-composed plus simplified ligatures ('fi' => 'f'+'i')
>>
>> I agree but there isn't an NFC in poppler. It seems a waste of time to
>> be writing one from scratch in Poppler or is there really no Unicode
>> library that provides normalisations?
>
> Couldn't you have said that (we have no code to compose stuff) when I asked
> the list if we wanted composed or not?

poppler has an NFKC implementation, which is used by the internal
word-search. I haven't found any indication of which Unicode version its
tables correspond to.

> Sincerely i am quite hesitant to apply your patch since it "breaks"
> pdftotext usage in the console (since it seems most of the apps in the
> console are not able to understand the non-composed form)

Your initial response to fixing up LaTeX-generated PDFs was "fix LaTeX" but
now you're saying we should make poppler work around broken shell tools?
:-)

Anyway, my patch is only a fix-up of overprinting characters that would
otherwise get mangled by pdftotext. It just makes it more apparent that
your tool-chain is broken because it's producing more non-ASCII
code points.
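
To illustrate the kind of fix-up meant here, below is a minimal sketch in
Python. It is not the patch itself (which is C++ inside poppler); the
table SPACING_TO_COMBINING and the function fixup_overprint are
hypothetical names, and the set of accents covered is an assumption. It
only shows the idea of turning a spacing accent that LaTeX overprints on a
letter into "base letter + combining char".

```python
# Hypothetical table (an assumption, not taken from the patch):
# spacing accent as found in the PDF -> equivalent combining mark.
SPACING_TO_COMBINING = {
    "\u00a8": "\u0308",  # DIAERESIS -> COMBINING DIAERESIS
    "\u00b4": "\u0301",  # ACUTE ACCENT -> COMBINING ACUTE ACCENT
    "\u0060": "\u0300",  # GRAVE ACCENT -> COMBINING GRAVE ACCENT
    "\u02c6": "\u0302",  # MODIFIER LETTER CIRCUMFLEX -> COMBINING CIRCUMFLEX
}


def fixup_overprint(accent, base):
    """Return base + combining mark when accent is a known spacing accent.

    Unknown accents are left untouched, in their original order.
    """
    combining = SPACING_TO_COMBINING.get(accent)
    if combining is None:
        return accent + base
    return base + combining


# "L", then a diaeresis overprinted on "o", then "wen".
fixed = "L" + fixup_overprint("\u00a8", "o") + "wen"
print([f"U+{ord(ch):04X}" for ch in fixed])
```

The result is still six code points (o followed by U+0308), which is why
tools that don't handle combining characters display a bare "o".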

I agree that pdftotext should output NFC by default, but you need to decide
whether to implement NFC against the out-of-date poppler tables or link
to ICU.
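
For reference, the difference between the two forms can be seen with
Python's standard unicodedata module. This is only an illustration of NFC
versus NFKC as discussed in this thread, not poppler code; the helper
describe is a made-up name.

```python
import unicodedata


def describe(text):
    """List the code points of text, e.g. 'U+004C U+006F'."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# 'o' followed by U+0308 COMBINING DIAERESIS (the non-composed form).
decomposed = "Lo\u0308wen"
nfc = unicodedata.normalize("NFC", decomposed)

# U+FB01 LATIN SMALL LIGATURE FI, a "compatibility" character.
ligature = "\ufb01"
lig_nfc = unicodedata.normalize("NFC", ligature)
lig_nfkc = unicodedata.normalize("NFKC", ligature)

print(describe(decomposed))  # U+004C U+006F U+0308 U+0077 U+0065 U+006E
print(describe(nfc))         # U+004C U+00F6 U+0077 U+0065 U+006E
print(describe(lig_nfc))     # U+FB01 (ligature kept)
print(describe(lig_nfkc))    # U+0066 U+0069 (ligature split into f + i)
```

NFC composes o + U+0308 into a single ö but leaves the ligature alone;
NFKC also rewrites the ligature, which is the loss of distinction Jonathan
warned about.
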
--
All the best,
Tim.