[poppler] Characters with accents not correctly handled

Laurent Aguerreche laurent.aguerreche at free.fr
Tue Aug 21 13:47:45 PDT 2007


On Tuesday 21 August 2007 at 22:30 +0200, Albert Astals Cid wrote:
> On Monday 20 August 2007, Carl Worth wrote:
> > On Sun, 19 Aug 2007 22:46:16 +0200, Laurent Aguerreche wrote:
> > > But the real problem is that it is impossible to recognize:
> > > - "ﬁ" as "fi" too
> > > - "ﬀ" as "ff" too
> > > Would it be possible to add a new parameter to pdftotext to make it
> > > ignore ligatures but still export in UTF-8?
> >
> > It's quite preferable to have the ligatures in your PDF file.
> >
> > The bug to fix is that poppler should expand the ligatures to their
> > normalized forms when extracting the text.
> 
> Actually I disagree: if you have æ, do you want it expanded to ae too?
> If not, why do you want that for the ff ligature?

I think there are two cases here:
- "ff" is composed of two characters that are joined (= ligature) only
when displayed. When written by hand, it is "ff";
- "æ" is always written as "a" joined with "e".

(Actually, I do not know which language you have in mind as an example,
but I know the case of the word "cœur" (= heart) in French: writing it
"coeur" is always wrong.)
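
To make the two cases concrete, here is a minimal sketch in Python (not
poppler code, and not a proposal for how pdftotext should implement it).
It assumes the ligatures in question are the Unicode presentation forms
U+FB00 to U+FB06 (ff, fi, fl, ffi, ffl, long s-t, st), which carry a
compatibility decomposition, while "æ" and "œ" are ordinary letters with
no decomposition. The function name expand_ligatures is hypothetical.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand display-only Latin ligatures; leave letters such as æ and œ alone."""
    out = []
    for ch in text:
        # Assumption: only the Latin ligatures in the Alphabetic
        # Presentation Forms block (U+FB00..U+FB06) are display-only.
        if "\ufb00" <= ch <= "\ufb06":
            # Compatibility decomposition turns the single ligature code
            # point into its component letters.
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)


print(expand_ligatures("\ufb01le, o\ufb00er, c\u0153ur, \u00e6"))
```

Restricting the expansion to that range is a deliberate choice in this
sketch: applying compatibility normalization to the whole string would
also rewrite unrelated characters such as superscripts and full-width
forms.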


Laurent.

> Albert
> 
> >
> > That bug was first reported here:
> >
> > 	Text extraction should expand ligatures to their normal form
> > 	https://bugs.freedesktop.org/show_bug.cgi?id=7002
> >
> > -Carl
> 