[Poppler-bugs] [Bug 66693] Greek support package - some characters output as symbols not letters

Wed Sep 18 03:00:50 PDT 2013

https://bugs.freedesktop.org/show_bug.cgi?id=66693

--- Comment #20 from Govert <noliturbarecom at gmail.com> ---
My apologies for this very late reaction. I just found this message (“Comment #
19” below) in my “unwanted mail” folder.

I am not a specialist in this field, I’m not even Greek... The problem that I
experienced is not about symbols being output as other symbols or even the need
for that, it’s about certain letters in the PDF being output as symbols in the
text.

Example: a PDF containing text with a word that is displayed as ΑΓΝΩΣΤΗ in
Adobe Reader (copied/pasted here from the Reader) is output as ΑΓΝΩΣΤΗ by
pdftotext (copied/pasted here from the text output). Those words look equal,
but they are not because the original Ω and the Ω in the text output are
different. The Ω in the text output is the symbol omega (as used a.o. in
electronics), not the letter Ω. Searching for “ΑΓΝΩΣΤΗ” in the text output
finds nothing.

I am using a (stupid?) workaround for the time being: I convert all symbols
'µ', '∆' and 'Ω' in the text output to the letters 'μ', 'Δ' and 'Ω' before
starting the search.
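
The workaround described above could look roughly like the following Python
sketch. The helper name and the exact code points are assumptions: the message
does not say which code points pdftotext emits, so the sketch assumes MICRO SIGN
(U+00B5), INCREMENT (U+2206) and OHM SIGN (U+2126). It is post-processing of
the text output, not Poppler code.

```python
# Hypothetical helper, not part of pdftotext or Poppler.
# Code points are written as escapes because the glyphs look identical.
# Assumed mapping (the original message does not name the code points):
SYMBOL_TO_LETTER = str.maketrans({
    "\u00b5": "\u03bc",  # MICRO SIGN -> GREEK SMALL LETTER MU
    "\u2206": "\u0394",  # INCREMENT  -> GREEK CAPITAL LETTER DELTA
    "\u2126": "\u03a9",  # OHM SIGN   -> GREEK CAPITAL LETTER OMEGA
})


def symbols_to_letters(text: str) -> str:
    """Replace the three look-alike symbols with the Greek letters."""
    return text.translate(SYMBOL_TO_LETTER)


# The example word, once with OHM SIGN and once with the letter omega.
extracted = "\u0391\u0393\u039d\u2126\u03a3\u03a4\u0397"
query = "\u0391\u0393\u039d\u03a9\u03a3\u03a4\u0397"

print(query in extracted)                      # False
print(query in symbols_to_letters(extracted))  # True
```

Python's unicodedata.normalize("NFKC", text) would also map U+00B5 and U+2126
to the letters, but U+2206 has no decomposition, so an explicit table is still
needed for the delta case.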

From: bugzilla-daemon at freedesktop.org 
Sent: Sunday, August 25, 2013 9:58 PM
To: noliturbarecom at gmail.com 
Subject: [Bug 66693] Greek support package - some characters output as symbols
not letters

Comment # 19 on bug 66693 from Albert Astals Cid 
To be honest, I don't see why pdftotext should output a symbol as another
symbol, unless it's obvious that the first symbol is there *exclusively* for
typographical reasons, like the "fl" and "fi" ligatures.

OTOH, if the code is not a lot to maintain, I would not be opposed to adding a
non-default option that does that conversion.

About searching: yes, I agree that if you search for Symbol1 and what's in the
pdf is Symbol2 (which is "technically" the same thing), it would make sense
for the search algorithm to try to match it, but I would still want the
"getPageText()" methods to give me Symbol2 (i.e. what was really in the pdf
file).
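
As an illustration of that distinction only, here is a minimal Python sketch
of a search that folds look-alike symbols for comparison but reports the text
exactly as extracted. It is not Poppler's search code; the function name, the
folding table and the assumed code points (U+00B5, U+2206, U+2126) are
hypothetical.

```python
# Hypothetical sketch, not Poppler's implementation.
# Each entry maps one code point to one code point, so string indexes in the
# folded text line up with indexes in the original text.
FOLD = str.maketrans({
    "\u00b5": "\u03bc",  # MICRO SIGN -> GREEK SMALL LETTER MU
    "\u2206": "\u0394",  # INCREMENT  -> GREEK CAPITAL LETTER DELTA
    "\u2126": "\u03a9",  # OHM SIGN   -> GREEK CAPITAL LETTER OMEGA
})


def find_folded(page_text: str, query: str):
    """Return (index, original_slice) for the first match, or None.

    Both strings are folded for the comparison only; the returned slice is
    taken from the untouched page text.
    """
    index = page_text.translate(FOLD).find(query.translate(FOLD))
    if index < 0:
        return None
    return index, page_text[index:index + len(query)]


page_text = "xx\u0391\u0393\u039d\u2126\u03a3\u03a4\u0397"  # contains OHM SIGN
query = "\u0391\u0393\u039d\u03a9\u03a3\u03a4\u0397"        # letter omega

hit = find_folded(page_text, query)
print(hit is not None)     # True: the search matches
print("\u2126" in hit[1])  # True: the reported text still has OHM SIGN
```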

So as far as I can see, there are two things happening in this bug:
 a) pdftotext doing conversion of some symbols to others
 b) search handling symbol mappings

Am I right in the analysis?

Now my question: how much is a) related to b)? Can they be handled in different
bugs, or does it make more sense to handle them together here?
--------------------------------------------------------------------------------
You are receiving this mail because: 
  a.. You reported the bug.
