[Poppler-bugs] [Bug 87215] evince can not find ü in attached PDF

bugzilla-daemon at freedesktop.org
Thu Mar 19 21:30:08 PDT 2015


https://bugs.freedesktop.org/show_bug.cgi?id=87215

Jason Crain <jason at aquaticape.us> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #113036|0                           |1
        is obsolete|                            |

--- Comment #10 from Jason Crain <jason at aquaticape.us> ---
Created attachment 114485
  --> https://bugs.freedesktop.org/attachment.cgi?id=114485&action=edit
Combine base characters and diacritical marks

My attempt to improve this.

When you make a diacriticized character with LaTeX, ü for example, it will make
a PDF with separate u and ¨ characters and draw them over each other.  This
patch detects when this happens and converts the pair to a combining character
sequence so that pdftotext and the search function will see a ü and not
separate characters.  It also refactors some code (TextWord::ensureCapacity and
TextWord::setInitialBounds) to avoid duplication.
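
As a rough illustration of the idea (not the patch's code, which is C++ inside
poppler), here is a minimal Python sketch.  The mapping table, the function
names, the (x_min, x_max) box representation and the 50% overlap threshold are
all assumptions made up for this example.

```python
import unicodedata

# Hypothetical table: spacing diacritic -> combining equivalent.
# Illustrative only; not the table used by the patch.
SPACING_TO_COMBINING = {
    "\u00a8": "\u0308",  # DIAERESIS -> COMBINING DIAERESIS
    "\u00b4": "\u0301",  # ACUTE ACCENT -> COMBINING ACUTE ACCENT
    "\u0060": "\u0300",  # GRAVE ACCENT -> COMBINING GRAVE ACCENT
    "\u02c6": "\u0302",  # MODIFIER LETTER CIRCUMFLEX ACCENT -> COMBINING CIRCUMFLEX ACCENT
    "\u02dc": "\u0303",  # SMALL TILDE -> COMBINING TILDE
}


def overlaps_horizontally(base_box, mark_box, min_fraction=0.5):
    """Return True if at least min_fraction of the mark's width lies over the base.

    Boxes are (x_min, x_max) tuples; the threshold is an assumed value.
    """
    base_min, base_max = base_box
    mark_min, mark_max = mark_box
    mark_width = mark_max - mark_min
    overlap = min(base_max, mark_max) - max(base_min, mark_min)
    return mark_width > 0 and overlap >= min_fraction * mark_width


def combine(base, mark, base_box, mark_box):
    """Return base + combining mark if the mark is drawn over the base.

    Otherwise the two characters are returned unchanged, side by side.
    """
    combining = SPACING_TO_COMBINING.get(mark)
    if combining is not None and overlaps_horizontally(base_box, mark_box):
        return base + combining
    return base + mark


# A diaeresis drawn over a "u" becomes the sequence U+0075 U+0308 ...
sequence = combine("u", "\u00a8", (0.0, 5.0), (1.0, 4.0))
# ... which is canonically equivalent to the precomposed U+00FC.
is_u_umlaut = unicodedata.normalize("NFC", sequence) == "\u00fc"
```

The result is a combining sequence, so a consumer that compares text after
Unicode normalization will treat it as equal to a precomposed ü.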

Limitations:

It doesn't handle some of LaTeX's diacritic commands, such as \b for bar under
letter or \d for dot under letter, because they are positioned differently and
\d would be easy to confuse with a period.  They don't seem to be used very
often though.

If the base character is unusual, such as a math symbol or number, adding a
combining character can make the result of pdftotext look a bit odd.  I think
this is because if the font or rendering engine doesn't know how to draw the
character sequence, it will place the diacritic in a strange position, such as
to the right of the letter.  In these cases, the output of pdftotext is
technically correct; it just looks odd when drawn on screen.
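
One way to see the difference between a common and an unusual base character,
sketched in Python with the standard unicodedata module (the helper name
compose is made up for this example): Unicode has a precomposed code point for
u plus diaeresis, but none for a digit plus diaeresis, so the latter always
stays a two-code-point sequence that the renderer has to position itself.

```python
import unicodedata


def compose(base, combining_mark):
    """Apply NFC normalization to a base character plus a combining mark."""
    return unicodedata.normalize("NFC", base + combining_mark)


COMBINING_DIAERESIS = "\u0308"

with_letter = compose("u", COMBINING_DIAERESIS)  # precomposed form exists
with_digit = compose("2", COMBINING_DIAERESIS)   # no precomposed form exists

print(len(with_letter))  # 1
print(len(with_digit))   # 2
```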

When selecting text in evince, you can select the character and diacritic
separately.  If that's a problem, I think I could fix it by adding clustering
support so that a group of glyphs and characters is treated as a single unit.
It would make this a much more invasive change, but maybe I should try it
anyway.  It would also be nice to fix the assumption that one glyph always
corresponds to exactly one character.
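
To make the clustering idea concrete, here is a simplified Python sketch (the
function name clusters is hypothetical, and this is not how poppler stores
text).  It attaches every combining mark to the preceding character so that
the pair can be treated as one selectable unit.  Full grapheme cluster
segmentation as defined by Unicode handles more cases than this.

```python
import unicodedata


def clusters(text):
    """Group each base character with the combining marks that follow it.

    A character counts as a combining mark here if its canonical combining
    class is non-zero; this is a simplification of real grapheme clustering.
    """
    result = []
    for ch in text:
        if result and unicodedata.combining(ch) != 0:
            result[-1] += ch  # attach the mark to the previous cluster
        else:
            result.append(ch)  # start a new cluster
    return result


# "für" written with a combining diaeresis: 4 code points, 3 clusters.
word = "fu\u0308r"
units = clusters(word)
```

With this grouping, a selection would operate on the three entries of units
rather than on the four code points of word.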
