<html>
<head>
<base href="https://bugs.documentfoundation.org/">
</head>
<body>
<p>
<div>
<b><a class="bz_bug_link
bz_status_RESOLVED bz_closed"
title="RESOLVED DUPLICATE - Text copied from a PDF exported using Linux Libertine G Graphite font is missing characters. (comment 24)"
href="https://bugs.documentfoundation.org/show_bug.cgi?id=124191#c37">Comment # 37</a>
on <a class="bz_bug_link
bz_status_RESOLVED bz_closed"
title="RESOLVED DUPLICATE - Text copied from a PDF exported using Linux Libertine G Graphite font is missing characters. (comment 24)"
href="https://bugs.documentfoundation.org/show_bug.cgi?id=124191">bug 124191</a>
from <span class="vcard"><a class="email" href="mailto:khaledhosny@eglug.org" title="Khaled Hosny (inactive) <khaledhosny@eglug.org>"> <span class="fn">Khaled Hosny (inactive)</span></a>
</span></b>
<pre>(In reply to V Stuart Foote from <a href="show_bug.cgi?id=124191#c31">comment #31</a>)
<span class="quote">> (In reply to Khaled Hosny (inactive) from <a href="show_bug.cgi?id=124191#c21">comment #21</a>)
> > > Here is result from a 6.2.1 build--note addition of the /ActualText
> > > structure, which helps with fidelity of pasted text. But that the
> > > LibreOffice generated /ToUnicode does look to have problems.
>
> > That is fine, it means there is no unique one to one, or one to many mapping
> > between these glyphs (not characters) and the input text, so no /ToUnicode
> > and /ActualText tagging is used for them.
>
> While things are much improved with HarfBuzz and moving the font handling
> into CommonSalLayout. But I'm still not sure this is correct, at least not
> in handling digraphs for the Graphite fonts.
>
> When LO exports to PDF the mapping of "The fire flying coffee left
> Quickly.", with Graphite font(s), the /ToUnicode stuct is getting an
> additional glyph added to the digraphs (both PUA and , and then is not
> mapping that glyph when it probably should.
>
> Use the below /ToUnicode chart with annotations, and read out the Tf[.*]TJ
> text runs (from LO 6.2.1) in <a href="show_bug.cgi?id=124191#c16">comment 16</a>
>
> <01> <005400680065> --> "The", but maybe should be just "Th"?
> x <02> -- "e" not mapped</span >
That is how the fonts are built:
$ hb-shape /usr/share/fonts/TTF/LinLibertine_R_G.ttf "The fire" --no-positions
[T_h=0|e=0|space=3|f_i=4|r=4|e=7]
The number after each glyph is the index of the character it belongs to in the
input string. Here both the <T_h> and <e> glyphs get index 0 and the next glyph,
<space>, gets index 3. For us this means that the first three characters,
“The”, make a two-glyph cluster, <T_h><e>, and we can’t tell which of the three
characters belongs to which glyph, so we bundle them as a single unit.
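
To illustrate the grouping described above, here is a minimal Python sketch.
It is not LibreOffice code: the function names parse_hb_shape and clusters are
made up for this example, and it assumes left-to-right text with increasing
cluster indices and the glyph=index format printed by hb-shape --no-positions.

```python
from itertools import groupby


def parse_hb_shape(line):
    """Parse `hb-shape --no-positions` output into (glyph_name, cluster) pairs."""
    pairs = []
    for item in line.strip().strip("[]").split("|"):
        # Split on the last "=" so glyph names containing "=" still parse.
        name, _, cluster = item.rpartition("=")
        pairs.append((name, int(cluster)))
    return pairs


def clusters(pairs, text):
    """Group consecutive glyphs that share a cluster index.

    Returns (glyph_names, characters) pairs, where `characters` is the slice
    of `text` from this cluster's index up to the next cluster's index.
    """
    groups = [(start, [name for name, _ in grp])
              for start, grp in groupby(pairs, key=lambda p: p[1])]
    result = []
    for i, (start, names) in enumerate(groups):
        end = groups[i + 1][0] if i + 1 != len(groups) else len(text)
        result.append((names, text[start:end]))
    return result


shaped = "[T_h=0|e=0|space=3|f_i=4|r=4|e=7]"
for names, chars in clusters(parse_hb_shape(shaped), "The fire"):
    print(names, repr(chars))
```

Any group holding more than one glyph is a cluster that cannot be split
further, which is the case for the first three characters here.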
Now, /ToUnicode allows only one-to-one and one-to-many mappings, but not the
many-to-many mapping that we need here, so we use an /ActualText tag.
For maximum compatibility with PDF readers not supporting /ActualText we also
add, as a last resort, a /ToUnicode entry for the first glyph, <T_h>, mapping
it to the three characters and skip the second glyph <e>. This is not ideal,
but at least one gets some text (and spurious chars for the unmapped glyphs) on
such readers.
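
The fallback described above can be sketched roughly as follows. This is only
an illustration under simplifying assumptions, not the actual export code: the
name plan_mappings is hypothetical, glyphs are identified by name rather than
glyph ID, and a glyph that shows up in several clusters simply keeps the first
mapping it was given.

```python
def plan_mappings(clusters):
    """Split clusters into /ToUnicode entries, /ActualText spans, and skipped glyphs.

    `clusters` is a list of (glyph_names, characters) pairs, one per cluster.
    """
    to_unicode = {}   # glyph -> characters (one-to-one or one-to-many)
    actual_text = []  # (glyphs, characters) spans that need /ActualText
    skipped = []      # glyphs left without a /ToUnicode entry
    for glyphs, chars in clusters:
        # The first glyph of every cluster carries the cluster's whole text.
        to_unicode.setdefault(glyphs[0], chars)
        if len(glyphs) != 1:
            # Many-to-many: tag the span, leave the remaining glyphs unmapped.
            actual_text.append((glyphs, chars))
            skipped.extend(glyphs[1:])
    return to_unicode, actual_text, skipped


example = [(["T_h", "e"], "The"), (["space"], " ")]
to_unicode, actual_text, skipped = plan_mappings(example)
print(to_unicode)
print(actual_text)
print(skipped)
```

With this input the T_h glyph is mapped to all three characters, the span is
also recorded for /ActualText, and the e glyph ends up in the skipped list.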
So basically it is a faulty font: that <e> should have gotten index 2, not 0,
and we are doing our best to accommodate the limitations of the PDF format and
PDF readers.</pre>
</div>
</p>
</body>
</html>