[Libreoffice-bugs] [Bug 124191] Text copied from a PDF exported using Linux Libertine G Graphite font is missing characters. (comment 24)

bugzilla-daemon at bugs.documentfoundation.org bugzilla-daemon at bugs.documentfoundation.org
Fri Mar 22 19:18:26 UTC 2019


https://bugs.documentfoundation.org/show_bug.cgi?id=124191

--- Comment #37 from Khaled Hosny (inactive) <khaledhosny at eglug.org> ---
(In reply to V Stuart Foote from comment #31)
> (In reply to Khaled Hosny (inactive) from comment #21)
> > > Here is result from a 6.2.1 build--note addition of the /ActualText
> > > structure, which helps with fidelity of pasted text. But that the
> > > LibreOffice generated /ToUnicode does look to have problems.
> 
> > That is fine, it means there is no unique one to one, or one to many mapping
> > between these glyphs (not characters) and the input text, so no /ToUnicode
> > and /ActualText tagging is used for them.
> 
> While things are much improved with HarfBuzz and moving the font handling
> into CommonSalLayout. But I'm still not sure this is correct, at least not
> in handling digraphs for the Graphite fonts. 
> 
> When LO exports to PDF the mapping of "The fire flying coffee left
> Quickly.", with Graphite font(s), the /ToUnicode stuct is getting an
> additional glyph added to the digraphs (both PUA and , and then is not
> mapping that glyph when it probably should.
> 
> Use the below /ToUnicode chart with annotations, and read out the Tf[.*]TJ
> text runs (from LO 6.2.1) in comment 16
> 
> <01> <005400680065>  --> "The", but maybe should be just "Th"?
> x <02> -- "e" not mapped

That is how the fonts are built:
$ hb-shape /usr/share/fonts/TTF/LinLibertine_R_G.ttf "The fire" --no-positions
[T_h=0|e=0|space=3|f_i=4|r=4|e=7]

The number after each glyph is the index of the character it belongs to in the
input string. Here both the <T_h> and <e> glyphs get index 0, and the next glyph,
<space>, gets index 3. For us this means that the first three characters,
“The”, form a two-glyph cluster, <T_h><e>. We can’t tell which of the three
characters belongs to which glyph, so we bundle them as a single unit.
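
To make the cluster reading concrete, here is a rough Python sketch (not
LibreOffice code) that parses the hb-shape output shown above and groups
glyphs by cluster index. The helper names parse_hb_shape and group_clusters
are made up for illustration, and the sketch assumes left-to-right text,
non-decreasing cluster values, and indices counted in characters.

```python
from itertools import groupby


def parse_hb_shape(output):
    """Parse '[T_h=0|e=0|...]' (hb-shape --no-positions) into (glyph, cluster) pairs."""
    items = output.strip().strip("[]").split("|")
    glyphs = []
    for item in items:
        # Split on the last '=' so the glyph name is kept whole.
        name, _, cluster = item.rpartition("=")
        glyphs.append((name, int(cluster)))
    return glyphs


def group_clusters(glyphs, text):
    """Group consecutive glyphs sharing a cluster index with the characters they cover.

    Assumption: cluster values never decrease, so a cluster's characters run
    from its own index up to the next cluster's index (or the end of the text).
    """
    groups = [(cluster, [name for name, _ in run])
              for cluster, run in groupby(glyphs, key=lambda pair: pair[1])]
    result = []
    for i, (start, names) in enumerate(groups):
        end = groups[i + 1][0] if i + 1 < len(groups) else len(text)
        result.append((names, text[start:end]))
    return result


clusters = group_clusters(
    parse_hb_shape("[T_h=0|e=0|space=3|f_i=4|r=4|e=7]"), "The fire")
for names, chars in clusters:
    print(names, repr(chars))
```

The first group pairs the two glyphs <T_h><e> with the three characters
"The", which is exactly the many-to-many case described above.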

Now, /ToUnicode allows only one-to-one and one-to-many mappings, not the
many-to-many mapping that we need here, so we use an /ActualText tag.

For maximum compatibility with PDF readers that do not support /ActualText, we
also add, as a last resort, a /ToUnicode entry for the first glyph, <T_h>,
mapping it to the three characters, and skip the second glyph, <e>. This is not
ideal, but at least one gets some text (and spurious chars for the unmapped
glyphs) on such readers.
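
The decision described above can be sketched roughly as follows. This is an
illustration only, not the actual LibreOffice implementation: the function
name plan_cluster and the dictionary layout are invented here, and the sketch
works per glyph occurrence, whereas a real /ToUnicode CMap is keyed by glyph
ID within the embedded font.

```python
def plan_cluster(glyphs, chars):
    """Decide how one cluster's text could be recorded in the PDF.

    glyphs: glyph names in the cluster; chars: the input characters it covers.
    Returns a dict with:
      'actual_text': text for an /ActualText span, or None if not needed
      'to_unicode':  list of (glyph, mapped_text_or_None) entries
    """
    if len(glyphs) == 1:
        # One glyph to one or more characters: /ToUnicode can express this.
        return {"actual_text": None, "to_unicode": [(glyphs[0], chars)]}
    # Several glyphs share the characters (many-to-many): use /ActualText,
    # plus a last-resort /ToUnicode entry on the first glyph only.
    fallback = [(glyphs[0], chars)] + [(g, None) for g in glyphs[1:]]
    return {"actual_text": chars, "to_unicode": fallback}


print(plan_cluster(["T_h", "e"], "The"))
print(plan_cluster(["space"], " "))
```

For the <T_h><e> cluster this yields an /ActualText of "The", a fallback
mapping of <T_h> to "The", and no mapping for <e>, matching the annotated
chart quoted above.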

So basically it is a faulty font: <e> should have gotten index 2, not 0,
and we are doing our best to accommodate the limitations of the PDF format and
PDF readers.
