[Libreoffice-bugs] [Bug 89471] FILEOPEN pdf: When opening a PDF with RTL language text in Draw, text gets mirrored

bugzilla-daemon at bugs.documentfoundation.org bugzilla-daemon at bugs.documentfoundation.org
Wed Jun 19 02:33:19 UTC 2019


https://bugs.documentfoundation.org/show_bug.cgi?id=89471

--- Comment #23 from V Stuart Foote <vstuart.foote at utsa.edu> ---
(In reply to Eyal Rozenberg from comment #21)
> Oh, no no no!
> 
> We seem to have a huge misunderstanding with respect to this bug.
> 
Eyal, *

I will state again and quite clearly--LibreOffice is _NOT_ a PDF editor!

We can read a PDF as a source document, opening it into Writer, Impress, Calc,
or Draw. We can filter export to PDF from any document--but that would
overwrite/replace any source PDF, and only as the OS/DE allows.

We do not edit the PDF stream.

We do not edit any of the PDF objects.

ALL we do is read and filter import the PDF stream. 

We do not write back to the original source document and must swap in a
reconstructed PDF stream with any changes.

Either of our two import filters, pdfium based or poppler based, keeps a copy
of the PDF source file, but always converts its contents for manipulation on
the LibreOffice canvas. We do not work directly on the "original"; we do not
"edit" it!

That said, in practice the Poppler based import filter parses the object
streams from PDF and converts them into corresponding LibreOffice Draw
objects--Text boxes, Shapes, meta images, etc. Fidelity between the original
PDF objects and the import filter result varies depending on the object type
and on whether the corresponding Draw object supports a given attribute--clipping
masks, for example (bug 86211).

The pdfium based import filter is configured to render content of the PDF as a
bitmap image with high fidelity to the document layout published in the PDF.
Currently it only handles the first page of a PDF 'inserted as image', with the
bitmap resolution set at just 96 dpi.

The issue here is that on filter import of the PDF--the object stream holding
text runs is added to a Draw text box. Within the source PDF, some original
text will be broken into multiple text runs in multiple text objects.

The text stream is sequenced as entered, RTL, but when the filter import writes
it out to the Text box the run is written LTR--with no handling of the text run
of glyphs as RTL, or IIRC for more complex composite scripts.
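
To make the mirroring concrete, here is a minimal sketch in Python. It is not
the import filter's code and not how LibreOffice or ICU handle this. It assumes
a run arrives in visual (left-to-right glyph) order, and the helper names
run_direction and visual_to_logical are hypothetical. It handles only a run
made up purely of RTL characters; mixed-direction text needs the full Unicode
Bidirectional Algorithm.

```python
import unicodedata

# Strong right-to-left bidi classes: "R" (e.g. Hebrew) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}
LTR_CLASSES = {"L"}


def run_direction(text: str) -> str:
    """Return 'rtl', 'ltr', or 'neutral' from the first strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            return "rtl"
        if bidi_class in LTR_CLASSES:
            return "ltr"
    return "neutral"


def visual_to_logical(run: str) -> str:
    """Hypothetical helper: un-mirror a run stored in visual order.

    Assumption: the run holds only RTL characters. Combining marks and
    mixed LTR/RTL content are not handled by a plain reversal.
    """
    if run_direction(run) == "rtl":
        return run[::-1]
    return run


# Hebrew "shalom" in logical order: shin, lamed, vav, final mem.
logical = "\u05e9\u05dc\u05d5\u05dd"
# The same glyphs as they sit left to right on the page (visual order).
visual = logical[::-1]

print(run_direction(visual))
print(visual_to_logical(visual) == logical)
```

A run detected as LTR or neutral is passed through unchanged, so Latin text and
bare digits are not touched by this sketch.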

LibreOffice makes extensive use of the ICU project
(https://en.wikipedia.org/wiki/International_Components_for_Unicode) for script
recognition and transliteration. But it would seem text runs for non-western
scripts are not being supported--and we may not be using the ICU Unicode text
handling that is needed.

You'll note the pdfium filter (bug 89727) correctly handles the Hebrew and
Arabic text of the sample documents attached here. But lest you think that is
the solution for better fidelity and potential for "editing" PDF: as with the
poppler based import filter, selecting the graphic object and 'breaking' out
its PDF stream objects gives results that are not rendered well to the document
canvas--either losing the Unicode glyph, or getting incorrect font fallback (or
a mix).

As Khaled said--PDF is not a format intended to be edited. And, LibreOffice is
not a PDF editor. But we are mishandling RTL text runs and that needs to be
investigated.
