[Libreoffice-bugs] [Bug 119606] New: PDF: Arabic text gets deformed when creating a PDF in LibreOffice Writer
bugzilla-daemon at bugs.documentfoundation.org
Thu Aug 30 12:25:49 UTC 2018
https://bugs.documentfoundation.org/show_bug.cgi?id=119606
Bug ID: 119606
Summary: PDF: Arabic text gets deformed when creating a PDF in
LibreOffice Writer
Product: LibreOffice
Version: 5.4.6.2 release
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: UNCONFIRMED
Severity: normal
Priority: medium
Component: Printing and PDF export
Assignee: libreoffice-bugs at lists.freedesktop.org
Reporter: vaaydayaasra at gmail.com
Description:
Exporting a document written in the Arabic script to PDF garbles the document's
textual content, although the PDF looks fine on screen.
For example, see the attached PDF created with Writer 5.4.6.2 on Ubuntu 17.10.
The example sentence "اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ" is rendered
correctly, but in any PDF reader, such as evince, copying the text deforms most
of the words. Some characters are clearly visible but cannot be selected or
searched (such as ى at the end of the first word اشترى).
If I search for the second word بلال, evince tells me there are no matches in
the document. The same happens when converting the file with pdftotext, which
produces the following output:
اشتر
للا مسة لفا كتاب وَأَنَا ْ
ه
اشت َ َريْتُهَا ِ
من ْ ُ
Here only two of the eight words are intact; the rest are garbled in one way or
another. If the text is in Latin script, both evince and pdftotext behave as
expected, meaning that the textual content is transferred correctly from Writer
to the PDF.
On LO 6.0.3.2 on Ubuntu 18.04, the textual content is preserved a little better
but it is still quite garbled. This is the output from pdftotext:
ه
اشترى للا خمسة آفا كتاب وَأنَا اشْ ت َ َريْتُهَا ِ
من ْ ُ
Here four of the eight words are intact. The last word of the sentence, for
example, is split so that its last full character ends up on the first line and
the rest on the third line. Some diacritics appear where they are supposed to
be; others do not.
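The counts of intact words quoted for the two LibreOffice versions can be checked mechanically. The following is a minimal sketch, not part of the original report: the helper name `intact_words` is hypothetical, and it assumes that "intact" means the word appears as a whole whitespace-separated token in the pdftotext output, code point for code point.

```python
def intact_words(original, extracted):
    """Return the words of `original` that survive unchanged in `extracted`.

    A word counts as intact only if it appears as a complete
    whitespace-separated token in the extracted text.
    """
    extracted_tokens = set(extracted.split())
    return [word for word in original.split() if word in extracted_tokens]


if __name__ == "__main__":
    # Two-word demonstration using escaped code points to avoid
    # any ambiguity from bidirectional text in the source file.
    original = "\u0643\u062a\u0627\u0628 \u0628\u0644\u0627\u0644"
    extracted = "\u0644\u0644\u0627 \u0643\u062a\u0627\u0628"
    print(len(intact_words(original, extracted)), "of",
          len(original.split()), "words intact")
```

To apply it to this report, pass the example sentence as `original` and the pasted pdftotext output as `extracted`.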
MS Word 2007 handles this case better, although it's not perfect either. This
is the output from pdftotext:
اشترى بالل خمسة آالف كتاب وأنا اشتريتها منه
Here all diacritics are dropped and all sequences of ل (U+0644) + ا (U+0627)
are reversed, turning لا into ال. Otherwise the sentence is intact.
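The two changes seen in the MS Word output can be expressed as a small transformation, which makes it easy to confirm that nothing else differs. This is only a sketch of the observed effect, not of how Word or pdftotext work internally; the function names are hypothetical, and "diacritics" is assumed to mean code points in Unicode category Mn (nonspacing marks).

```python
import unicodedata

LAM = "\u0644"   # ARABIC LETTER LAM
ALEF = "\u0627"  # ARABIC LETTER ALEF


def strip_marks(text):
    """Drop nonspacing marks (category Mn), e.g. fatha, kasra, damma, sukun.

    No normalization is applied, so precomposed letters such as
    U+0623 (alef with hamza above) are left untouched.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def swap_lam_alef(text):
    """Reverse every lam + alef pair into alef + lam."""
    return text.replace(LAM + ALEF, ALEF + LAM)


def simulate_observed_output(text):
    """Apply both observed changes: marks removed first, then pairs swapped."""
    return swap_lam_alef(strip_marks(text))
```

If `simulate_observed_output` applied to the example sentence equals the pdftotext output quoted above, the description "otherwise intact" holds.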
This bug was first reported on Launchpad for LO 5.4.6.2 on Ubuntu 17.10 at:
https://bugs.launchpad.net/ubuntu/+source/libreoffice/+bug/1772439 . Since my
initial report, I have upgraded to LO 6.0.3.2, where the problem persists,
although the actual output is different. Another user on Launchpad also
confirmed the bug on LO 6.0.3.2.
Steps to Reproduce:
1. In a new Writer document, type some text in Arabic. My example sentence was:
اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ
2. Create a PDF.
3. Open the created PDF with a PDF reader (such as evince) and type one of the
words in the Search dialog, e.g. بلال. Alternatively select the word in the PDF
reader and copy-paste it somewhere else. You can also convert the PDF to text
using a utility like pdftotext.
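Step 3 can also be scripted. The sketch below assumes that Poppler's pdftotext is installed and on the PATH, and that a plain substring match is a fair stand-in for a PDF reader's search. The function names and the file name in the usage note are hypothetical.

```python
import subprocess


def pdf_text(pdf_path):
    """Extract the text of a PDF with pdftotext.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")


def unsearchable_words(sentence, extracted_text):
    """Return the words of `sentence` that do not occur in `extracted_text`."""
    return [word for word in sentence.split() if word not in extracted_text]
```

For example, `unsearchable_words(sentence, pdf_text("example.pdf"))` lists every word of the typed sentence that a text search over the exported file would fail to find. An empty list corresponds to the expected result.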
Actual Results:
The PDF reader reports that there are no matches for some of the words in the
document, although they are all clearly visible. Selecting a word and
copy-pasting it garbles it. The output of pdftotext is also garbled.
Expected Results:
All the words that are visible should also be searchable in a PDF reader,
copy-pasting should preserve the text, and the output of pdftotext should match
the original document.
Reproducible: Always
User Profile Reset: No
Additional Info: