[Poppler-bugs] [Bug 32522] Some letters are in wrong order in the output of pdftotext
bugzilla-daemon at freedesktop.org
Mon Dec 20 14:00:32 PST 2010
https://bugs.freedesktop.org/show_bug.cgi?id=32522
--- Comment #2 from Adrian Johnson <ajohnson at redneon.com> 2010-12-20 14:00:32 PST ---
Looking at the PDF, the string is printed with this operation:
<00CD 00A3 0095 0070 00B4 002A >Tj
I added the spaces for readability.
The ToUnicode map is:
6 beginbfchar
<002A> <0627>
<0070> <062D>
<0095> <0644>
<00A3> <064A>
<00B4> <06440645>
<00CD> <064A0646>
endbfchar
So when the text is extracted, the sequence of Unicode values is:
064A0646 064A 0644 062D 06440645 0627
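The lookup above can be reproduced with a short Python sketch. It assumes every character code in the Tj string is two bytes wide, which matches the codes shown in the bfchar table; the names `TO_UNICODE` and `decode_tj` are made up for illustration and are not part of poppler.

```python
# ToUnicode bfchar entries copied from the map quoted above.
TO_UNICODE = {
    0x002A: "\u0627",
    0x0070: "\u062D",
    0x0095: "\u0644",
    0x00A3: "\u064A",
    0x00B4: "\u0644\u0645",
    0x00CD: "\u064A\u0646",
}


def decode_tj(hex_string):
    """Split a hex string into 2-byte codes (assumption) and map each one."""
    digits = "".join(hex_string.split())
    codes = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
    return [TO_UNICODE[code] for code in codes]


glyphs = decode_tj("00CD 00A3 0095 0070 00B4 002A")
# One hex group per glyph, so two-character mappings stay visibly grouped.
as_hex = ["".join(f"{ord(ch):04X}" for ch in glyph) for glyph in glyphs]
print(" ".join(as_hex))
```

Each list element corresponds to one glyph, so the two glyphs that map to two characters remain distinguishable from the single-character ones.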
The output from
pdftotext -enc UCS-2 049.pdf - | hexdump -C
is
00000000  20 2b 06 27 06 45 06 44  06 2d 06 44 06 4a 06 46  | +.'.E.D.-.D.J.F|
00000010  06 4a 20 2c 00 0a 00 0a  00 0c                    |.J ,......|
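To read the dump as code points rather than bytes, a sketch like the following may help. It assumes the UCS-2 output is big-endian with no byte order mark, which is what the byte pairs in the dump suggest; the bytes are copied from the hexdump above.

```python
# Bytes copied from the hexdump above (26 bytes, i.e. 13 code units).
raw = bytes.fromhex(
    "202b 0627 0645 0644 062d 0644 064a 0646"
    "064a 202c 000a 000a 000c"
)

# Assumption: big-endian 16-bit units; none of these values are surrogates,
# so decoding as UTF-16BE gives the same result as UCS-2.
text = raw.decode("utf-16-be")
code_points = [f"{ord(ch):04X}" for ch in text]
print(" ".join(code_points))
```

The first and last visible code points, U+202B and U+202C, are the RIGHT-TO-LEFT EMBEDDING and POP DIRECTIONAL FORMATTING controls that wrap the Arabic run; the trailing 000A 000A 000C are two line feeds and a form feed.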
pdftotext has output the Unicode characters in reverse order, as you
would expect for an RTL script. However, it looks like the glyphs that map to
two Unicode characters also have their characters reversed: <06440645>
comes out as 0645 0644 and <064A0646> as 0646 064A.
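The difference between the two kinds of reversal can be checked with a small sketch. It does not describe how pdftotext is implemented; it only compares the observed output against two hypothetical strategies, reversing glyph by glyph versus reversing character by character.

```python
# Per-glyph Unicode strings in the order they appear in the Tj operation.
glyphs = ["\u064A\u0646", "\u064A", "\u0644", "\u062D", "\u0644\u0645", "\u0627"]

# Arabic characters observed in the pdftotext output, without the
# U+202B / U+202C wrappers and the trailing line feeds and form feed.
observed = "\u0627\u0645\u0644\u062D\u0644\u064A\u0646\u064A"

# Strategy A: reverse the glyph order, keep each glyph's characters intact.
per_glyph = "".join(reversed(glyphs))

# Strategy B: flatten to characters first, then reverse everything.
per_char = "".join(glyphs)[::-1]

print("matches per-glyph reversal:", observed == per_glyph)
print("matches per-character reversal:", observed == per_char)
```

Under these inputs the observed output matches the character-level reversal and not the glyph-level one, which is consistent with the two-character mappings being flipped.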