[Poppler-bugs] [Bug 32522] Some letters are in wrong order in the output of pdftotext

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Mon Dec 20 14:00:32 PST 2010


https://bugs.freedesktop.org/show_bug.cgi?id=32522

--- Comment #2 from Adrian Johnson <ajohnson at redneon.com> 2010-12-20 14:00:32 PST ---
Looking at the PDF, the string is printed with this operation:

  <00CD 00A3 0095 0070 00B4 002A >Tj

I added the spaces for readability.

The ToUnicode map is:

  6 beginbfchar
  <002A> <0627>
  <0070> <062D>
  <0095> <0644>
  <00A3> <064A>
  <00B4> <06440645>
  <00CD> <064A0646>
  endbfchar

So when the text is extracted, the sequence of Unicode characters is:

  064A0646 064A 0644 062D 06440645 0627
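
As a cross-check, the lookup above can be reproduced with a few lines of
Python. This is only a sketch of the mapping step, not how poppler
implements it; the names TJ_HEX, TO_UNICODE and extract are made up for
illustration, and the data is copied from the Tj operand and bfchar
entries quoted above.

```python
# Illustrative only: apply the bfchar entries to the Tj string by hand.
TJ_HEX = "00CD 00A3 0095 0070 00B4 002A"

# Character code -> Unicode text, copied from the ToUnicode map above.
TO_UNICODE = {
    0x002A: "\u0627",
    0x0070: "\u062D",
    0x0095: "\u0644",
    0x00A3: "\u064A",
    0x00B4: "\u0644\u0645",  # one glyph, two Unicode characters
    0x00CD: "\u064A\u0646",  # one glyph, two Unicode characters
}


def extract(tj_hex, cmap):
    """Return the Unicode text of each glyph, in content-stream order."""
    codes = [int(token, 16) for token in tj_hex.split()]
    return [cmap[code] for code in codes]


glyph_texts = extract(TJ_HEX, TO_UNICODE)

# Format each glyph's text as concatenated 4-digit hex code points.
formatted = " ".join(
    "".join(f"{ord(ch):04X}" for ch in text) for text in glyph_texts
)
print(formatted)
```

Each space-separated group in the printed line corresponds to one glyph,
so the two-character groups are the glyphs with a two-character mapping.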

The output from

  pdftotext -enc UCS-2 049.pdf - | hexdump -C

is

  00000000  20 2b 06 27 06 45 06 44  06 2d 06 44 06 4a 06 46  | +.'.E.D.-.D.J.F|
  00000010  06 4a 20 2c 00 0a 00 0a  00 0c                    |.J ,......|

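pdftotext has output the Unicode characters in reverse order, as you
would expect for an RTL script. However, it looks like the glyphs that
map to two Unicode characters (<00B4> and <00CD>) have had their
characters reversed as well: 0644 0645 comes out as 0645 0644, and
064A 0646 comes out as 0646 064A.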
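
To check that reading of the dump, here is a small Python sketch that
decodes the bytes shown by hexdump and compares them with two candidate
orderings: reversing the glyphs while keeping each glyph's characters
in map order, and reversing every character. The names DUMP_HEX, GLYPHS
and decode_dump are made up for illustration, and treating the glyph
level reversal as the intended behaviour is my assumption rather than
something taken from the poppler source.

```python
# Bytes copied from the hexdump above (big-endian UCS-2).
DUMP_HEX = (
    "20 2b 06 27 06 45 06 44 06 2d 06 44 06 4a 06 46 "
    "06 4a 20 2c 00 0a 00 0a 00 0c"
)

# Per-glyph text in content-stream order, from the ToUnicode lookup.
GLYPHS = ["\u064A\u0646", "\u064A", "\u0644", "\u062D", "\u0644\u0645", "\u0627"]


def decode_dump(dump_hex):
    """Decode the dump and drop U+202B (RLE), U+202C (PDF), LF and FF."""
    text = bytes.fromhex(dump_hex.replace(" ", "")).decode("utf-16-be")
    return "".join(ch for ch in text if ch not in "\u202b\u202c\n\f")


def as_hex(text):
    return " ".join(f"{ord(ch):04X}" for ch in text)


observed = decode_dump(DUMP_HEX)

# Candidate 1: reverse the glyph order, keep each glyph's characters intact.
reversed_glyphs = "".join(reversed(GLYPHS))

# Candidate 2: reverse every character, including inside two-character glyphs.
reversed_chars = "".join(GLYPHS)[::-1]

print("observed:        ", as_hex(observed))
print("glyphs reversed: ", as_hex(reversed_glyphs))
print("chars reversed:  ", as_hex(reversed_chars))
```

The observed output matches the second candidate, which is consistent
with the characters inside the two-character glyphs being swapped.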
