[Poppler-bugs] [Bug 37900] pdftotext -htmlmeta and pdftohtml fail to decode U+2019
bugzilla-daemon at freedesktop.org
Sat Jun 4 03:24:37 PDT 2011
https://bugs.freedesktop.org/show_bug.cgi?id=37900
--- Comment #2 from Steven Murdoch <sjm217-freedesktop at srcf.ucam.org> 2011-06-04 03:24:37 PDT ---
Created an attachment (id=47513)
--> (https://bugs.freedesktop.org/attachment.cgi?id=47513)
Test case demonstrating problem with U+2019 in title
Attached as requested (generated by Word 2007 + Acrobat 9, the same as the
document that was actually causing the problem).
$ pdftotext -htmlmeta /tmp/u2019test.pdf - | xxd | less
...
00000a0: 6d6c 223e 0a3c 6865 6164 3e0a 3c74 6974 ml">.<head>.<tit
00000b0: 6c65 3e54 6573 7420 6f66 2070 6466 746f le>Test of pdfto
00000c0: 7465 7874 c290 7320 636f 6e76 6572 7369 text..s conversi
00000d0: 6f6e 206f 6620 552b 3230 3139 2e3c 2f74 on of U+2019.</t
...
[0xc2 0x90 is the UTF-8 encoding of U+0090]
$ pdfinfo /tmp/u2019test.pdf | xxd | less
...
0000000: 5469 746c 653a 2020 2020 2020 2020 2020 Title:          
0000010: 5465 7374 206f 6620 7064 6674 6f74 6578 Test of pdftotex
0000020: 74e2 8099 7320 636f 6e76 6572 7369 6f6e t...s conversion
0000030: 206f 6620 552b 3230 3139 2e0a 4175 7468  of U+2019..Auth
0000040: 6f72 3a20 2020 2020 2020 2020 736a 6d32 or:         sjm2
...
[0xe2 0x80 0x99 is the UTF-8 encoding of U+2019]
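The two byte sequences noted above can be checked with a short Python sketch. It also illustrates one possible explanation for the bad output, which is an assumption and not something confirmed in this report: PDFDocEncoding assigns byte 0x90 to the right single quote, so copying that byte straight through as a code point would yield U+0090. The names PDFDOC_EXCERPT, decode_as_code_points, decode_with_table and the raw_title bytes are hypothetical and not taken from Poppler or from the attachment.

```python
# Check the two bracketed notes: the UTF-8 bytes for each code point.
assert "\u0090".encode("utf-8") == b"\xc2\x90"
assert "\u2019".encode("utf-8") == b"\xe2\x80\x99"

# Hypothetical one-entry excerpt of PDFDocEncoding (0x90 is quoteright).
# A real table covers many more bytes; this is only for illustration.
PDFDOC_EXCERPT = {0x90: "\u2019"}


def decode_as_code_points(raw: bytes) -> str:
    """Treat every byte as the Unicode code point with the same value."""
    return "".join(chr(b) for b in raw)


def decode_with_table(raw: bytes) -> str:
    """Map bytes through the excerpt table, falling back to the byte value."""
    return "".join(PDFDOC_EXCERPT.get(b, chr(b)) for b in raw)


# Assumed raw title fragment; NOT extracted from the attached PDF.
raw_title = b"pdftotext\x90s"

wrong = decode_as_code_points(raw_title).encode("utf-8")
right = decode_with_table(raw_title).encode("utf-8")

print(wrong)  # contains c2 90, as in the pdftotext -htmlmeta dump
print(right)  # contains e2 80 99, as in the pdfinfo dump
```

If this guess is right, the -htmlmeta and pdftohtml paths would be skipping a table lookup that pdfinfo performs, but that would need to be confirmed against the Poppler source.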
--
Configure bugmail: https://bugs.freedesktop.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.