[Poppler-bugs] [Bug 37900] pdftotext -htmlmeta and pdftohtml fail to decode U+2019

bugzilla-daemon at freedesktop.org
Sat Jun 4 03:24:37 PDT 2011


https://bugs.freedesktop.org/show_bug.cgi?id=37900

--- Comment #2 from Steven Murdoch <sjm217-freedesktop at srcf.ucam.org> 2011-06-04 03:24:37 PDT ---
Created an attachment (id=47513)
 --> (https://bugs.freedesktop.org/attachment.cgi?id=47513)
Test case demonstrating problem with U+2019 in title

Attached as requested. The test case was generated with Word 2007 + Acrobat 9,
the same combination that produced the document which originally showed the
problem.

$ pdftotext -htmlmeta /tmp/u2019test.pdf - | xxd | less
...
00000a0: 6d6c 223e 0a3c 6865 6164 3e0a 3c74 6974  ml">.<head>.<tit
00000b0: 6c65 3e54 6573 7420 6f66 2070 6466 746f  le>Test of pdfto
00000c0: 7465 7874 c290 7320 636f 6e76 6572 7369  text..s conversi
00000d0: 6f6e 206f 6620 552b 3230 3139 2e3c 2f74  on of U+2019.</t
...

[0xc2 0x90 is the UTF-8 encoding of U+0090, a C1 control character, not the
expected U+2019]

$ pdfinfo /tmp/u2019test.pdf | xxd | less

...
0000000: 5469 746c 653a 2020 2020 2020 2020 2020  Title:          
0000010: 5465 7374 206f 6620 7064 6674 6f74 6578  Test of pdftotex
0000020: 74e2 8099 7320 636f 6e76 6572 7369 6f6e  t...s conversion
0000030: 206f 6620 552b 3230 3139 2e0a 4175 7468   of U+2019..Auth
0000040: 6f72 3a20 2020 2020 2020 2020 736a 6d32  or:         sjm2
...

[0xe2 0x80 0x99 is the UTF-8 encoding of U+2019, so pdfinfo outputs the title
correctly]
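
The byte sequences in the two hex dumps can be checked independently of
poppler. The following is a minimal Python sketch, not part of the original
report: it decodes the bytes copied from each dump and prints the resulting
code points. The helper name codepoint and the constant PDFDOC_QUOTERIGHT are
made up for illustration. The last step rests on an assumption that this
report does not confirm: that the title is stored in PDFDocEncoding, where
byte 0x90 stands for the right single quotation mark, and that the byte value
is being used directly as a Unicode code point.

```python
# Byte sequences copied from the two hex dumps above.
htmlmeta_bytes = bytes.fromhex("c290")    # from pdftotext -htmlmeta
pdfinfo_bytes = bytes.fromhex("e28099")   # from pdfinfo

# Decode each sequence as UTF-8 to recover the character it represents.
htmlmeta_char = htmlmeta_bytes.decode("utf-8")
pdfinfo_char = pdfinfo_bytes.decode("utf-8")


def codepoint(ch: str) -> str:
    """Format a single character as U+XXXX (hypothetical helper)."""
    return f"U+{ord(ch):04X}"


print(codepoint(htmlmeta_char))  # U+0090
print(codepoint(pdfinfo_char))   # U+2019

# Assumption, not confirmed in this report: the title uses PDFDocEncoding,
# in which byte 0x90 is the right single quotation mark. Treating that byte
# value as a code point and encoding it as UTF-8 reproduces the bad output.
PDFDOC_QUOTERIGHT = 0x90
print(chr(PDFDOC_QUOTERIGHT).encode("utf-8").hex())  # c290
```

If that assumption holds, it would account for the c2 90 bytes seen in the
-htmlmeta output, but it is only one possible cause.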


