[Poppler-bugs] [Bug 18460] New: pdftohtml puts garbage into <title> tags and "Document Outline"

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Sun Nov 9 16:38:39 PST 2008


http://bugs.freedesktop.org/show_bug.cgi?id=18460

           Summary: pdftohtml  puts garbage into <title> tags and "Document
                    Outline"
           Product: poppler
           Version: unspecified
          Platform: Other
               URL: http://www.maths.mq.edu.au/~ross/poppler/ZhangPeng/readme.html
        OS/Version: Mac OS X (All)
            Status: NEW
          Severity: minor
          Priority: low
         Component: general
        AssignedTo: poppler-bugs at lists.freedesktop.org
        ReportedBy: ross at maths.mq.edu.au


Created an attachment (id=20170)
 --> (http://bugs.freedesktop.org/attachment.cgi?id=20170)
Zhang Peng's PDF; it contains a Chinese font

With the attached PDF (supplied by Zhang Peng for another purpose; see
      http://lists.freedesktop.org/archives/poppler/2008-November/004216.html),
pdftohtml fails to set the <title> tags correctly: the title starts with the
bytes <FE><FF>, which are invalid in UTF-8.

Within the "Document Outline" section, both entries start with the same two
bytes, and the first entry is followed by more garbage.
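
A possible explanation, not confirmed in this report: <FE><FF> is the UTF-16BE
byte order mark, which the PDF specification uses to mark a text string (such
as a document title or an outline entry) as UTF-16BE rather than
PDFDocEncoding. If such a string is copied into the HTML byte for byte, the
output cannot be valid UTF-8. The sketch below shows the decoding step that
would be needed. It is Python rather than poppler's own code, the function
name is hypothetical, and Latin-1 is used only as a rough stand-in for
PDFDocEncoding.

```python
def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a PDF text string into a Python str.

    Strings that begin with the byte order mark FE FF are UTF-16BE.
    Anything else is treated as Latin-1 here, which is only an
    approximation of PDFDocEncoding (an assumption for this sketch).
    """
    if raw.startswith(b"\xfe\xff"):
        # Drop the two BOM bytes, then decode the rest as big-endian UTF-16.
        return raw[2:].decode("utf-16-be")
    return raw.decode("latin-1")


# Hypothetical title bytes: BOM followed by two Chinese characters.
raw_title = b"\xfe\xff\x8b\xf4\x66\x0e"
title = decode_pdf_text_string(raw_title)

# Re-encoding the decoded string gives bytes that are safe to place
# inside <title> in a UTF-8 HTML page.
html_title_bytes = title.encode("utf-8")
```
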

This can be seen at the URL stated for this bug report:
     http://www.maths.mq.edu.au/~ross/poppler/ZhangPeng/readme.html
(You may need to set the encoding manually to UTF-8.)
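
To find the offending bytes without relying on a browser, the generated HTML
can be scanned for positions where UTF-8 decoding fails. This is a small
Python sketch; the function name is made up for illustration, and the sample
bytes are not taken from the actual pdftohtml output.

```python
def find_invalid_utf8(data: bytes) -> list:
    """Return the offsets of bytes at which UTF-8 decoding fails."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder is valid UTF-8
        except UnicodeDecodeError as err:
            # err.start is relative to the slice, so add pos back on.
            offsets.append(pos + err.start)
            pos += err.start + 1  # skip the bad byte and continue
    return offsets


# Illustrative input only: a title element that begins with FE FF.
sample = b"<title>\xfe\xffabc</title>"
bad_offsets = find_invalid_utf8(sample)
```

For a real check, the bytes would come from reading the HTML file in binary
mode, e.g. `open(path, "rb").read()`.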


Facts:
-----
The document contains Chinese characters, with the following font info:

<</Subtype/Type0
/DescendantFonts 33 0 R
/BaseFont/AdobeSongStd-Light
/Encoding/UniGB-UCS2-H
/Type/Font>>

There is no embedded CMap resource:

> pdffonts readme.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
AdobeSongStd-Light                   CID Type 0        no  no  no      32  0


Observations:
------------
(see also
http://lists.freedesktop.org/archives/poppler/2008-November/004220.html)

pdftotext worked fine for me, with both Poppler v0.8.2 and Poppler v0.10.0.

However, readme.pdf caused problems with some other software:

   Adobe Reader v8.1.0 and v9.0.0
       both showed just blank pages

   Adobe Acrobat Pro v8.1.2
       displayed the PDF just fine

   Preview (Mac OS X, v10.4.11)
       displayed the PDF just fine

pdftohtml translated the PDF to a 2-page HTML with frames,
*but* with the errors described above.

