[Poppler-bugs] [Bug 24890] New: pdftohtml -xml produces invalid XML markup

Wed Nov 4 00:44:18 PST 2009

http://bugs.freedesktop.org/show_bug.cgi?id=24890

           Summary: pdftohtml -xml produces invalid XML markup
           Product: poppler
           Version: unspecified
          Platform: Other
               URL: http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: high
         Component: general
        AssignedTo: poppler-bugs at lists.freedesktop.org
        ReportedBy: piotr.findeisen at gmail.com

Created an attachment (id=30952)
 --> (http://bugs.freedesktop.org/attachment.cgi?id=30952)
PDF that causes pdftohtml to produce invalid XML

On certain PDF files, the pdftohtml utility run with the '-xml' option
produces XML that is invalid (not well-formed) and cannot be parsed by
strictly conforming parsers.

Test case:
# wget -q http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf && \
    pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \
    python -c 'from xml.parsers.expat import ParserCreate;
ParserCreate().ParseFile(open("x.xml"))'

With pdftohtml versions 0.6, 0.10, and 0.12 and Python 2.5, it produces:
Page-1
Traceback (most recent call last):
    File "<string>", line 2, in <module>
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 45, column 63
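
For anyone who wants the error position without reading a traceback, the same
check can be wrapped in a small function. This is only a sketch: it assumes
Python 3 rather than the Python 2.5 used above, and the name first_xml_error
is made up for illustration.

```python
from xml.parsers.expat import ParserCreate, ExpatError, ErrorString


def first_xml_error(data):
    """Return None if data (bytes) is well-formed XML.

    Otherwise return (line, column, message) for the first error expat
    reports. The line is 1-based; expat counts columns from 0.
    """
    parser = ParserCreate()
    try:
        # The second argument tells expat that this is the final chunk.
        parser.Parse(data, True)
    except ExpatError as err:
        return (err.lineno, err.offset, ErrorString(err.code))
    return None
```

With the test case above it could be called as
first_xml_error(open("x.xml", "rb").read()).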

The offending byte is an ASCII control byte (0x11). It's followed by other
ASCII control bytes.
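
To list every such byte in the output, something like the following could be
used. It is a rough sketch under a few assumptions: Python 3, an output
encoding that is ASCII-compatible (such as UTF-8), and a hypothetical helper
name find_illegal_control_bytes. XML 1.0 allows only tab, LF and CR among the
control characters below 0x20, so everything else in that range is reported.

```python
import re

# Control bytes below 0x20 that XML 1.0 does not permit.
# Tab (0x09), LF (0x0A) and CR (0x0D) are allowed and therefore excluded.
ILLEGAL_CONTROL = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_illegal_control_bytes(data):
    """Return a list of (line, column, byte_value) for each disallowed byte.

    Lines are 1-based and columns are 0-based byte offsets within the line.
    """
    hits = []
    for match in ILLEGAL_CONTROL.finditer(data):
        pos = match.start()
        line = data.count(b"\n", 0, pos) + 1
        column = pos - (data.rfind(b"\n", 0, pos) + 1)
        hits.append((line, column, data[pos]))
    return hits
```

For the file produced by the test case it could be run as
find_illegal_control_bytes(open("x.xml", "rb").read()).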

I attached burlacu_mihai.pdf here for reference in case it becomes unavailable
on the original server. I'll contact the copyright owner for permission and
will remove the file if the author doesn't allow its redistribution.
