[Poppler-bugs] [Bug 24890] New: pdftohtml -xml produces invalid XML markup
bugzilla-daemon at freedesktop.org
Wed Nov 4 00:44:18 PST 2009
http://bugs.freedesktop.org/show_bug.cgi?id=24890
Summary: pdftohtml -xml produces invalid XML markup
Product: poppler
Version: unspecified
Platform: Other
URL: http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf
OS/Version: All
Status: NEW
Severity: normal
Priority: high
Component: general
AssignedTo: poppler-bugs at lists.freedesktop.org
ReportedBy: piotr.findeisen at gmail.com
Created an attachment (id=30952)
--> (http://bugs.freedesktop.org/attachment.cgi?id=30952)
PDF that causes pdftohtml to produce invalid XML
On certain PDF files, the pdftohtml utility run with the '-xml' option produces
XML that is not well-formed and is rejected by strictly conforming parsers.
Test case:
# wget -q http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf && \
pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \
python -c 'from xml.parsers.expat import ParserCreate;
ParserCreate().ParseFile(open("x.xml"))'
With pdftohtml versions 0.6, 0.10, and 0.12 and Python 2.5, this produces:
Page-1
Traceback (most recent call last):
File "<string>", line 2, in <module>
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 45, column 63
The offending byte is an ASCII control byte (0x11), which XML 1.0 does not
allow in a document. It's followed by other ASCII control bytes.
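To confirm which bytes are involved, something like the following sketch could be run over the generated x.xml. It is only an illustration written for Python 3 (not the Python 2.5 used above), and the names ILLEGAL_XML_BYTES, find_illegal_bytes and strip_illegal_bytes are made up for this example; they are not part of poppler or expat. It assumes the only problem is control bytes below 0x20 other than tab, LF and CR.

```python
import re

# XML 1.0 allows only tab (0x09), LF (0x0A) and CR (0x0D) below 0x20.
ILLEGAL_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_illegal_bytes(data):
    """Return (line, column, byte value) for each disallowed control byte.

    Lines are 1-based; columns are 0-based byte offsets within the line.
    """
    hits = []
    for match in ILLEGAL_XML_BYTES.finditer(data):
        offset = match.start()
        line = data.count(b"\n", 0, offset) + 1
        column = offset - (data.rfind(b"\n", 0, offset) + 1)
        hits.append((line, column, data[offset]))
    return hits


def strip_illegal_bytes(data):
    """Workaround only: drop the disallowed bytes so the file can be parsed."""
    return ILLEGAL_XML_BYTES.sub(b"", data)
```

For example, find_illegal_bytes(open("x.xml", "rb").read()) lists the positions to compare against the parser error. Stripping the bytes is a stopgap on the consumer side and loses whatever text those bytes were meant to represent; it does not address the output produced by pdftohtml itself.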
I have attached burlacu_mihai.pdf here for reference in case it becomes
unavailable on the original server. I'll contact the copyright owner for
permission and will remove the file if the author doesn't allow it to be
redistributed.