[poppler] pdftohtml produces invalid XML
Piotr Findeisen
piotr.findeisen at azouk.com
Tue Nov 3 03:48:46 PST 2009
Hi!
I started using pdftohtml from Debian's poppler-utils package for
document analysis and ran across a problem: `pdftohtml -xml' can
produce invalid XML on output (at least invalid for Python's XML tools).
Test case:
# wget -q http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf && \
pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \
python -c 'from xml.parsers.expat import ParserCreate; ParserCreate().ParseFile(open("x.xml"))'
Page-1
Traceback (most recent call last):
File "<string>", line 2, in <module>
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 45, column 63
The problematic character is \x11, a control character that XML 1.0
does not allow anywhere in a document.
I'm running version 0.12 of pdftohtml, installed from the Debian
poppler-utils_0.12.0-2_i386 package.
pdftohtml -v
pdftohtml version 0.12.0
Copyright 2005-2009 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
Copyright 1996-2004 Glyph & Cog, LLC
How can I work around this problem?
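One possible stopgap, as a minimal sketch only: filter the forbidden
control bytes out of x.xml before handing it to expat. XML 1.0 permits
only tab, LF and CR below 0x20, so everything else in that range can be
dropped. The helper names below are made up for illustration (they are
not part of poppler or Python), and the byte-level filter assumes the
output is UTF-8 or another ASCII-compatible encoding. It also silently
discards whatever the \x11 was meant to represent, so it hides the
problem rather than fixing pdftohtml.

```python
import re
from xml.parsers.expat import ParserCreate

# C0 control bytes that XML 1.0 forbids: everything below 0x20
# except tab (0x09), LF (0x0a) and CR (0x0d).
# Assumption: the file is UTF-8 (or another ASCII-compatible encoding),
# where these byte values never occur inside a multi-byte character.
_INVALID_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_invalid_xml_bytes(data):
    """Return data with the forbidden control bytes removed (hypothetical helper)."""
    return _INVALID_XML_BYTES.sub(b"", data)


def parse_sanitized(path):
    """Read path, strip forbidden bytes, and parse the result with expat."""
    with open(path, "rb") as f:
        data = strip_invalid_xml_bytes(f.read())
    parser = ParserCreate()
    # The second argument marks this as the final chunk of input.
    parser.Parse(data, True)
```

With this, parse_sanitized("x.xml") would take the place of the
ParseFile call in the test case above.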
Best regards,
Piotr Findeisen