<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#ffffff">
Hi!<br>
<br>
I started using pdftohtml from Debian's poppler-utils package for
document analysis and ran into a problem: `pdftohtml -xml' can
produce XML that is not well-formed (at least Python's XML tools reject it).<br>
<br>
Test case:<br>
<blockquote>
<pre># wget -q <a class="moz-txt-link-freetext" href="http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf">http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf</a> &amp;&amp; \
pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x &amp;&amp; \
python -c 'from xml.parsers.expat import ParserCreate; ParserCreate().ParseFile(open("x.xml"))'</pre>
<pre>Page-1
Traceback (most recent call last):
File "&lt;string&gt;", line 2, in &lt;module&gt;
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 45, column 63</pre>
</blockquote>
The problematic character is \x11, a control character that XML 1.0 does not allow.<br>
<br>
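To double-check which characters trip the parser, here is a minimal
sketch that lists every character outside the range XML 1.0 allows. It
assumes Python 3, and the helper name find_illegal_chars is my own
(hypothetical), not part of any library.<br>
<blockquote>
<pre>
```python
import re

# Matches anything outside the XML 1.0 "Char" production: control
# characters other than tab, LF and CR, surrogates, U+FFFE and U+FFFF.
ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_chars(text):
    """Return (line, column, repr) for each character XML 1.0 forbids.

    Lines are 1-based; columns are 0-based character offsets.
    """
    hits = []
    # Split on "\n" only: str.splitlines() would also split on some of
    # the control characters this function is meant to report.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in ILLEGAL_XML_CHARS.finditer(line):
            hits.append((line_no, match.start(), repr(match.group())))
    return hits
```
</pre>
</blockquote>
It could be run on the text of x.xml, for example read with
open("x.xml", encoding="utf-8", errors="replace").read(), assuming the
output is UTF-8. The offsets it reports are character offsets, so they
may not match the column numbers expat prints.<br>
<br>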
I'm running version 0.12 of pdftohtml, installed from the Debian
poppler-utils_0.12.0-2_i386 package.<br>
<blockquote>
<pre>pdftohtml -v
pdftohtml version 0.12.0
Copyright 2005-2009 The Poppler Developers - <a class="moz-txt-link-freetext" href="http://poppler.freedesktop.org">http://poppler.freedesktop.org</a>
Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
Copyright 1996-2004 Glyph &amp; Cog, LLC
</pre>
</blockquote>
<br>
How can I work around this problem?<br>
Best regards,<br>
Piotr Findeisen<br>
</body>
</html>