[poppler] pdftohtml produces invalid XML

Albert Astals Cid aacid at kde.org
Tue Nov 3 13:43:08 PST 2009


On Tuesday, 3 November 2009, Piotr Findeisen wrote:
> Hi!
> 
> I started using pdftohtml from Debian's poppler-utils package for
> document analysis and ran across a problem: `pdftohtml -xml' can
> produce invalid XML on output (at least invalid for Python XML tools).
> 
> Test case:
> 
>     # wget -q http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf && \
>       pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \
>       python -c 'from xml.parsers.expat import ParserCreate;
>     ParserCreate().ParseFile(open("x.xml"))'
> 
>     Page-1
>     Traceback (most recent call last):
>       File "<string>", line 2, in <module>
>     xml.parsers.expat.ExpatError: not well-formed (invalid token): line 45, column 63
> 
> The problematic character is \x11.
> 
> I'm running version 0.12 of pdftohtml, installed from Debian
> poppler-utils_0.12.0-2_i386 package.
> 
>     pdftohtml -v
>     pdftohtml version 0.12.0
>     Copyright 2005-2009 The Poppler Developers - http://poppler.freedesktop.org
>     Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
>     Copyright 1996-2004 Glyph & Cog, LLC

Can you please post a bug at bugs.freedesktop.org?

> How can I work around this problem?

You can code a patch or wait until someone fixes it.
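
As a stopgap on the consumer side, one option might be to strip the
control characters that XML 1.0 forbids before handing the file to expat.
The following is only a minimal sketch, not a fix for pdftohtml itself.
It assumes the output is in an ASCII-compatible encoding such as UTF-8,
so that these byte values never occur inside a multi-byte character. The
helper names strip_illegal_xml_bytes and parse_cleaned are made up for
illustration.

```python
import re
from xml.parsers.expat import ParserCreate

# Bytes below 0x20 that XML 1.0 does not allow (tab, LF and CR are legal).
# Assumption: the input is ASCII-compatible (e.g. UTF-8), so these byte
# values never appear as part of a multi-byte character.
_ILLEGAL_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_illegal_xml_bytes(data: bytes) -> bytes:
    """Return data with the forbidden control bytes removed."""
    return _ILLEGAL_XML_BYTES.sub(b"", data)


def parse_cleaned(data: bytes) -> None:
    """Parse the cleaned bytes with expat; raises ExpatError on other errors."""
    ParserCreate().Parse(strip_illegal_xml_bytes(data), True)
```

With the test case above, this would be used by reading x.xml in binary
mode and passing the bytes to parse_cleaned(). Note that the offending
characters are dropped silently, so any text they stood for is lost.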

Albert

> best regards,
> Piotr Findeisen
> 


