[poppler] pdftohtml produces invalid XML

Reece Dunn msclrhd at googlemail.com
Tue Nov 3 14:38:57 PST 2009


2009/11/3 Piotr Findeisen <piotr.findeisen at azouk.com>:
> Hi!
>
> I started using pdftohtml from Debian's poppler-utils package for document
> analysis and ran across a problem: `pdftohtml -xml' can produce invalid
> XML on output (at least invalid for python xml tools).
>
> Test case:
>
> # wget -q http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf && \
>     pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \
>     python -c 'from xml.parsers.expat import ParserCreate; ParserCreate().ParseFile(open("x.xml"))'

I'm not sure what the fix is, but the line with the error is:
    <text top="632" left="152" width="58" height="0"
font="7">¥§¦©¨¥§¦¨ ¦</text>
and Firefox gives:
    <text top="632" left="152" width="58" height="0"
font="7">¥§¦©¨¥§¦¨ ¦</text>
    ---------------------------------------------------------------^
(that is, it is choking on the U+0011 character; there are also
other characters in the C0 control character range (c < 0x20)).
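
To pin down the exact position without going through a browser, expat's
own error report can be used. This is only a sketch, assuming Python 3;
the helper name first_xml_error and the sample bytes are made up for
illustration and are not taken from the real x.xml:

```python
from xml.parsers.expat import ParserCreate, ExpatError, ErrorString


def first_xml_error(data: bytes):
    """Return (line, column offset, message) for the first expat error, or None."""
    parser = ParserCreate()
    try:
        # True marks this as the final (and only) chunk of input.
        parser.Parse(data, True)
    except ExpatError as err:
        return err.lineno, err.offset, ErrorString(err.code)
    return None


# Made-up sample with a raw U+0011 inside a <text> element.
sample = b'<?xml version="1.0" encoding="UTF-8"?>\n<text font="7">a\x11b</text>\n'
print(first_xml_error(sample))
```

For the real output, reading the file in binary mode and passing its
contents (open("x.xml", "rb").read()) should work the same way.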

This will cause any conforming XML parser to choke, as XML 1.0 does
not allow these characters (below 0x20, only tab, LF and CR are
permitted). What I don't know is why/how these are appearing in the
pdftohtml output.
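
A quick way to list every offending character, rather than just the
first one the parser trips over, is to scan the decoded text against
the XML 1.0 "Char" production. A minimal sketch, assuming Python 3 and
that the file decodes as UTF-8; find_invalid_xml_chars is a
hypothetical helper name and the sample string is invented:

```python
import re

# Anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Yield (line, column, codepoint) for each character XML 1.0 forbids."""
    # Split on "\n" only: str.splitlines() would also split on (and swallow)
    # some of the control characters we are looking for.
    for lineno, line in enumerate(text.split("\n"), start=1):
        for match in _INVALID_XML_CHARS.finditer(line):
            yield lineno, match.start() + 1, ord(match.group())


# Invented sample: U+0011 and U+0001 inside a <text> element on line 2.
sample = 'ok\n<text font="7">a\x11b\x01</text>'
for line, col, cp in find_invalid_xml_chars(sample):
    print("line %d, column %d: U+%04X" % (line, col, cp))
```

This only reports where the bad characters are; it says nothing about
why pdftohtml emits them.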

Looking at the PDF in okular (which appears to render it correctly)
shows a mathematical equation where the faulty lines are.
Specifically:

<text top="606" left="101" width="173" height="10" font="6">Digital
signal processing basic formula:</text>
<text top="632" left="101" width="25" height="10" font="6">y(t) =</text>
<text top="626" left="133" width="0" height="0" font="7"> </text>
<text top="631" left="133" width="0" height="0" font="7">¡</text>
<text top="647" left="128" width="0" height="0" font="7">¢</text>
<text top="646" left="134" width="11" height="0" font="7"> ¤£</text>
<text top="632" left="152" width="58" height="0"
font="7">¥§¦©¨¥§¦¨ ¦</text>

should be (in the proper math layout for this formula):

    y(t) = integral [above: inf, below: -inf] h(u)x(t - u)du

where the h(u)x(t - u)du bit is in the stylised script used in maths.

My initial thought was that the characters are Unicode codepoints
(e.g. in the U+2100 range). However, these all appear to be in the
single-byte (ASCII/Latin-1) range (i.e. not the multi-byte UTF-8 that
the encoding suggests, but I may be wrong as there look to be more
characters than what is displayed).

Instead, they look like codepoints into a special mathematical font
(e.g. Symbol(?) on Windows; I don't have a Windows box to hand at the
moment, so can't verify the font name). This would make sense given
the font="7" attribute and the seemingly random characters. And given
the greater number of characters, this looks to be using a non-UTF-8
multi-byte encoding.
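
One way to check these guesses is to dump the actual codepoints behind
the text, including the ones that are not displayed. A rough sketch,
assuming Python 3 and that the bytes really are UTF-8 (which is itself
one of the things in doubt here); describe_codepoints is a made-up
helper name and the input is an invented two-character example:

```python
import unicodedata


def describe_codepoints(raw: bytes, encoding="utf-8"):
    """Decode raw bytes and return a (codepoint, name) pair per character."""
    text = raw.decode(encoding)
    # Control characters have no Unicode name, hence the fallback string.
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in text]


# Invented example: a visible yen sign followed by an invisible U+0011.
raw = "\u00a5\x11".encode("utf-8")
print(len(raw), "bytes")
for codepoint, name in describe_codepoints(raw):
    print(codepoint, name)
```

If decoding the real element content as UTF-8 fails, or yields far more
codepoints than visible glyphs, that would support the idea that these
are raw font codes rather than proper Unicode text.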

Someone will need to dig around in the pdftohtml code and how it
handles non-ASCII characters when writing the XML output.

HTH,
- Reece

