[poppler] pdftohtml produces invalid XML

Tue Nov 3 14:49:44 PST 2009

2009/11/3 Reece Dunn <msclrhd at googlemail.com>:
> 2009/11/3 Piotr Findeisen <piotr.findeisen at azouk.com>:
>> Hi!
>>
>> I started using pdftohtml from Debian's poppler-utils package for document
>> analysis and ran across a problem: `pdftohtml -xml' can produce invalid
>> XML on output (at least invalid for Python's XML tools).
>>
>> Test case:
>>
>> # wget -q http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf && \
>>     pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \
>>     python -c 'from xml.parsers.expat import ParserCreate;
>> ParserCreate().ParseFile(open("x.xml"))'
>
> I'm not sure what the fix is, but the line with the error is:
>    <text top="632" left="152" width="58" height="0"
> font="7">¥§¦©¨   ¥§    ¦ ¨   ¦</text>
> and firefox gives:
>    <text top="632" left="152" width="58" height="0"
> font="7">¥§¦©¨   ¥§    ¦ ¨   ¦</text>
>    ---------------------------------------------------------------^
> (that is -- it is choking on the [00|11] character; there are also
> other characters in the latin-1 control character range (c < 0x20)).
>
> This will cause any conforming XML parser to choke, as XML 1.0 does
> not allow these control characters in a document. What I don't know
> is why/how they are appearing in the pdftohtml output.
>
> Looking at the PDF in okular (which appears to render the PDF
> correctly there), shows a mathematical equation for the faulty lines,
> specifically:
>
> <text top="606" left="101" width="173" height="10" font="6">Digital
> signal processing basic formula:</text>
> <text top="632" left="101" width="25" height="10" font="6">y(t) =</text>
> <text top="626" left="133" width="0" height="0" font="7"> </text>
> <text top="631" left="133" width="0" height="0" font="7">¡</text>
> <text top="647" left="128" width="0" height="0" font="7">¢</text>
> <text top="646" left="134" width="11" height="0" font="7"> ¤£</text>
> <text top="632" left="152" width="58" height="0"
> font="7">¥§¦©¨   ¥§    ¦ ¨   ¦</text>
>
> should be (in the proper math layout for this formula):
>
>    y(t) = integral [above: inf, below: -inf] h(u)x(t - u)du
>
> where the h(u)x(t - u)du bit is in the stylised script used in maths.
>
> My initial thought is that the characters are referencing the Unicode
> codepoints (e.g. in the U+2100 range). However, these all appear to be
> in the ascii range (i.e. not multi-byte UTF-8 as the encoding
> suggests, but I may be wrong as there look to be more characters than
> what is displayed).
>
> Instead, they look like they are codepoints into a special
> mathematical font (e.g. Symbol(?) in Windows (I don't have a Windows
> box to hand at the moment, so can't verify the font name)). This would
> make sense given the font="7" attribute and the seemingly random
> characters. And given the greater number of characters, this looks to
> be using a non-UTF-8 multi-byte encoding.
>
> Someone will need to dig around in the pdftohtml code and the
> rendering of non-ASCII characters.

As a follow-up...

Not using the -xml option of pdftohtml causes it to write an HTML file
that is similarly mangled w.r.t. the characters in the formula (from
the integral to the du differential component).

In addition to this, the layout does not match the formula for the
integral (not sure whether the ¢ is meant to be the integral sign or
not; if it is supposed to be the infinity sign, it should be above the
integral, not below it) and the font size is not consistent with the
"y(t) = " part. These rendering issues are obviously orthogonal to the
encoding issue.
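
For anyone who wants to locate the offending characters without relying
on a parser's error message, here is a minimal Python 3 sketch. It is
not taken from poppler or from any XML library; the function name
find_invalid_xml_chars and the pattern name are made up for this
example. The only rule it assumes is the XML 1.0 character range (tab,
LF, CR, and U+0020 upwards, excluding surrogates and U+FFFE/U+FFFF).

```python
import re

# Matches every character that falls outside the XML 1.0 "Char" range.
# Allowed: tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return a list of (line, column, codepoint) tuples, 1-based,
    one for each character in `text` that XML 1.0 does not allow."""
    hits = []
    # Split on "\n" only: str.splitlines() would also split on some of
    # the very control characters (e.g. form feed) we are looking for.
    for lineno, line in enumerate(text.split("\n"), start=1):
        for match in INVALID_XML_CHARS.finditer(line):
            hits.append((lineno, match.start() + 1, ord(match.group())))
    return hits
```

To try it on the file from the test case, something along the lines of
find_invalid_xml_chars(open("x.xml", encoding="utf-8",
errors="replace").read()) should list the positions to look at. It only
reports where the bad characters are; it says nothing about why
pdftohtml emits them.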

- Reece