[poppler] pdftohtml produces invalid XML

Piotr Findeisen piotr.findeisen at azouk.com
Wed Nov 4 00:20:03 PST 2009


Hello!

On 03.11.2009 23:38, Reece Dunn wrote:
>
>> # wget -q http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf && \
>>     pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \
>>     python -c 'from xml.parsers.expat import ParserCreate; ParserCreate().ParseFile(open("x.xml"))'
>>
> I'm not sure what the fix is, but the line with the error is:
>     <text top="632" left="152" width="58" height="0"
> font="7">¥§¦©¨¥§¦¨ ¦</text>
> and firefox gives:
>     <text top="632" left="152" width="58" height="0"
> font="7">¥§¦©¨¥§¦¨ ¦</text>
>     ---------------------------------------------------------------^
> (that is -- it is choking on the [00|11] character; there are also
> other characters in the latin-1 control character range (c < 0x20)).
>   
Right. 0x11 is the /first/ character to cause a problem with Python's XML parser.
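
For anyone who wants to reproduce this without the PDF, here is a minimal sketch using the same expat binding as the one-liner above. The helper name first_parse_error and the sample strings are made up for illustration; they are not taken from the pdftohtml output.

```python
from xml.parsers.expat import ParserCreate, ExpatError


def first_parse_error(data):
    """Return (line, column) of the first expat error, or None if well-formed.

    Hypothetical helper for illustration; `data` is a complete document as bytes.
    """
    parser = ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except ExpatError as exc:
        return exc.lineno, exc.offset
    return None


# Made-up samples: one with a raw 0x11 byte, one with only a tab.
bad = b'<text font="7">ab\x11cd</text>'
good = b'<text font="7">ab\tcd</text>'

print(first_parse_error(bad))   # a (line, column) pair pointing at the 0x11 byte
print(first_parse_error(good))  # None
```

A tab is accepted while 0x11 is rejected, which matches what both Python and Firefox report for the real file.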


> My initial thought is that the characters are referencing the Unicode
> codepoints (e.g. in the U+2100 range). However, these all appear to be
> in the ascii range (i.e. not multi-byte UTF-8 as the encoding
> suggests, but I may be wrong as there look to be more characters than
> what is displayed).
>   
These problematic characters are all ASCII control characters.
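
To list every offending character rather than just the first, something like the sketch below should do. It assumes the XML 1.0 rule that, below 0x20, only tab (0x09), LF (0x0A) and CR (0x0D) are allowed; the function name find_invalid_controls is hypothetical.

```python
import re

# Control characters that XML 1.0 does not allow in a document:
# everything below 0x20 except tab (0x09), LF (0x0A) and CR (0x0D).
_INVALID_CTRL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_invalid_controls(text):
    """Return a list of (offset, codepoint) for each disallowed control character."""
    return [(m.start(), ord(m.group())) for m in _INVALID_CTRL.finditer(text)]


# Made-up input: 0x11 and 0x02 are reported, the tab is not.
for offset, codepoint in find_invalid_controls("ab\x11c\td\x02"):
    print("offset %d: 0x%02x" % (offset, codepoint))
```

Running this over the text of x.xml (read as a string) would show whether anything other than 0x11 is involved.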


> Instead, they look like they are codepoints into a special
> mathematical font (e.g. Symbol(?) in Windows (I don't have a Windows
> box to hand at the moment, so can't verify the font name)). This would
> make sense given the font="7" attribute and the seemingly random
> characters. And given the greater number of characters, this looks to
> be using a non-UTF-8 multi-byte encoding.
>   
The font="7" attribute is generated by "pdftohtml -xml"; it is a reference to
the <font id="7" ...... /> element near the top of the produced XML document.

And yes, there is some font mapping involved. I tried writing the
equation in a new .tex document, but the resulting PDF contained only
characters I recognize and can read, no matter how I produced it
(pdflatex, latex & dvipdf, etc.).

> Someone will need to dig around in the pdftohtml code and the
> rendering of non-ascii characters.
>   
I agree this is where the problem begins, though I've never seen
pdftohtml's source...

best regards,
Piotr

