<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <meta content="text/html; charset=UTF-8" http-equiv="Content-Type"> </head> <body text="#000000" bgcolor="#ffffff"> Hello! On 03.11.2009 23:38, Reece Dunn wrote: <blockquote cite="mid:3f4fd2640911031438p5104e575pac154bfc58fd2336@mail.gmail.com" type="cite"> <blockquote type="cite"> <pre wrap=""> # wget -q <a class="moz-txt-link-freetext" href="http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf">http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf</a> && \ pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \ python -c 'from xml.parsers.expat import ParserCreate; ParserCreate().ParseFile(open("x.xml"))' </pre> </blockquote> <pre wrap=""> I'm not sure what the fix is, but the line with the error is: <text top="632" left="152" width="58" height="0" font="7">¥§¦©¨¥§¦¨ ¦</text> and Firefox gives: <text top="632" left="152" width="58" height="0" font="7">¥§¦©¨¥§¦¨ ¦</text> ---------------------------------------------------------------^ (that is -- it is choking on the [00|11] character; there are also other characters in the Latin-1 control character range (c < 0x20)). </pre> </blockquote> Right. 0x11 is the first one to cause a problem with Python's XML parser. XML 1.0 does not allow control characters below 0x20 other than tab, LF and CR, so the parser rejects the document at that point. <blockquote cite="mid:3f4fd2640911031438p5104e575pac154bfc58fd2336@mail.gmail.com" type="cite"> <pre wrap=""> My initial thought is that the characters are referencing the Unicode codepoints (e.g. in the U+2100 range). However, these all appear to be in the ASCII range (i.e. not multi-byte UTF-8 as the encoding suggests, but I may be wrong as there look to be more characters than what is displayed). </pre> </blockquote> These problematic characters are all ASCII control characters. <blockquote cite="mid:3f4fd2640911031438p5104e575pac154bfc58fd2336@mail.gmail.com" type="cite"> <pre wrap=""> Instead, they look like they are codepoints into a special mathematical font (e.g. Symbol(?) 
in Windows (I don't have a Windows box to hand at the moment, so can't verify the font name)). This would make sense given the font="7" attribute and the seemingly random characters. And given the greater number of characters, this looks to be using a non-UTF-8 multi-byte encoding. </pre> </blockquote> The font="7" attribute is generated by "pdftohtml -xml"; it is a reference to an element near the top of the produced XML document. And yes, there is some font mapping involved. I tried writing the equation in a new .tex document, but the resulting PDF contained only characters I know and can read, no matter how I produced it: pdflatex, latex and dvipdf, etc. <blockquote cite="mid:3f4fd2640911031438p5104e575pac154bfc58fd2336@mail.gmail.com" type="cite"> <pre wrap=""> Someone will need to dig around in the pdftohtml code and the rendering of non-ASCII characters. </pre> </blockquote> I agree this is where the problem begins, though I've never seen pdftohtml's source... Best regards, Piotr </body> </html>