<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#ffffff">
Hello! <br>
<br>
On 03.11.2009 23:38, Reece Dunn wrote:
<blockquote
cite="mid:3f4fd2640911031438p5104e575pac154bfc58fd2336@mail.gmail.com"
type="cite"><br>
<blockquote type="cite">
<pre wrap="">
# wget -q
<a class="moz-txt-link-freetext" href="http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf">http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf</a> && \
pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \
python -c 'from xml.parsers.expat import ParserCreate;
ParserCreate().ParseFile(open("x.xml"))'
</pre>
</blockquote>
<pre wrap="">
I'm not sure what the fix is, but the line with the error is:
<text top="632" left="152" width="58" height="0"
font="7">¥§¦©¨¥§¦¨ ¦</text>
and firefox gives:
<text top="632" left="152" width="58" height="0"
font="7">¥§¦©¨¥§¦¨ ¦</text>
---------------------------------------------------------------^
(that is -- it is choking on the [00|11] character; there are also
other characters in the latin-1 control character range (c < 0x20)).
</pre>
</blockquote>
Right. 0x11 is the <i>first</i> one to cause a problem with the Python XML
parser.<br>
<br>
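A minimal sketch of how the offending characters could be located and dropped before parsing. It assumes the XML 1.0 rule that the only control characters allowed below 0x20 are tab, line feed and carriage return; the helper names are hypothetical and not part of pdftohtml or expat.<br>
<pre wrap="">
```python
import re

# Control characters that XML 1.0 forbids: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
XML10_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def first_illegal_char(text):
    """Return (offset, code point) of the first XML 1.0-illegal control
    character in text, or None if there is none."""
    match = XML10_ILLEGAL.search(text)
    if match is None:
        return None
    return match.start(), ord(match.group())


def strip_illegal_chars(text):
    """Return text with the illegal control characters removed."""
    return XML10_ILLEGAL.sub("", text)
```
</pre>
Running strip_illegal_chars over the contents of x.xml before handing it to the parser only works around the symptom; the stripped characters are lost, so the equation text is still not recovered.<br>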
<br>
<blockquote
cite="mid:3f4fd2640911031438p5104e575pac154bfc58fd2336@mail.gmail.com"
type="cite">
<pre wrap="">
My initial thought is that the characters are referencing the Unicode
codepoints (e.g. in the U+2100 range). However, these all appear to be
in the ascii range (i.e. not multi-byte UTF-8 as the encoding
suggests, but I may be wrong as there look to be more characters than
what is displayed).
</pre>
</blockquote>
These problematic characters are all ASCII control characters.<br>
<br>
<br>
<blockquote
cite="mid:3f4fd2640911031438p5104e575pac154bfc58fd2336@mail.gmail.com"
type="cite">
<pre wrap="">
Instead, they look like they are codepoints into a special
mathematical font (e.g. Symbol(?) in Windows (I don't have a Windows
box to hand at the moment, so can't verify the font name)). This would
make sense given the font="7" attribute and the seemingly random
characters. And given the greater number of characters, this looks to
be using a non-UTF8 multi-byte encoding.
</pre>
</blockquote>
The font="7" attribute is generated by "pdftohtml -xml"; it is a reference
to the<br>
<font id="7" ...... /> element near the top of the produced XML
document.<br>
<br>
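A rough sketch of how that reference could be followed in Python. It assumes the control characters have already been removed so the document parses, that text elements are named "text" as in the excerpt above, and that the font element carries an "id" attribute. The font element's tag name is passed in as a parameter because its exact form is elided here; both function names are hypothetical.<br>
<pre wrap="">
```python
import xml.etree.ElementTree as ET


def font_table(xml_text, font_tag):
    """Map each font id to the attributes of its font element."""
    root = ET.fromstring(xml_text)
    return {el.get("id"): dict(el.attrib) for el in root.iter(font_tag)}


def texts_using_font(xml_text, font_id):
    """Return the text of every "text" element whose font attribute
    equals font_id."""
    root = ET.fromstring(xml_text)
    return [
        "".join(el.itertext())
        for el in root.iter("text")
        if el.get("font") == font_id
    ]
```
</pre>
Listing the attributes stored for font "7" next to the strings that use it should show whether the unreadable text is confined to that one font.<br>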
And yes, there is some font mapping involved. I tried writing the
equation in a new .tex document, but the resulting PDF contained only
characters I know and can read,<br>
no matter how I produced the PDF (pdflatex, latex and dvipdf, etc.).<br>
<blockquote
cite="mid:3f4fd2640911031438p5104e575pac154bfc58fd2336@mail.gmail.com"
type="cite">
<pre wrap="">
Someone will need to dig around in the pdftohtml code and the
rendering of non-ascii characters.
</pre>
</blockquote>
I agree this is where the problem begins, though I've never seen
pdftohtml's source...<br>
<br>
best regards,<br>
Piotr<br>
<br>
<br>
</body>
</html>