No subject


Wed Nov 4 00:20:15 PST 2009


I'm not sure what the fix is, but the line with the error is:
   <text top="632" left="152" width="58" height="0" font="7">¥§¦©¨   ¥§    ¦ ¨   ¦</text>
and Firefox gives:
   <text top="632" left="152" width="58" height="0" font="7">¥§¦©¨   ¥§    ¦ ¨   ¦</text>
   ---------------------------------------------------------------^
(that is -- it is choking on the [00|11] character; there are also
other characters in the latin-1 control character range (c < 0x20)).

This will cause any XML parser to choke, as XML 1.0 does not allow
control characters below 0x20 other than tab, line feed and carriage
return. What I don't know is why/how these are appearing in the
pdftohtml output.
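
A minimal sketch of how the offending characters could be located,
assuming the pdftohtml output has already been read into a Python
string. The helper name and the sample string are hypothetical, and the
pattern only covers the control characters below 0x20 that XML 1.0
forbids, not every character the specification excludes:

```python
import re

# C0 control characters that XML 1.0 does not allow: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_invalid_xml_chars(text):
    """Return (offset, code point) pairs for forbidden control characters."""
    return [(m.start(), ord(m.group())) for m in INVALID_XML_CHARS.finditer(text)]


# Hypothetical sample: a <text> element containing a U+0011 control character.
sample = '<text font="7">\u00a5\u00a7\x11\u00a6</text>'
for offset, code_point in find_invalid_xml_chars(sample):
    print(f"offset {offset}: U+{code_point:04X}")
```

Running something like this over the generated file would at least show
how many such characters there are and where they sit.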

Looking at the PDF in Okular, which appears to render it correctly,
shows a mathematical equation at the faulty lines. Specifically:

<text top="606" left="101" width="173" height="10" font="6">Digital signal processing basic formula:</text>
<text top="632" left="101" width="25" height="10" font="6">y(t) =</text>
<text top="626" left="133" width="0" height="0" font="7"> </text>
<text top="631" left="133" width="0" height="0" font="7">¡</text>
<text top="647" left="128" width="0" height="0" font="7">¢</text>
<text top="646" left="134" width="11" height="0" font="7"> ¤£</text>
<text top="632" left="152" width="58" height="0" font="7">¥§¦©¨   ¥§    ¦ ¨   ¦</text>

should be (in the proper math layout for this formula):

   y(t) = integral [above: inf, below: -inf] h(u)x(t - u)du

where the h(u)x(t - u)du bit is in the stylised script used in maths.

My initial thought was that the characters reference Unicode
codepoints (e.g. in the U+2100 range). However, they all appear to be
in the ASCII range (i.e. not multi-byte UTF-8 as the encoding
suggests), though I may be wrong, as there look to be more characters
than are displayed.

Instead, they look like codepoints into a special mathematical font
(e.g. Symbol(?) on Windows; I don't have a Windows box to hand at the
moment, so can't verify the font name). This would make sense given
the font="7" attribute and the seemingly random characters. And given
the greater number of characters, this looks to be using a non-UTF-8
multi-byte encoding.
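
One way to test these guesses is to decode the bytes and list the
resulting codepoints. The sketch below assumes the output really is
UTF-8 as declared, and uses the bytes of the first five characters of
the faulty element (C2 A5, C2 A7, C2 A6, C2 A9, C2 A8). The `describe`
helper is hypothetical:

```python
import unicodedata

# First five characters of the faulty <text> element, as raw bytes.
raw = b"\xc2\xa5\xc2\xa7\xc2\xa6\xc2\xa9\xc2\xa8"


def describe(data, encoding="utf-8"):
    """Decode data and return (codepoint, Unicode name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in data.decode(encoding)
    ]


for code_point, name in describe(raw):
    print(code_point, name)
```

If each two-byte sequence decodes to a single codepoint whose Unicode
name has nothing to do with the formula, that would be consistent with
the values being indices into a symbol font rather than real text,
though it would not prove it.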

