[Poppler-bugs] [Bug 24890] pdftohtml -xml produces invalid XML markup

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Wed Nov 4 00:58:26 PST 2009


http://bugs.freedesktop.org/show_bug.cgi?id=24890





--- Comment #2 from Reece H. Dunn <msclrhd at gmail.com>  2009-11-04 00:58:25 PST ---
In the generated html and/or xml file (the html output is similarly affected),
control characters (and other invalid xml characters) are being written out to
the file.

The XML version will cause an XML parser to generate an error when it
encounters these characters (it first chokes on a 0x11 control character).
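
As a rough illustration of why a parser rejects such output, here is a minimal
Python sketch that scans a string for code points outside the XML 1.0 Char
production. The helper names (is_valid_xml_char, find_invalid_xml_chars) and
the sample string are made up for this example and are not taken from
pdftohtml or its output.

```python
import xml.etree.ElementTree as ET


def is_valid_xml_char(ch):
    """Return True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_invalid_xml_chars(text):
    """List (offset, hex code point) for every character XML 1.0 forbids."""
    return [
        (i, hex(ord(ch)))
        for i, ch in enumerate(text)
        if not is_valid_xml_char(ch)
    ]


# Hypothetical sample: a text node containing a raw 0x11 control character.
sample = "<text>\x11 x</text>"
print(find_invalid_xml_chars(sample))

try:
    ET.fromstring(sample)
except ET.ParseError as err:
    # The parser refuses the document because of the control character.
    print("parse error:", err)
```

Running a check like this over the generated file would show where the
offending characters sit before handing the file to a real parser.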

The HTML version loads the page, but displays garbage instead of the integral
equation (same as viewing either file in a text editor).

a/ My initial thought was that the characters (such as the integral sign) in
the integral part of the equation were being written as Unicode. For example,
U+222B is the Unicode code point for the integral sign (∫) [1]. If that were
the case, though, the UTF-8 forms would be written out and the result would be
valid XML and HTML output, which is not what happens.

b/ My next thought was that the control characters in the ASCII and Unicode
code pages corresponded to the correct glyphs when using font 7 (e.g. if the
font was a special mathematical font). This is not right, as the font is the
Times font (looking at the font definition), and the number of characters in
that element doesn't match the number of characters in the equation (thus
indicating a multi-byte encoding is being used).

c/ Another possibility that has occurred to me is that what is shown is the raw
UTF-8 byte sequence, and that is being UTF-8 encoded a second time! Removing the UTF-8
encoding in the html version, I get " ¤£" instead of " ¤£" for one
of the outputted text nodes.

My current thinking (without looking at the code yet, just from examining the
behaviour) is that (c) is the most probable cause of this error.
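
To make hypothesis (c) concrete, here is a small Python sketch of what double
UTF-8 encoding does to the integral sign. It assumes the already-encoded bytes
are misread one byte per character (Latin-1 is used here as a stand-in for
that step); the function names are invented for the example, and none of this
has been checked against the pdftohtml code.

```python
def double_encode(text):
    """Encode text as UTF-8, misread the bytes as Latin-1, encode again."""
    raw = text.encode("utf-8")        # correct UTF-8 bytes
    misread = raw.decode("latin-1")   # each byte treated as one character
    return misread.encode("utf-8")    # second, unwanted UTF-8 encoding


def undo_double_encode(data):
    """Reverse double_encode: strip the outer layer of UTF-8 encoding."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")


integral = "\u222b"  # U+222B INTEGRAL
print(integral.encode("utf-8"))      # 3 bytes when encoded once
print(double_encode(integral))       # 6 bytes when encoded twice
print(undo_double_encode(double_encode(integral)) == integral)
```

Under this assumption the single character turns into three characters in the
output, one of which is a control character, so the character counts no longer
match the equation.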

[1] http://en.wikipedia.org/wiki/Integral_symbol

