<html>
<head>
<base href="https://bugs.freedesktop.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - -xml outputs malformed xml"
href="https://bugs.freedesktop.org/show_bug.cgi?id=98305">98305</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>-xml outputs malformed xml
</td>
</tr>
<tr>
<th>Product</th>
<td>poppler
</td>
</tr>
<tr>
<th>Version</th>
<td>unspecified
</td>
</tr>
<tr>
<th>Hardware</th>
<td>x86-64 (AMD64)
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux (All)
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>normal
</td>
</tr>
<tr>
<th>Priority</th>
<td>medium
</td>
</tr>
<tr>
<th>Component</th>
<td>pdftohtml
</td>
</tr>
<tr>
<th>Assignee</th>
<td>poppler-bugs@lists.freedesktop.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>daniel.van.den.ouden@gmail.com
</td>
</tr></table>
<p>
<div>
<pre>Overview:
The following pdf causes pdftohtml to output malformed xml:
<a href="http://www.atmel.com/images/Atmel-8284-8-bit-AVR-microcontroller-ATmega169A_PA_329A_PA_3290A_PA_649A_P_6490A_P_datasheet.pdf">http://www.atmel.com/images/Atmel-8284-8-bit-AVR-microcontroller-ATmega169A_PA_329A_PA_3290A_PA_649A_P_6490A_P_datasheet.pdf</a>
The resulting XML file has multiple similar errors; the first one is on line 71641:
&lt;text top="180" left="71" width="101" height="15" font="11"&gt;&lt;b&gt;Sp&lt;a href="Atmel-8284-8-bit-AVR-microcontroller-ATmega169A_PA_329A_PA_3290A_PA_649A_P_6490A_P_datasheet.html#876"&gt;eed [MHz] &lt;/b&gt;(3)&lt;/a&gt;&lt;/text&gt;
(the closing &lt;/b&gt; and &lt;/a&gt; tags are in the wrong order, so the two elements overlap instead of nesting)
Steps to Reproduce:
1) wget <a href="http://www.atmel.com/images/Atmel-8284-8-bit-AVR-microcontroller-ATmega169A_PA_329A_PA_3290A_PA_649A_P_6490A_P_datasheet.pdf">http://www.atmel.com/images/Atmel-8284-8-bit-AVR-microcontroller-ATmega169A_PA_329A_PA_3290A_PA_649A_P_6490A_P_datasheet.pdf</a>
2) pdftohtml -q -i -xml Atmel-8284-8-bit-AVR-microcontroller-ATmega169A_PA_329A_PA_3290A_PA_649A_P_6490A_P_datasheet.pdf output.xml
Actual Results:
malformed xml
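The first error can be located without opening the file by hand. This is only
a minimal sketch, assuming Python's standard xml.etree.ElementTree parser; the
helper name first_xml_error is made up for illustration and is not part of
poppler.

```python
import xml.etree.ElementTree as ET


def first_xml_error(path):
    """Return (line, column, message) for the first well-formedness
    error in the file at `path`, or None if the file parses cleanly.

    Hypothetical helper for this report, not a poppler tool.
    """
    try:
        ET.parse(path)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple from the parser.
        line, column = err.position
        return line, column, str(err)
    return None
```

Calling first_xml_error("output.xml") on the file from step 2 should return
the position of the first mismatched tag; it stops at the first error, so the
later, similar ones are not listed.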
Expected Results:
Well-formed XML. I'm also not sure the link is attached to the correct piece
of text: in the PDF only the text "(3)" is clickable, and none of it is bold.
Build Date &amp; Hardware:
Built on 2016-10-18 from source (0.48.0) on Ubuntu 14.04 LTS
Additional Builds and Platforms:
Also occurred in the version of pdftohtml that was installed using apt-get
(0.28 if I recall correctly)
Cheers,
Daniel</pre>
</div>
</p>
</body>
</html>