[poppler] pdftohtml sometimes produces invalid xml

Børre Gaup borre.gaup at uit.no
Mon Mar 9 12:36:12 PDT 2015


Hi!

pdftohtml -xml sometimes produces XML that is not well-formed, with lines like this:

<text top="218" left="142" width="532" height="16" font="1"><i>lea <b>sadji 
</b></i>(korreláhta),<b> <i>gosa mannat </b>/ Minulla on <b>paikka</b> 
</i>(korreláhta)<b>, <i>jonne </i></b></text>

In our collection of 1078 PDFs, pdftohtml produces 11 documents with this
'opening and ending tag mismatch' error.
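
A check of this kind can be sketched in Python. This is only an illustration using the standard library's xml.etree.ElementTree parser, not necessarily how the 11 documents were found; the function name is_well_formed and the shortened "good" sample are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# The line from the report, joined into a single string.
bad = (
    '<text top="218" left="142" width="532" height="16" font="1">'
    '<i>lea <b>sadji </b></i>(korreláhta),<b> <i>gosa mannat </b>'
    '/ Minulla on <b>paikka</b> </i>(korreláhta)<b>, <i>jonne </i></b></text>'
)

# A shortened, properly nested variant, made up for comparison.
good = '<text font="1"><i>lea <b>sadji </b></i>(korreláhta)</text>'

print(is_well_formed(bad))
print(is_well_formed(good))
```

The first string fails because </b> closes while the inner <i> is still open. The same function could be run over every file that pdftohtml -xml writes to count the affected documents.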

I made some changes in utils/HtmlOutputDev.cc that make those 11 documents
well-formed and do not break the well-formedness of the other documents.

The changes are here:
https://github.com/albbas/poppler/compare/fix_xml_wellformedness

I also made a diff (8174 lines) that shows what this version changes in the
output for our 1078 PDF documents, compared with pdftohtml 0.30.0. The diff is
here: https://github.com/albbas/poppler/blob/fix_xml_wellformedness/all-pdf.diff

Would you be interested in incorporating these changes into the main branch?

Regards,
Børre Gaup


