[poppler] pdftohtml sometimes produces invalid xml

Albert Astals Cid aacid at kde.org
Mon Mar 9 16:45:49 PDT 2015


On Monday, 9 March 2015, at 20:36:12, Børre Gaup wrote:
> Hi!

Hi

> 
> pdftohtml -xml sometimes produces invalid xml, resulting in lines like this:
> 
> <text top="218" left="142" width="532" height="16" font="1"><i>lea <b>sadji
> </b></i>(korreláhta),<b> <i>gosa mannat </b>/ Minulla on <b>paikka</b>
> </i>(korreláhta)<b>, <i>jonne </i></b></text>
> 
> In our collection of 1078 pdfs, pdftohtml produces 11 documents with this
> 'opening and ending tag mismatch' error.
> 
> I made some changes in utils/HtmlOutputDev.cc that make those 11 documents
> well-formed and do not break the well-formedness of the other documents.
> 
> The changes I made can be found here:
> https://github.com/albbas/poppler/compare/fix_xml_wellformedness
> 
> I also made a diff (8174 lines) which shows what kind of changes this
> version makes on our 1078 pdf documents compared to pdftohtml 0.30.0. That
> diff can be found here:
> https://github.com/albbas/poppler/blob/fix_xml_wellformedness/all-pdf.diff
> 
> Would you be interested in incorporating these changes into the main branch?

Can you please link to a PDF that shows this error? If you don't have an
internet link, I'd suggest opening a bug at bugs.freedesktop.org and attaching
both the patch and the file there.
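
For anyone who wants to reproduce the check locally, here is a minimal sketch
of testing whether a fragment like the one quoted above is well-formed. It is
not part of poppler: it uses Python's standard xml.etree.ElementTree parser,
and the helper name is_well_formed and both sample strings are assumptions made
for illustration (the "bad" sample is the quoted line joined into one string).

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a single well-formed XML element.

    Hypothetical helper for illustration; it only checks well-formedness,
    not validity against any schema or DTD.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# The fragment from the report: <b> and <i> are closed in the wrong order.
bad = (
    '<text top="218" left="142" width="532" height="16" font="1">'
    '<i>lea <b>sadji </b></i>(korreláhta),<b> <i>gosa mannat </b>'
    '/ Minulla on <b>paikka</b> </i>(korreláhta)<b>, <i>jonne </i></b>'
    '</text>'
)

# A made-up, properly nested fragment for comparison.
good = (
    '<text top="218" left="142" width="532" height="16" font="1">'
    '<i>lea <b>sadji </b></i>(korreláhta)'
    '</text>'
)

if __name__ == "__main__":
    for name, fragment in (("bad", bad), ("good", good)):
        print(name, is_well_formed(fragment))
```

Running the same check over each file produced by pdftohtml -xml would be one
way to count how many documents in a collection hit the tag mismatch.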

Cheers,
  Albert

> 
> Regards,
> Børre Gaup
> 
> _______________________________________________
> poppler mailing list
> poppler at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/poppler


