[poppler] pdftohtml sometimes produces invalid xml

Albert Astals Cid aacid at kde.org
Wed Mar 11 15:50:27 PDT 2015


On Tuesday, 10 March 2015, at 10:17:12, Børre Gaup wrote:
> On Tuesday, 10 March 2015, at 00.45.49, Albert Astals Cid wrote:
> > On Monday, 9 March 2015, at 20:36:12, Børre Gaup wrote:
> > > Hi!
> > 
> > Hi
> > 
> > > pdftohtml -xml sometimes produces invalid XML, resulting in lines
> > > like this:
> > > 
> > > <text top="218" left="142" width="532" height="16" font="1"><i>lea
> > > <b>sadji
> > > </b></i>(korreláhta),<b> <i>gosa mannat </b>/ Minulla on <b>paikka</b>
> > > </i>(korreláhta)<b>, <i>jonne </i></b></text>
> > > 
> > > In our collection of 1078 PDFs, pdftohtml produces 11 documents with
> > > this 'opening and ending tag mismatch' error.
> > > 
> > > I made some changes in utils/HtmlOutputDev.cc that make those 11
> > > documents well-formed and do not break the well-formedness of the
> > > other documents.
> > > 
> > > The changes I made can be found here:
> > > https://github.com/albbas/poppler/compare/fix_xml_wellformedness
> > > 
> > > I also made a diff (8174 lines) which shows what kind of changes this
> > > version makes on our 1078 PDF documents compared to pdftohtml 0.30.0.
> > > That diff can be found here:
> > > https://github.com/albbas/poppler/blob/fix_xml_wellformedness/all-pdf.diff
> > > 
> > > Would you be interested in incorporating these changes into the main
> > > branch?
> > 
> > Can you please link to a PDF with such an error? (If you don't have an
> > internet link, I'd suggest opening a bug at bugs.freedesktop.org and
> > attaching both the patch and the file there.)
> 
> Here are a couple of links:
> http://www.samediggi.se/31961
> http://www.samisk.no/attachments/129_Tjaalege_%20J%C3%AFengesne%20h%C3%A5agkodh.pdf

Could you actually please open a bug? It's much easier for me to track all the
missing things I have to do with bug numbers than over mailing list subjects.

Thanks!

Albert
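
For anyone who wants to reproduce the 'opening and ending tag mismatch'
error from the quoted report without the full PDF collection, here is a
minimal sketch. It assumes Python with its standard-library
xml.etree.ElementTree parser; the helper name is_well_formed is made up for
illustration and is not part of poppler. It only checks well-formedness of
the fragment quoted above and says nothing about how pdftohtml should be
fixed.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses as XML, else (False, message).

    Hypothetical helper for illustration only; not part of poppler.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


# The fragment from the report, joined onto one line. The second <b> is
# closed by </b> while the <i> opened inside it is still open.
broken = (
    '<text top="218" left="142" width="532" height="16" font="1"><i>lea '
    '<b>sadji </b></i>(korreláhta),<b> <i>gosa mannat </b>/ Minulla on '
    '<b>paikka</b> </i>(korreláhta)<b>, <i>jonne </i></b></text>'
)

ok, message = is_well_formed(broken)
print(ok, message)
```

The same check could be looped over the files written by pdftohtml -xml to
list which documents in a collection are affected.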
