[poppler] XML syntax error in PdfToText tool

Albert Astals Cid aacid at kde.org
Fri Nov 15 11:36:06 PST 2013


On Friday, 15 November 2013, at 19:04:11, suzuki toshiya wrote:
> How about this?

Makes sense.

Committed.

Cheers,
  Albert

> 
> Regards,
> mpsuzuki
> 
> On 11/15/2013 04:26 PM, suzuki toshiya wrote:
> > I'm trying to fix this issue by an insertion of myXmlTokenReplace()
> > into printInfoString().
> > 
> > Regards,
> > mpsuzuki
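
To illustrate the kind of token replacement described above, here is a minimal sketch in Python. It is not poppler's C++ code: the thread names myXmlTokenReplace() and printInfoString() but does not show them, so the helper name `xml_token_replace` and the exact set of characters handled here are assumptions made for illustration.

```python
# Hypothetical helper for illustration only; this is NOT poppler's
# myXmlTokenReplace(), whose implementation is not shown in the thread.
_REPLACEMENTS = (
    ("&", "&amp;"),   # must be first, or it would re-escape the entities below
    ("<", "&lt;"),
    (">", "&gt;"),
    ('"', "&quot;"),  # needed when the string lands inside content="..."
)


def xml_token_replace(text: str) -> str:
    """Return text with XML-special characters replaced by entity references."""
    for raw, entity in _REPLACEMENTS:
        text = text.replace(raw, entity)
    return text


if __name__ == "__main__":
    print(xml_token_replace("Preface&Contents_Advances_in_Lasers_and_Electro_Optics.doc"))
```

The ordering is the one design point worth noting: replacing "&" after the other characters would turn an already produced "&lt;" into "&amp;lt;".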
> > 
> > On 11/14/2013 10:42 PM, Paweł Leń wrote:
> >> This is the contents of file output.xml generated by command pdftotext
> >> -bbox -htmlmeta 'myfile.pdf' 'output.xml' :
> >> 
> >> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> >> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html
> >> xmlns="http://www.w3.org/1999/xhtml"> <head>
> >> <title>Microsoft Word -
> >> Preface&Contents_Advances_in_Lasers_and_Electro_Optics.doc</title> <meta
> >> name="Author" content="Teodora"/>
> >> <meta name="Creator" content="PScript5.dll Version 5.2.2"/>
> >> <meta name="Producer" content="Acrobat Distiller 8.0.0 (Windows)"/>
> >> <meta name="CreationDate" content=""/>
> >> </head>
> >> <body>
> >> <doc>
> >> 
> >>    <page width="482.000000" height="680.000000">
> >>    
> >>      <word xMin="255.120000" yMin="190.576860" xMax="338.055540"
> >>      yMax="207.269700">Advances</word> <word xMin="344.000562"
> >>      yMin="190.576860" xMax="359.331702" yMax="207.269700">in</word>
> >>      <word xMin="365.276724" yMin="190.576860" xMax="425.239584"
> >>      yMax="207.269700">Lasers</word> <word xMin="256.260624"
> >>      yMin="207.256884" xMax="288.954240" yMax="223.949724">and</word>
> >>      <word xMin="294.884844" yMin="207.256884" xMax="363.168492"
> >>      yMax="223.949724">Electro</word> <word xMin="369.099096"
> >>      yMin="207.256884" xMax="425.265216"
> >>      yMax="223.949724">Optics</word>
> >>    
> >>    </page>
> >> 
> >> </doc>
> >> </body>
> >> </html>
> >> 
> >> 
> >> As you can see, the <title> tag on line 3 contains an invalid character
> >> sequence with "&". The title is extracted from myfile.pdf. CDATA or some
> >> kind of htmlspecialchars-style escaping is needed.
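
The failure described above can be reproduced with a short check. This is a sketch that assumes Python's standard xml.etree.ElementTree parser rather than the reporter's own parser; the function name `is_well_formed` is made up for the example, and the title string is the one from the output quoted above.

```python
import xml.etree.ElementTree as ET

# Title element as it appears in the reported output, with a bare "&".
BROKEN = ("<title>Microsoft Word - "
          "Preface&Contents_Advances_in_Lasers_and_Electro_Optics.doc</title>")
# The same element with the ampersand written as an entity reference.
FIXED = BROKEN.replace("&", "&amp;")


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML, False on a parse error."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print("bare ampersand parses:   ", is_well_formed(BROKEN))
    print("escaped ampersand parses:", is_well_formed(FIXED))
```

After parsing the escaped form, the element text contains the original "&" again, so the title itself is unchanged for consumers.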
> >> 
> >> 
> >> 
> >> 
> >> --
> >> Paweł Leń
> >> 
> >> 
> >> 
> >> 2013/11/14 suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp
> >> <mailto:mpsuzuki at hiroshima-u.ac.jp>>
> >> 
> >>     Hi,
> >>     
> >>     If you could post a sample XML file in which you have modified
> >>     the pdftotext output so that the XML parser accepts it, that
> >>     would help someone develop a patch.
> >>     
> >>     Regards,
> >>     mpsuzuki
> >>     
> >>     On 11/14/2013 10:04 PM, Paweł Leń wrote:
> >>         Hello,
> >>         
> >>         I get an error when running:
> >>         pdftotext -bbox -htmlmeta 'myfile.pdf' 'tempFile.xml'
> >>         
> >>         The output XML has a <title> tag at the beginning of the document
> >>         (meta section); the error appears when the title contains an "&"
> >>         character. The title field has no CDATA and is not escaped, so it
> >>         causes an error in my xmllib parser. Can I (or you :) ) fix it somehow?
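
For output from a pdftotext build that does not escape the title, one possible consumer-side workaround is to escape stray ampersands before handing the file to the parser. The sketch below is an assumption-laden heuristic, not something proposed in the thread: the name `escape_bare_ampersands` is hypothetical, and any "&" that already looks like an entity or character reference is left untouched, so a title that literally contains text such as "&D;" would not be repaired.

```python
import re

# Match "&" unless it already starts an entity reference (&name;) or a
# decimal/hexadecimal character reference (&#38; or &#x26;).
_BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(markup: str) -> str:
    """Replace stray '&' characters with '&amp;', keeping existing references."""
    return _BARE_AMP.sub("&amp;", markup)


if __name__ == "__main__":
    line = "<title>Microsoft Word - Preface&Contents.doc</title>"
    print(escape_bare_ampersands(line))
```

This only addresses "&"; a title containing a literal "<" would still need escaping at the source.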
> >>         
> >>         Best regards
> >>         
> >>         --
> >>         Paweł Leń
> >>         
> >>         
> >>         
> >>         _________________________________________________
> >>         poppler mailing list
> >>         poppler at lists.freedesktop.org
> >>         <mailto:poppler at lists.freedesktop.org>
> >>         http://lists.freedesktop.org/mailman/listinfo/poppler


