[poppler] XML syntax error in PdfToText tool

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Fri Nov 15 02:04:11 PST 2013


How about the attached patch?

Regards,
mpsuzuki

On 11/15/2013 04:26 PM, suzuki toshiya wrote:
> I'm trying to fix this issue by inserting a call to myXmlTokenReplace()
> into printInfoString().
>
> Regards,
> mpsuzuki
>
> On 11/14/2013 10:42 PM, Paweł Leń wrote:
>> This is the content of the file output.xml generated by the command pdftotext -bbox -htmlmeta 'myfile.pdf' 'output.xml':
>>
>> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
>> <head>
>> <title>Microsoft Word - Preface&Contents_Advances_in_Lasers_and_Electro_Optics.doc</title>
>> <meta name="Author" content="Teodora"/>
>> <meta name="Creator" content="PScript5.dll Version 5.2.2"/>
>> <meta name="Producer" content="Acrobat Distiller 8.0.0 (Windows)"/>
>> <meta name="CreationDate" content=""/>
>> </head>
>> <body>
>> <doc>
>>    <page width="482.000000" height="680.000000">
>>      <word xMin="255.120000" yMin="190.576860" xMax="338.055540" yMax="207.269700">Advances</word>
>>      <word xMin="344.000562" yMin="190.576860" xMax="359.331702" yMax="207.269700">in</word>
>>      <word xMin="365.276724" yMin="190.576860" xMax="425.239584" yMax="207.269700">Lasers</word>
>>      <word xMin="256.260624" yMin="207.256884" xMax="288.954240" yMax="223.949724">and</word>
>>      <word xMin="294.884844" yMin="207.256884" xMax="363.168492" yMax="223.949724">Electro</word>
>>      <word xMin="369.099096" yMin="207.256884" xMax="425.265216" yMax="223.949724">Optics</word>
>>    </page>
>> </doc>
>> </body>
>> </html>
>>
>>
>> As you can see, the <title> tag on line 3 contains an invalid character sequence with "&". The title is extracted from myfile.pdf. CDATA or some kind of htmlspecialchars-style escaping is needed.
>>
>> --
>> Paweł Leń
>>
>> 2013/11/14 suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp>
>>
>>     Hi,
>>
>>     If you could post a sample XML file in which you have modified
>>     the pdftotext output so that the XML parser accepts it, it
>>     would help anyone willing to develop a patch.
>>
>>     Regards,
>>     mpsuzuki
>>
>>
>>     On 11/14/2013 10:04 PM, Paweł Leń wrote:
>>
>>         Hello,
>>
>>         I get an error when running:
>>         pdftotext -bbox -htmlmeta 'myfile.pdf' 'tempFile.xml'
>>
>>         The output XML has a <title> tag at the beginning of the document (meta section), and the error appears when the title contains an "&" character. The title field is neither wrapped in CDATA nor escaped, so it causes an error in my xmllib parser. Can I (or you :) ) fix it somehow?
>>
>>         Best regards
>>
>>         --
>>         Paweł Leń
>>
>>         _________________________________________________
>>         poppler mailing list
>>         poppler at lists.freedesktop.org
>>         http://lists.freedesktop.org/mailman/listinfo/poppler
>>
>>
>>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: pdftotext_callXmlTokenReplaceInHtmlInfo.diff
Type: text/x-patch
Size: 708 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20131115/ea983677/attachment.bin>
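
The attached patch is not reproduced in this archive, so the following is only a rough sketch of the escaping discussed in the thread, not the patch itself. It is written in Python rather than poppler's C++, and title_element() is a hypothetical helper invented for the illustration. It shows that the reported title breaks a strict XML parser when inserted as-is, and parses correctly once "&", "<" and ">" are replaced by entities.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def title_element(raw_title: str) -> str:
    """Hypothetical helper: build a <title> element with escaped text."""
    # escape() replaces "&", "<" and ">" with "&amp;", "&lt;" and "&gt;".
    return "<title>" + escape(raw_title) + "</title>"


# Title string taken from the output.xml quoted in the thread.
raw = "Microsoft Word - Preface&Contents_Advances_in_Lasers_and_Electro_Optics.doc"

# Unescaped, as in the reported output: a strict XML parser rejects it.
try:
    ET.fromstring("<title>" + raw + "</title>")
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False

# Escaped: the parser accepts it and returns the original text unchanged.
escaped = title_element(raw)
parsed_text = ET.fromstring(escaped).text
```

The same replacement would also need to be applied to attribute values such as the content of the <meta> tags, where the quote character must be escaped as well; xml.sax.saxutils.quoteattr() covers that case in Python.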

