[poppler] XML syntax error in PdfToText tool

Paweł Leń pawel.len at gmail.com
Fri Nov 29 06:58:20 PST 2013


Hello :)

Everything works fine, thank you very much!

Best Regards


*--*

*Paweł Leń*


2013/11/15 suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp>

> How about this?
>
> Regards,
> mpsuzuki
>
>
> On 11/15/2013 04:26 PM, suzuki toshiya wrote:
>
>> I'm trying to fix this issue by inserting a call to myXmlTokenReplace()
>> into printInfoString().
>>
>> Regards,
>> mpsuzuki
>>
>> On 11/14/2013 10:42 PM, Paweł Leń wrote:
>>
>>> These are the contents of the file output.xml generated by the command
>>> pdftotext -bbox -htmlmeta 'myfile.pdf' 'output.xml':
>>>
>>> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
>>> <head>
>>> <title>Microsoft Word - Preface&Contents_Advances_in_Lasers_and_Electro_Optics.doc</title>
>>> <meta name="Author" content="Teodora"/>
>>> <meta name="Creator" content="PScript5.dll Version 5.2.2"/>
>>> <meta name="Producer" content="Acrobat Distiller 8.0.0 (Windows)"/>
>>> <meta name="CreationDate" content=""/>
>>> </head>
>>> <body>
>>> <doc>
>>>    <page width="482.000000" height="680.000000">
>>>      <word xMin="255.120000" yMin="190.576860" xMax="338.055540" yMax="207.269700">Advances</word>
>>>      <word xMin="344.000562" yMin="190.576860" xMax="359.331702" yMax="207.269700">in</word>
>>>      <word xMin="365.276724" yMin="190.576860" xMax="425.239584" yMax="207.269700">Lasers</word>
>>>      <word xMin="256.260624" yMin="207.256884" xMax="288.954240" yMax="223.949724">and</word>
>>>      <word xMin="294.884844" yMin="207.256884" xMax="363.168492" yMax="223.949724">Electro</word>
>>>      <word xMin="369.099096" yMin="207.256884" xMax="425.265216" yMax="223.949724">Optics</word>
>>>    </page>
>>> </doc>
>>> </body>
>>> </html>
>>>
>>>
>>> As you can see, the <title> tag on line 3 contains an invalid character
>>> sequence with "&". The title is extracted from myfile.pdf. CDATA or some
>>> kind of htmlspecialchars-style escaping is needed.
>>>
>>>
>>>
>>>
>>> *--
>>> *
>>>
>>> *Paweł Leń*
>>>
>>>
>>>
>>> 2013/11/14 suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp>
>>>
>>>     Hi,
>>>
>>>     If you could post a sample XML file, i.e. the output of
>>>     pdftotext modified so that your XML parser accepts it, it
>>>     would help whoever is kind enough to develop a patch.
>>>
>>>     Regards,
>>>     mpsuzuki
>>>
>>>
>>>     On 11/14/2013 10:04 PM, Paweł Leń wrote:
>>>
>>>         Hello,
>>>
>>>         I get an error when running:
>>>         pdftotext -bbox -htmlmeta 'myfile.pdf' 'tempFile.xml'
>>>
>>>         The output XML has a <title> tag at the beginning of the document
>>>         (meta section); the error appears when the title contains the "&"
>>>         character. The title field has no CDATA and is not escaped, so it
>>>         causes an error in my xmllib parser. Can I (or you :) ) fix it somehow?
>>>
>>>         Best regards
>>>
>>>         *--
>>>         *
>>>
>>>         *Paweł Leń*
>>>
>>>
>>>
>>>         _______________________________________________
>>>         poppler mailing list
>>>         poppler at lists.freedesktop.org
>>>         http://lists.freedesktop.org/mailman/listinfo/poppler
>>>
>>>
>>>
>>>
>
>
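The escaping discussed in this thread can be illustrated with a minimal C++ sketch. It is not poppler's actual myXmlTokenReplace() or printInfoString() code; the helper name escapeXmlText is hypothetical and chosen only for this example. It assumes the title is available as a std::string and replaces the five characters reserved in XML with their predefined entities before the text is written between <title> and </title>.

```cpp
#include <string>

// Hypothetical helper for illustration only (not poppler's implementation):
// returns a copy of `in` in which the five XML-reserved characters are
// replaced by their predefined entities.
std::string escapeXmlText(const std::string &in)
{
    std::string out;
    out.reserve(in.size()); // the result is at least as long as the input
    for (char c : in) {
        switch (c) {
        case '&':  out += "&amp;";  break;
        case '<':  out += "&lt;";   break;
        case '>':  out += "&gt;";   break;
        case '"':  out += "&quot;"; break;
        case '\'': out += "&apos;"; break;
        default:   out += c;        break; // all other bytes pass through unchanged
        }
    }
    return out;
}
```

Applied to the title from the sample output, a helper like this would produce Microsoft Word - Preface&amp;Contents_Advances_in_Lasers_and_Electro_Optics.doc, which an XML parser accepts. Escaping is sketched here rather than CDATA because a CDATA section would still need special handling if the title itself contained "]]>".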