[poppler] XML syntax error in PdfToText tool

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Mon Dec 2 01:00:57 PST 2013


I'm glad to hear that the situation has improved.
At present, I'm not familiar with how other characters,
such as non-ASCII Latin characters, are embedded by popular
PDF production workflows, or how they should be handled
in poppler. If you run into similar trouble in the future,
please post to this list!
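
For anyone reading this thread later: the fix discussed below comes
down to escaping the XML special characters in the metadata strings
before they are written into <title> and <meta>. The following is only
a minimal sketch of that idea, not the actual poppler patch; escapeXml()
is a hypothetical name chosen for illustration and is not part of poppler.

```cpp
#include <string>

// Hypothetical helper, for illustration only (not poppler code).
// Returns a copy of `in` with the five XML special characters
// replaced by their predefined entities. Bytes of multi-byte UTF-8
// sequences are never in the ASCII range, so they pass through unchanged.
std::string escapeXml(const std::string &in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
        case '&':  out += "&amp;";  break;
        case '<':  out += "&lt;";   break;
        case '>':  out += "&gt;";   break;
        case '"':  out += "&quot;"; break;
        case '\'': out += "&apos;"; break;
        default:   out += c;        break;
        }
    }
    return out;
}
```

With such a helper, a title like "Preface&Contents" would be written as
"Preface&amp;Contents", which an XML parser accepts.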

Regards,
mpsuzuki

On 11/29/2013 11:58 PM, Paweł Leń wrote:
> Hello :)
>
> Everything works fine, thank You very much!
>
> Best Regards
>
> --
> Paweł Leń
>
>
>
> 2013/11/15 suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp>
>
>     How about this?
>
>     Regards,
>     mpsuzuki
>
>
>     On 11/15/2013 04:26 PM, suzuki toshiya wrote:
>
>         I'm trying to fix this issue by inserting a call to
>         myXmlTokenReplace() into printInfoString().
>
>         Regards,
>         mpsuzuki
>
>         On 11/14/2013 10:42 PM, Paweł Leń wrote:
>
>             These are the contents of the file output.xml generated by the command pdftotext -bbox -htmlmeta 'myfile.pdf' 'output.xml':
>
>             <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
>             <head>
>             <title>Microsoft Word - Preface&Contents_Advances_in_Lasers_and_Electro_Optics.doc</title>
>             <meta name="Author" content="Teodora"/>
>             <meta name="Creator" content="PScript5.dll Version 5.2.2"/>
>             <meta name="Producer" content="Acrobat Distiller 8.0.0 (Windows)"/>
>             <meta name="CreationDate" content=""/>
>             </head>
>             <body>
>             <doc>
>                 <page width="482.000000" height="680.000000">
>                   <word xMin="255.120000" yMin="190.576860" xMax="338.055540" yMax="207.269700">Advances</word>
>                   <word xMin="344.000562" yMin="190.576860" xMax="359.331702" yMax="207.269700">in</word>
>                   <word xMin="365.276724" yMin="190.576860" xMax="425.239584" yMax="207.269700">Lasers</word>
>                   <word xMin="256.260624" yMin="207.256884" xMax="288.954240" yMax="223.949724">and</word>
>                   <word xMin="294.884844" yMin="207.256884" xMax="363.168492" yMax="223.949724">Electro</word>
>                   <word xMin="369.099096" yMin="207.256884" xMax="425.265216" yMax="223.949724">Optics</word>
>                 </page>
>             </doc>
>             </body>
>             </html>
>
>
>             As you can see, the <title> tag on line 3 contains an invalid character sequence with "&". The title is extracted from myfile.pdf. CDATA or some kind of htmlspecialchars-style escaping is needed.
>
>
>
>
>             --
>             Paweł Leń
>
>
>
>             2013/11/14 suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp>
>
>                  Hi,
>
>                  If you could post a sample XML file, that is, the
>                  output of pdftotext modified so that your XML parser
>                  accepts it, it would help some kind person develop a patch.
>
>                  Regards,
>                  mpsuzuki
>
>
>                  On 11/14/2013 10:04 PM, Paweł Leń wrote:
>
>                      Hello,
>
>                      I get an error when running:
>                      pdftotext -bbox -htmlmeta 'myfile.pdf' 'tempFile.xml'
>
>                      The output XML has a <title> tag at the beginning of the document (meta section), and the error appears when the title contains the "&" character. The title field has no CDATA and is not escaped, so it causes an error in my xmllib parser. Can I (or you :) ) fix it somehow?
>
>                      Best regards
>
>                      --
>                      Paweł Leń
>
>
>
>                      ___________________________________________________
>                      poppler mailing list
>                      poppler at lists.freedesktop.org
>                      http://lists.freedesktop.org/mailman/listinfo/poppler
>
