<div dir="ltr">Hello :)<br><br>Everything works fine, thank you very much!<br><br>Best Regards</div><div class="gmail_extra"><br clear="all"><div><div style="font-family:Arial,Tahoma,Verdana,sans-serif;font-size:14px;color:#111111">


        <p style="font-size:18px;margin:0pt"><b>--<br></b></p><p style="font-size:18px;margin:0pt"><b>Paweł Leń</b></p></div></div>
<br><br><div class="gmail_quote">2013/11/15 suzuki toshiya <span dir="ltr"><<a href="mailto:mpsuzuki@hiroshima-u.ac.jp" target="_blank">mpsuzuki@hiroshima-u.ac.jp</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

How about this?<br>
<br>
Regards,<br>
mpsuzuki<div><div class="h5"><br>
<br>
On 11/15/2013 04:26 PM, suzuki toshiya wrote:<br>
</div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5">
I'm trying to fix this issue by inserting a call to myXmlTokenReplace()<br>
into printInfoString().<br>
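<br>
The idea of escaping the title before it is written out can be sketched as follows. This is Python rather than poppler's C++, and escape_for_xml is a hypothetical name, not a poppler function; it only illustrates the kind of replacement an XML-escaping helper would be expected to perform.<br>

```python
import html
import xml.etree.ElementTree as ET

def escape_for_xml(text):
    """Hypothetical helper: escape the characters that are special in XML."""
    # html.escape rewrites the ampersand and both angle brackets;
    # quote=True also rewrites double and single quotes.
    return html.escape(text, quote=True)

# Title taken from the report in this thread.
raw_title = "Microsoft Word - Preface&Contents_Advances_in_Lasers_and_Electro_Optics.doc"
element = "<title>" + escape_for_xml(raw_title) + "</title>"

# The escaped element parses, and the parser hands back the original text.
parsed = ET.fromstring(element)
assert parsed.text == raw_title
```

With the bare "&" replaced by an entity, a strict XML parser should accept the element and still return the original title.<br>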
<br>
Regards,<br>
mpsuzuki<br>
<br>
On 11/14/2013 10:42 PM, Paweł Leń wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
These are the contents of output.xml, generated by the command pdftotext -bbox -htmlmeta 'myfile.pdf' 'output.xml':<br>
<br>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "<a href="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" target="_blank">http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd</a>"><html xmlns="<a href="http://www.w3.org/1999/xhtml" target="_blank">http://www.w3.org/1999/xhtml</a>"><br>


<head><br>
<title>Microsoft Word - Preface&Contents_Advances_in_Lasers_and_Electro_Optics.doc</title><br>
<meta name="Author" content="Teodora"/><br>
<meta name="Creator" content="PScript5.dll Version 5.2.2"/><br>
<meta name="Producer" content="Acrobat Distiller 8.0.0 (Windows)"/><br>
<meta name="CreationDate" content=""/><br>
</head><br>
<body><br>
<doc><br>
   <page width="482.000000" height="680.000000"><br>
     <word xMin="255.120000" yMin="190.576860" xMax="338.055540" yMax="207.269700">Advances</word><br>
     <word xMin="344.000562" yMin="190.576860" xMax="359.331702" yMax="207.269700">in</word><br>
     <word xMin="365.276724" yMin="190.576860" xMax="425.239584" yMax="207.269700">Lasers</word><br>
     <word xMin="256.260624" yMin="207.256884" xMax="288.954240" yMax="223.949724">and</word><br>
     <word xMin="294.884844" yMin="207.256884" xMax="363.168492" yMax="223.949724">Electro</word><br>
     <word xMin="369.099096" yMin="207.256884" xMax="425.265216" yMax="223.949724">Optics</word><br>


   </page><br>
</doc><br>
</body><br>
</html><br>
<br>
<br>
As you can see, the <title> tag on line 3 contains an invalid character sequence: a bare "&". The title is extracted from myfile.pdf. CDATA or some kind of htmlspecialchars-style escaping is needed.<br>
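<br>
The CDATA option could look roughly like this. It is a sketch in Python, and wrap_in_cdata is a hypothetical name, not part of pdftotext. The one subtlety is that the title must not contain the CDATA terminator itself, so any occurrence is split across two sections.<br>

```python
import xml.etree.ElementTree as ET

CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"

def wrap_in_cdata(text):
    """Hypothetical helper: wrap text in CDATA so that "&" needs no escaping."""
    # A literal terminator inside the text would end the section early,
    # so split it: close after "]]" and reopen before ">".
    safe = text.replace(CDATA_CLOSE, "]]" + CDATA_CLOSE + CDATA_OPEN + ">")
    return CDATA_OPEN + safe + CDATA_CLOSE

# Title taken from the output shown above.
raw_title = "Microsoft Word - Preface&Contents_Advances_in_Lasers_and_Electro_Optics.doc"
element = "<title>" + wrap_in_cdata(raw_title) + "</title>"

# The parser returns the title unchanged, ampersand included.
assert ET.fromstring(element).text == raw_title
```

Entity escaping is usually the simpler of the two, since it needs no special case for the terminator.<br>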
<br>
<br>
<br>
<br>
--<br>
Paweł Leń<br>
<br>
<br>
<br>
2013/11/14 suzuki toshiya <<a href="mailto:mpsuzuki@hiroshima-u.ac.jp" target="_blank">mpsuzuki@hiroshima-u.ac.jp</a>><br>


<br>
    Hi,<br>
<br>
    If you could post a sample XML file, that is, the output of<br>
    pdftotext modified by hand so that your XML parser accepts it,<br>
    it would help anyone willing to develop a patch.<br>
<br>
    Regards,<br>
    mpsuzuki<br>
<br>
<br>
    On 11/14/2013 10:04 PM, Paweł Leń wrote:<br>
<br>
        Hello,<br>
<br>
        I get an error when running:<br>
        pdftotext -bbox -htmlmeta 'myfile.pdf' 'tempFile.xml'<br>
<br>
        The output XML has a <title> tag at the beginning of the document (meta section), and the error appears when the title contains the "&" character. The title field is neither wrapped in CDATA nor escaped, so it causes an error in my xmllib parser. Can I (or you :) ) fix it somehow?<br>


<br>
        Best regards<br>
<br>
        --<br>
        Paweł Leń<br>
<br>
<br>
<br>
        _________________________________________________<br>
        poppler mailing list<br>
        <a href="mailto:poppler@lists.freedesktop.org" target="_blank">poppler@lists.freedesktop.org</a><br>
        <a href="http://lists.freedesktop.org/mailman/listinfo/poppler" target="_blank">http://lists.freedesktop.org/mailman/listinfo/poppler</a><br>


<br>
<br>
<br>
</blockquote>
<br></div></div><div class="im">
</div></blockquote>
<br>
</blockquote></div><br></div>