<div>Hi,</div><div> </div><div>After analyzing the problem further, I realized that a "fontconverter" script needs to be run on the extracted fonts after the pdftohtml conversion. With that step, most of the documents convert just fine.</div> <div> </div><div>However, one particular document still gives me problems. Running the font converter script on it produces the following output:</div><div> </div><div>*********************</div><div>Converting font ABCDEE+Calibri,Bold-68-0...Lookup 'kern' Horizontal Kerning lookup 1 has an<br> offset bigger than 65535 bytes. This means<br>FontForge must use an extension lookup to output it.<br>Not all applications support extension lookups.<br>Converting font ABCDEE+Calibri,BoldItalic-50-0...Lookup 'kern' Horizontal Kerning lookup 1 has an<br> offset bigger than 65535 bytes. This means<br>FontForge must use an extension lookup to output it.<br>Not all applications support extension lookups.<br>Converting font ABCDEE+Calibri,Italic-28-0...Lookup 'kern' Horizontal Kerning lookup 1 has an<br> offset bigger than 65535 bytes. This means<br>FontForge must use an extension lookup to output it.<br>Not all applications support extension lookups.<br>Converting font ABCDEE+Calibri-5-0...Lookup 'kern' Horizontal Kerning lookup 1 has an<br> offset bigger than 65535 bytes. This means<br>FontForge must use an extension lookup to output it.<br>Not all applications support extension lookups.<br>Converting font ABCDEE+Calibri-52-0... <br> singleFont.py: WARN: No Unicode Mappings found for ABCDEE+Calibri-52-0; I pray it's a CID font<br>Lookup 'kern' Horizontal Kerning lookup 1 has an<br>offset bigger than 65535 bytes. This means<br>FontForge must use an extension lookup to output it.<br> Not all applications support extension lookups.</div><div>*********************</div><div> </div><div>I think this is causing the problem. 
Would anyone know how to go about handling this?</div><div> </div><div>Regards</div> <div>Parul<br><br> </div><div class="gmail_quote">2012/4/5 suzuki toshiya <span dir="ltr">&lt;<a href="mailto:mpsuzuki@hiroshima-u.ac.jp" target="_blank">mpsuzuki@hiroshima-u.ac.jp</a>&gt;</span><br> <blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">I'm not hacking pdftohtml, but I'm interested in whether there is<br> any font related issue. If you can, please make your account to<br> <a href="http://bugs.freedesktop.org" target="_blank">bugs.freedesktop.org</a> bugzilla and file new bug, and upload some<br> sample PDF to there.<br> <br> Regards,<br> mpsuzuki<br> <br> Ihar `Philips` Filipau wrote:<br> > Hi Parul!<br> ><br> > I'm hacking on pdftohtml for other reasons, and can take a look at<br> > your problem - if you can supply me with the PDF. If it is not very<br> > large (say less than 2MB) just send it to me per e-mail.<br> <div>><br> >> Could anyone have an idea why this is happening.<br> ><br> </div>> PDF is effectively graphic format - text internally isn't always<br> > represented as it appears on the screen. It takes some effort and<br> > heuristics to reconstruct words and text from the information inside<br> > the PDF.<br> ><br> > You can also try to use the 'pdftotext' or even 'pdftotext -raw' - it<br> > has more heuristics and generally works better for such corruptions<br> > (at cost of ignoring the formatting).<br> ><br> > wbr.<br> <div><div>><br> > On 4/5/12, Parul Srivastava &lt;<a href="mailto:parul009@gmail.com" target="_blank">parul009@gmail.com</a>&gt; wrote:<br> >> Hi,<br> >><br> >> I am using poppler's pdftohtml converter version 0.17.2 to convert some pdf<br> >> docs to Html format. I realize this is an older version and the current<br> >> stable version is 0.18.4. 
However, due to certain reasons we are sticking<br> >> to this version.<br> >><br> >> The problem is that when it converts the pdf document to Html form, some<br> >> characters in the Html document are missing. The thing is that when I open<br> >> the Html document in the text editor, no text is missing but while<br> >> displaying, it displays "ration" as "raon". On adding the "letter-spacing"<br> >> attribute to this, it appears absolutely correctly as "ration".<br> >><br> >> There is another problem. In some sub-headings, I see a huge negative value<br> >> for letter-spacing which causes the string to be reversed in the Html<br> >> display. For instance, "Co" appears as "oC".<br> >><br> >> Could anyone have an idea why this is happening.<br> >><br> >> p9<br> >><br> ><br> ><br> <br> </div></div></blockquote></div><br>