<div>Hi,</div><div> </div><div>After analyzing the problem further, I found that a "fontconverter" script needs to be run on the extracted fonts after the pdftohtml conversion. With that step, most of the documents convert just fine.</div>
<div> </div><div>However, one particular document is still giving me problems. Running the font converter script on it produces the following output:</div><div> </div><div>*********************</div><div>Converting font ABCDEE+Calibri,Bold-68-0...Lookup 'kern' Horizontal Kerning lookup 1 has an<br>
offset bigger than 65535 bytes. This means<br>FontForge must use an extension lookup to output it.<br>Not all applications support extension lookups.<br>Converting font ABCDEE+Calibri,BoldItalic-50-0...Lookup 'kern' Horizontal Kerning lookup 1 has an<br>
offset bigger than 65535 bytes. This means<br>FontForge must use an extension lookup to output it.<br>Not all applications support extension lookups.<br>Converting font ABCDEE+Calibri,Italic-28-0...Lookup 'kern' Horizontal Kerning lookup 1 has an<br>
offset bigger than 65535 bytes. This means<br>FontForge must use an extension lookup to output it.<br>Not all applications support extension lookups.<br>Converting font ABCDEE+Calibri-5-0...Lookup 'kern' Horizontal Kerning lookup 1 has an<br>
offset bigger than 65535 bytes. This means<br>FontForge must use an extension lookup to output it.<br>Not all applications support extension lookups.<br>Converting font ABCDEE+Calibri-52-0... <br>
singleFont.py: WARN: No Unicode Mappings found for ABCDEE+Calibri-52-0; I pray it's a CID font<br>Lookup 'kern' Horizontal Kerning lookup 1 has an<br>offset bigger than 65535 bytes. This means<br>FontForge must use an extension lookup to output it.<br>
Not all applications support extension lookups.</div><div>*********************</div><div> </div><div>I think this is causing the problem. Does anyone know how to handle this?</div><div> </div><div>Regards</div>
<div>Parul<br><br> </div><div class="gmail_quote">2012/4/5 suzuki toshiya <span dir="ltr">&lt;<a href="mailto:mpsuzuki@hiroshima-u.ac.jp" target="_blank">mpsuzuki@hiroshima-u.ac.jp</a>&gt;</span><br>
<blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">I'm not hacking pdftohtml, but I'm interested in whether there is<br>
any font-related issue. If you can, please create an account on the<br>
<a href="http://bugs.freedesktop.org" target="_blank">bugs.freedesktop.org</a> Bugzilla, file a new bug, and upload a<br>
sample PDF there.<br>
<br>
Regards,<br>
mpsuzuki<br>
<br>
Ihar `Philips` Filipau wrote:<br>
> Hi Parul!<br>
><br>
> I'm hacking on pdftohtml for other reasons, and can take a look at<br>
> your problem - if you can supply me with the PDF. If it is not very<br>
> large (say less than 2MB), just send it to me by e-mail.<br>
<div>><br>
>> Does anyone have an idea why this is happening?<br>
><br>
</div>> PDF is effectively a graphics format - text internally isn't always<br>
> represented as it appears on the screen. It takes some effort and<br>
> heuristics to reconstruct words and text from the information inside<br>
> the PDF.<br>
><br>
> You can also try 'pdftotext' or even 'pdftotext -raw' - it<br>
> has more heuristics and generally works better for such corruptions<br>
> (at the cost of ignoring the formatting).<br>
><br>
> wbr.<br>
<div><div>><br>
> On 4/5/12, Parul Srivastava &lt;<a href="mailto:parul009@gmail.com" target="_blank">parul009@gmail.com</a>&gt; wrote:<br>
>> Hi,<br>
>><br>
>> I am using poppler's pdftohtml converter version 0.17.2 to convert some PDF<br>
>> docs to HTML format. I realize this is an older version and the current<br>
>> stable version is 0.18.4. However, due to certain reasons we are sticking<br>
>> to this version.<br>
>><br>
>> The problem is that when it converts the PDF document to HTML form, some<br>
>> characters in the HTML document are missing. The thing is that when I open<br>
>> the HTML document in the text editor, no text is missing, but when<br>
>> displayed, it shows "ration" as "raon". On adding the "letter-spacing"<br>
>> attribute to this, it appears absolutely correctly as "ration".<br>
>><br>
>> There is another problem. In some sub-headings, I see a huge negative value<br>
>> for letter-spacing which causes the string to be reversed in the HTML<br>
>> display. For instance, "Co" appears as "oC".<br>
>><br>
>> Does anyone have an idea why this is happening?<br>
>><br>
>> p9<br>
>><br>
><br>
><br>
<br>
</div></div></blockquote></div><br>
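<div>A note on the converter output quoted at the top of this thread: below is a minimal sketch, assuming the output has been saved as plain text with one log line per row, that groups the two kinds of messages by font. The function name and the labels "extension-lookup" and "no-unicode-map" are hypothetical and chosen only for illustration; the matched strings are copied from the log above. It only shows which fonts triggered which message. It does not establish which message, if either, is behind the rendering problem.</div>
<pre>
```python
import re

# Patterns copied from the log excerpt quoted in this thread.
FONT_RE = re.compile(r"Converting font (\S+?)\.\.\.")
EXT_LOOKUP_MARK = "offset bigger than 65535 bytes"
NO_UNICODE_MARK = "No Unicode Mappings found"


def summarize_converter_log(log_text):
    """Map each font name to the set of message kinds reported for it.

    The labels used here are hypothetical names, not FontForge terms.
    """
    summary = {}
    current = None
    for line in log_text.splitlines():
        match = FONT_RE.search(line)
        if match:
            # A new "Converting font NAME..." line starts a new group.
            current = match.group(1)
            summary.setdefault(current, set())
        if current is None:
            continue  # ignore anything before the first font line
        if EXT_LOOKUP_MARK in line:
            summary[current].add("extension-lookup")
        if NO_UNICODE_MARK in line:
            summary[current].add("no-unicode-map")
    return summary


if __name__ == "__main__":
    # Shortened sample in the same shape as the log in this thread.
    sample = (
        "Converting font ABCDEE+Calibri-5-0...Lookup 'kern' Horizontal Kerning lookup 1 has an\n"
        "offset bigger than 65535 bytes. This means\n"
        "Converting font ABCDEE+Calibri-52-0... \n"
        "singleFont.py: WARN: No Unicode Mappings found for ABCDEE+Calibri-52-0; I pray it's a CID font\n"
        "Lookup 'kern' Horizontal Kerning lookup 1 has an\n"
        "offset bigger than 65535 bytes. This means\n"
    )
    for font, kinds in summarize_converter_log(sample).items():
        print(font, sorted(kinds))
```
</pre>
<div>Run on the full log from this thread, this should report the extension-lookup message for all five fonts and the Unicode-mapping warning only for ABCDEE+Calibri-52-0, which may help narrow down which font to look at first.</div>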