[poppler] PdftoHtml - Overlapping Characters in Html Due to Missing/Incorrect "Letter-Spacing" attribute

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Thu Apr 5 06:30:06 PDT 2012


I'm not hacking on pdftohtml, but I'm interested in whether there is
any font-related issue. If you can, please create an account on the
bugs.freedesktop.org Bugzilla, file a new bug, and attach a sample
PDF there.

Regards,
mpsuzuki

Ihar `Philips` Filipau wrote:
> Hi Parul!
> 
> I'm hacking on pdftohtml for other reasons, and can take a look at
> your problem if you can supply me with the PDF. If it is not very
> large (say, less than 2MB), just send it to me by e-mail.
> 
>> Does anyone have an idea why this is happening?
> 
> PDF is effectively a graphics format: text isn't always represented
> internally the way it appears on the screen. It takes some effort and
> heuristics to reconstruct words and text from the information inside
> the PDF.
> 
> You can also try 'pdftotext' or even 'pdftotext -raw'; it has more
> heuristics and generally works better for such corruptions (at the
> cost of ignoring the formatting).
> 
> wbr.
> 
> On 4/5/12, Parul Srivastava <parul009 at gmail.com> wrote:
>> Hi,
>>
>> I am using poppler's pdftohtml converter version 0.17.2 to convert some pdf
>> docs to Html format. I realize this is an older version and the current
>> stable version is 0.18.4. However, due to certain reasons we are sticking
>> to this version.
>>
>> The problem is that when it converts the pdf document to Html form, some
>> characters in the Html document are missing. When I open the Html
>> document in a text editor, no text is missing, but when displayed, it
>> shows "ration" as "raon". After adding the "letter-spacing" attribute
>> to it, it appears correctly as "ration".
>>
>> There is another problem. In some sub-headings, I see a huge negative value
>> for letter-spacing which causes the string to be reversed in the Html
>> display. For instance, "Co" appears as "oC".
>>
>> Does anyone have an idea why this is happening?
>>
>> p9
>>
> 
> 
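To make the second symptom in the quoted report concrete, here is a
rough sketch of why a large negative letter-spacing can flip "Co" into
"oC" on screen. It is not pdftohtml's code and not a real browser
layout engine; it assumes a simplified model in which letter-spacing is
added after every glyph, and the advance widths (10 and 8) and the
spacing value (-25) are made-up numbers chosen only for illustration.

```python
def glyph_positions(advances, letter_spacing=0.0):
    """Return the left x-coordinate of each glyph, laid out left to right.

    advances: per-glyph advance widths (hypothetical values, in px).
    letter_spacing: extra space added after every glyph (simplified model).
    """
    xs, x = [], 0.0
    for adv in advances:
        xs.append(x)
        x += adv + letter_spacing
    return xs


def visual_order(text, advances, letter_spacing=0.0):
    """Return the characters in the order they appear from left to right."""
    xs = glyph_positions(advances, letter_spacing)
    return "".join(ch for _, ch in sorted(zip(xs, text)))


# Hypothetical advance widths for "C" and "o".
advances = [10.0, 8.0]

print(visual_order("Co", advances))         # normal spacing
print(visual_order("Co", advances, -25.0))  # huge negative spacing
```

With no extra spacing the "o" starts at x = 10, to the right of the "C".
With a spacing of -25 it starts at x = -15, to the left of the "C", so
the pair reads as "oC" even though the Html source still says "Co".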


