[poppler] PdftoHtml - Overlapping Characters in Html Due to Missing/Incorrect "Letter-Spacing" attribute

Ihar `Philips` Filipau thephilips at gmail.com
Thu Apr 5 06:08:44 PDT 2012


Hi Parul!

I'm hacking on pdftohtml for other reasons, and can take a look at
your problem - if you can supply me with the PDF. If it is not very
large (say, less than 2 MB) just send it to me by e-mail.

> Could anyone have an idea why this is happening.

PDF is effectively a graphics format - text internally isn't always
represented as it appears on the screen. It takes some effort and
heuristics to reconstruct words and text from the information inside
the PDF.
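
To give a rough idea of what such a heuristic looks like, here is a
minimal sketch in Python. It is not poppler's actual code: the Glyph
class, the rebuild_words function and the gap_ratio threshold are all
made up for illustration. It assumes the glyphs of one text line come
with a left edge and an advance width, possibly out of reading order.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical positioned glyph; not a poppler data structure."""
    char: str
    x: float      # left edge of the glyph on the page
    width: float  # advance width of the glyph


def rebuild_words(glyphs, gap_ratio=0.3):
    """Group the glyphs of one text line into words.

    Glyphs are sorted by x position first, because a PDF may draw them
    in any order. A new word starts when the gap to the previous glyph
    exceeds gap_ratio times that glyph's width (an assumed threshold).
    """
    words, current, prev = [], [], None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None and g.x - (prev.x + prev.width) > gap_ratio * prev.width:
            words.append("".join(current))
            current = []
        current.append(g.char)
        prev = g
    if current:
        words.append("".join(current))
    return words


# "Co" is stored as "o" then "C"; sorting by x restores reading order.
line = [Glyph("o", 10, 5), Glyph("C", 4, 6), Glyph("p", 30, 5), Glyph("o", 25, 5)]
print(rebuild_words(line))  # prints ['Co', 'op']
```

A real converter also has to deal with multiple lines, rotated text,
font size changes and so on, which is where most of the effort goes.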

You can also try 'pdftotext' or even 'pdftotext -raw' - it has more
heuristics and generally works better for such corruptions (at the
cost of ignoring the formatting).
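
If you want to call it from a script, something along these lines
should do. This is only a sketch: the helper names pdftotext_command
and extract_text are mine, and it assumes the pdftotext binary is on
your PATH. Passing "-" as the output file makes pdftotext write to
stdout.

```python
import subprocess


def pdftotext_command(pdf_path, raw=False):
    """Build the pdftotext argument list; '-' sends the text to stdout."""
    cmd = ["pdftotext"]
    if raw:
        # -raw keeps strings in content stream order.
        cmd.append("-raw")
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, raw=False):
    """Run pdftotext and return the extracted text.

    Raises subprocess.CalledProcessError if pdftotext exits with an error.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path, raw),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

Comparing extract_text("doc.pdf") with extract_text("doc.pdf", raw=True)
on the problematic file (the file name is a placeholder) shows quickly
whether the text itself survives and only the positioning goes wrong.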

wbr.

On 4/5/12, Parul Srivastava <parul009 at gmail.com> wrote:
> Hi,
>
> I am using poppler's pdftohtml converter version 0.17.2 to convert some pdf
> docs to Html format. I realize this is an older version and the current
> stable version is 0.18.4. However, due to certain reasons we are sticking
> to this version.
>
> The problem is that when it converts the pdf document to Html form, some
> characters in the Html document are missing. The thing is that when I open
> the Html document in the text editor, no text is missing but while
> displaying, it displays "ration" as "raon". On adding the "letter-spacing"
> attribute to this, it appears absolutely correctly as "ration".
>
> There is another problem. In some sub-headings, I see a huge negative value
> for letter-spacing which causes the string to be reversed in the Html
> display. For instance, "Co" appears as "oC".
>
> Could anyone have an idea why this is happening.
>
> p9
>


-- 
Don't walk behind me, I may not lead.
Don't walk in front of me, I may not follow.
Just walk beside me and be my friend.
    -- Albert Camus (attributed to)

