[poppler] PdftoHtml - Overlapping Characters in Html Due to Missing/Incorrect "Letter-Spacing" attribute

Parul Srivastava parul009 at gmail.com
Thu Apr 12 21:40:54 PDT 2012


Hi,

After analyzing the problem further, I realized that a "fontconverter"
script needs to be run on the extracted fonts after the pdftohtml
conversion. Once that is done, most of the documents convert just fine.

However, one particular document still gives me trouble. Running the font
converter script on it produces the following output:

*********************
Converting font ABCDEE+Calibri,Bold-68-0...Lookup 'kern' Horizontal Kerning lookup 1 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
Converting font ABCDEE+Calibri,BoldItalic-50-0...Lookup 'kern' Horizontal Kerning lookup 1 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
Converting font ABCDEE+Calibri,Italic-28-0...Lookup 'kern' Horizontal Kerning lookup 1 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
Converting font ABCDEE+Calibri-5-0...Lookup 'kern' Horizontal Kerning lookup 1 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
Converting font ABCDEE+Calibri-52-0...
singleFont.py: WARN: No Unicode Mappings found for ABCDEE+Calibri-52-0; I pray it's a CID font
Lookup 'kern' Horizontal Kerning lookup 1 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
*********************

I think this is what is causing the overlapping characters. Would anyone
know how to go about handling it?
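
To narrow down which font is affected, the converter output can be
summarized per font. The following is only a minimal sketch: the function
name summarize_log and the tag names are made up for illustration, and the
matched strings are simply copied from the log in this message, so it
assumes the converter prints exactly that wording.

```python
import re

# Start of each per-font section in the log: "Converting font NAME..."
CONVERTING = re.compile(r"Converting font (\S+?)\.\.\.")
# Message fragments copied from the log output quoted in this message.
NO_UNICODE = "No Unicode Mappings found for"
EXT_LOOKUP = "must use an extension lookup"


def summarize_log(log_text):
    """Return {font_name: set of warning tags} for a converter log."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    # With one capture group, split() yields
    # [prefix, name1, body1, name2, body2, ...].
    parts = CONVERTING.split(flat)
    summary = {}
    for name, body in zip(parts[1::2], parts[2::2]):
        tags = summary.setdefault(name, set())
        if NO_UNICODE in body:
            tags.add("no-unicode-map")
        if EXT_LOOKUP in body:
            tags.add("extension-lookup")
    return summary


if __name__ == "__main__":
    import sys

    # Usage: python summarize_log.py < converter.log
    for font, tags in summarize_log(sys.stdin.read()).items():
        print(font, ", ".join(sorted(tags)) or "no warnings")
```

In the log above, all five fonts report the extension lookup notice, but
only ABCDEE+Calibri-52-0 also reports missing Unicode mappings, so that
font may be the one worth looking at first.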

Regards
Parul


2012/4/5 suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp>

> I'm not hacking on pdftohtml, but I'm interested in whether there is
> any font-related issue. If you can, please create an account on the
> bugs.freedesktop.org Bugzilla, file a new bug, and upload a
> sample PDF there.
>
> Regards,
> mpsuzuki
>
> Ihar `Philips` Filipau wrote:
> > Hi Parul!
> >
> > I'm hacking on pdftohtml for other reasons, and can take a look at
> > your problem - if you can supply me with the PDF. If it is not very
> > large (say less than 2MB) just send it to me per e-mail.
> >
> >> Could anyone have an idea why this is happening?
> >
> > PDF is effectively a graphics format - text internally isn't always
> > represented as it appears on the screen. It takes some effort and
> > heuristics to reconstruct words and text from the information inside
> > the PDF.
> >
> > You can also try to use 'pdftotext' or even 'pdftotext -raw' - it
> > has more heuristics and generally works better for such corruptions
> > (at the cost of ignoring the formatting).
> >
> > wbr.
> >
> > On 4/5/12, Parul Srivastava <parul009 at gmail.com> wrote:
> >> Hi,
> >>
> >> I am using poppler's pdftohtml converter version 0.17.2 to convert some
> >> PDF docs to HTML format. I realize this is an older version and the
> >> current stable version is 0.18.4. However, due to certain reasons we are
> >> sticking to this version.
> >>
> >> The problem is that when it converts the PDF document to HTML form, some
> >> characters in the HTML document are missing. The thing is that when I
> >> open the HTML document in the text editor, no text is missing but while
> >> displaying, it displays "ration" as "raon". On adding the
> >> "letter-spacing" attribute to this, it appears absolutely correctly as
> >> "ration".
> >>
> >> There is another problem. In some sub-headings, I see a huge negative
> >> value for letter-spacing which causes the string to be reversed in the
> >> HTML display. For instance, "Co" appears as "oC".
> >>
> >> Could anyone have an idea why this is happening?
> >>
> >> p9
> >>
> >
> >
>
>