[poppler] pdftohtml does not preserve fonts

Wed Oct 26 10:00:42 PDT 2011

Many thanks Josh,  

Very interesting to hear you're working on this. Indeed, I tested quite a few things, and there seems to be little that can be done with WebKit letter-spacing (it only takes effect with a large difference). The bug has been known for years, but strangely enough nothing has been done: https://bugs.webkit.org/show_bug.cgi?id=20606
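
To make the per-word span idea from your message (quoted below) concrete, here is a minimal sketch of how I picture it. It is not taken from the pdftohtml code: the function names, the inputs (word widths in pixels, as they would be measured from the PDF) and the absolute-positioning markup are all assumptions for illustration. The point is that each word gets an integer left offset, so the result does not depend on fractional word-spacing or letter-spacing.

```python
from html import escape


def word_positions(widths, line_left, line_width):
    """Return an integer left offset (px) for each word of a justified line.

    widths     -- hypothetical per-word widths in px, taken from the PDF
    line_left  -- left edge of the line in px
    line_width -- target width of the justified line in px
    """
    gaps = len(widths) - 1
    slack = line_width - sum(widths)
    # Share the leftover space equally between words; a single word gets none.
    gap = slack / gaps if gaps > 0 else 0.0
    positions = []
    cursor = float(line_left)
    for width in widths:
        # Round the running position, not each gap, so that the rounding
        # error never accumulates along the line.
        positions.append(round(cursor))
        cursor += width + gap
    return positions


def render_line(words, widths, line_left, line_width, top):
    """Emit one absolutely positioned span per word (illustrative markup)."""
    spans = []
    for word, left in zip(words, word_positions(widths, line_left, line_width)):
        spans.append(
            '<span style="position:absolute;top:%dpx;left:%dpx">%s</span>'
            % (top, left, escape(word))
        )
    return "".join(spans)
```

For example, `word_positions([10.0, 20.0, 10.0], 0, 100)` gives `[0, 40, 90]`, so the last word ends exactly at the right edge. As you say, all of the justification then happens between words rather than between characters.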

No issue then with font extraction; I just wondered whether it was normal not to get the otf/woff/eot files, and apparently it is. I would love to have those scripts. Since I'm working on OS X, I use FontXChange, which works fine but is not a good solution for automating this.  
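
In the meantime, something like the following is what I have in mind for automating the conversion. It is only a sketch under assumptions: it is not one of your scripts, it assumes the `fontforge` command-line tool is installed and on the PATH, and the file patterns for the extracted fonts are a guess on my part. It relies on FontForge's own scripting one-liner (`Open($1); Generate($2)`), where the output format follows from the output file extension.

```python
import shutil
import subprocess
from pathlib import Path

# FontForge native script: open the first argument, write the second one.
# FontForge picks the output format from the output file's extension.
FF_SCRIPT = "Open($1); Generate($2)"


def conversion_command(src, fmt="woff"):
    """Build the fontforge command that converts `src` to the given format."""
    src = Path(src)
    dst = src.with_suffix("." + fmt)
    return ["fontforge", "-lang=ff", "-c", FF_SCRIPT, str(src), str(dst)]


def convert_all(font_dir, fmt="woff", patterns=("*.ttf", "*.otf")):
    """Convert every matching font in `font_dir`; return the output paths.

    `patterns` is an assumption: adjust it to whatever file types the
    font extraction actually produces.
    """
    if shutil.which("fontforge") is None:
        raise RuntimeError("fontforge was not found on PATH")
    converted = []
    for pattern in patterns:
        for src in sorted(Path(font_dir).glob(pattern)):
            cmd = conversion_command(src, fmt)
            subprocess.run(cmd, check=True)  # raises if FontForge fails
            converted.append(Path(cmd[-1]))
    return converted
```

Calling `convert_all("fonts")` on the directory of extracted fonts would then write a `.woff` file next to each source font.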

I'll have a look at the other points then.   

--  
Clément Wehrung
06 88 10 65 91

On Wednesday, 26 October 2011 at 18:55, Josh Richardson wrote:

> Yes, I'm aware of the Gecko vs. WebKit issue.  I have a colleague checking with the WebKit developers; apparently a fix is underway for the decimal issues, but we're unsure when it will be ready.  In the meantime, I tried using text-align-last, but WebKit doesn't seem to honor that.  I tried text-align: justify, but WebKit never seems to reduce spacing in order to justify, so lines break differently from the original document.
>  
> Currently I'm working on a new option for pdftohtml which will place each word in its own span.  While heavy, this should overcome some of WebKit's current limitations and make these pages more usable on Safari, Chrome, etc., although the character-spacing limitation will mean that all the justification happens between words, which is less ideal than how it will work on Firefox.
>  
> I'm not sure exactly what your issue with font extraction is.  Font extraction is relatively simple code with no external dependency, so that should be working.  I have not built font *conversion* into web-enabled formats (WOFF/TTF) into pdftohtml, because I think FontForge, etc. is more suitable for that particular task.  I have a couple of Python scripts to do it, which I'd be happy to check into the repository if that's acceptable to the Poppler maintainers.
>  
> Best, --josh
>  
> From:  Clément Wehrung <cwehrung at gmail.com>
> Date:  Wed, 26 Oct 2011 08:14:09 -0700
> To:  Josh Richardson <jric at chegg.com>
> Cc:  Clément Wehrung <cwehrung at nurves.com>, "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>, Alec Taylor <alec.taylor6 at gmail.com>
> Subject:  Re: [poppler] pdftohtml does not preserve fonts
>  
> Sure, but I believe there are two issues here:  
> 1) Justification is more complicated in WebKit because optimizeLegibility does not (really) work there, and because WebKit handles decimal values poorly in word-spacing and not at all in letter-spacing.
> 2) Due to kerning (I can send you a screenshot overlaying the two renderings in Photoshop), letter-spacing, or word-spacing (?), lines are much longer in WebKit. Hence, if you have "footnotes" as in this PDF, for example, they don't end up in the right place in the text. This is all the more so if the PDF comes from an InDesign export, where there may be "metrics" that cause some text to run over other text. You can always remove all metrics before exporting to PDF, though, which avoids part of the issue.  
>  
> NB: I can't get the extracted fonts to work, but I can send them to you as OTF if you want (I don't know whether extraction is failing because of my installation).  
>  
> PDF file: BugWebkit.pdf (http://cl.ly/0L3g2I1r3G2a0T0o3622)  
>  
> --  
> Clément Wehrung
> 06 88 10 65 91
>  
> On Wednesday, 26 October 2011 at 14:35, Clément Wehrung wrote:
>  
> > You can see the issue more clearly here (Firefox vs. Safari on Mac/iOS):
> >  
> > http://dev.nurves.com/pdf2html/-6.html
> >  
> > Cf. footnotes
> >  
> > WebKit.png (http://cl.ly/3c1B2V1X2u2C2f0M2L0L)
> > Firefox.png (http://cl.ly/0Q111C3u2g3T2U1D3U2u)
> > --  
> > Clément Wehrung
> > 06 88 10 65 91
> >  
> >  
> >  
> > On Wednesday, 26 October 2011 at 14:26, Clément Wehrung wrote:
> >  
> > > Hi Josh,
> > >  
> > > Thanks for all this. I'm already looking at the code now, but I've run into some issues with WebKit rendering compared to Firefox (where it looks really amazing!). I know WebKit has a bug with letter-spacing (it does not take decimals into account), but there's more to it, since text-rendering: optimizeLegibility only partly works. I'm trying to see how we could keep text boxes from ending up on top of one another. I can show you some screenshots if you want.  
> > >  
> > > By the way, why did you choose not to use only the background image for all graphics? Is it in order to place some images over text?
> > >  
> > > Thanks,  
> > >  
> > > Clement  
> > >  
> > > --  
> > > Clément Wehrung
> > > 06 88 10 65 91
> > >  
> > > On Tuesday, 25 October 2011 at 00:41, Josh Richardson wrote:
> > >  
> > > > OK, I sent you a read-only access invitation for now.  Thanks for your offer to help.  Here is my list of bigger issues to give you a flavor; there are a lot of fun things to do.  Let me know what you want to do with pdftohtml!
> > > >  
> > > > Translate drawing operations into canvas with SVG
> > > > Find better way to calculate vertical positioning, by looking at browser source code
> > > > z-index handling -- currently text is never masked by graphics
> > > > Algorithmic extraction of TOC
> > > > Algorithmic extraction of page numbering (Alec may be working on this)
> > > > Algorithmic identification of chapters
> > > > Right-to-left text, proper display (e.g. Arabic, Hebrew)
> > > > Algorithmic detection of text flow (Stephen may be working on this)
> > > > Detection / removal of duplicate images
> > > > Jpg vs. png selection; automatically choose the best format for each image
> > > >  
> > > >  
> > > > --josh
> > > >  
> > > > From:  Clément Wehrung <cwehrung at nurves.com>
> > > > Date:  Mon, 24 Oct 2011 15:27:23 -0700
> > > > To:  Josh Richardson <jric at chegg.com>
> > > > Cc:  "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>, Alec Taylor <alec.taylor6 at gmail.com>
> > > > Subject:  Re: [poppler] pdftohtml does not preserve fonts
> > > >  
> > > > Sure! Do you have a link for the repo so that I can already have a look (I haven't figured out which one it is yet)? I'm really interested in helping you; if you need something on any specific topic, don't hesitate. Many thanks again,
> > > >  
> > > > Clément
> > > >  
> > > >  
> > > > On Mon, Oct 24, 2011 at 8:01 PM, Josh Richardson <jric at chegg.com> wrote:
> > > > > Can you give me a couple of days?  I want to try to get a repo hosted on,
> > > > >  e.g. bitbucket, which is connected to my repo, so that it's easier to keep
> > > > >  everything in synch.  Alec Taylor set up a repo there already, which you
> > > > >  can use to get an immediate snapshot if needed.
> > > > >  
> > > > >  Best, --josh
> > > > >  
> > > > > >  On 10/24/11 10:45 AM, "iclems" <cwehrung at nurves.com> wrote:
> > > > >  
> > > > > >
> > > > > >Dear Josh,
> > > > > >
> > > > > > >As I am working on a pdftohtml project that requires font preservation, I'd
> > > > > > >be really interested in getting this too. Do you think it's possible?
> > > > > >
> > > > > >Thanks,
> > > > > >
> > > > > >Clement
> > > > > > >cwehrung at gmail.com
> > > > > >
> > > > > >
> > > > > >Josh Richardson wrote:
> > > > > >>
> > > > > > >> Preserving fonts is not integrated into the master repository yet.  If you
> > > > > > >> like, I can send you a patched version of Poppler which will do it.
> > > > > > >> You'll still have to run your own process (like FontForge) to convert the
> > > > > > >> fonts into a web-usable format, but it's straightforward as long as the
> > > > > > >> fonts have a mapping to Unicode, and doable even without.
> > > > > >>
> > > > > >> --josh
> > > > > >>
> > > > > > >> From: M Naveed Akram <cmnajs at gmail.com>
> > > > > > >> Date: Fri, 30 Sep 2011 06:52:14 -0700
> > > > > > >> To: "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>
> > > > > >> Subject: [poppler] pdftohtml does not preserve fonts
> > > > > >>
> > > > > >> Hi,
> > > > > >>
> > > > > > >> I have been using the 0.16 release of poppler-utils, but I am facing a
> > > > > > >> problem. When converting PDF to HTML using pdftohtml, it does not preserve
> > > > > > >> fonts in the output HTML. How can I solve this issue? Please help.
> > > > > >>
> > > > > >>
> > > > > >> _______________________________________________
> > > > > >> poppler mailing list
> > > > > > >> poppler at lists.freedesktop.org
> > > > > >> http://lists.freedesktop.org/mailman/listinfo/poppler
> > > > > >>
> > > > > >>
> > > > > >
> > > > > >--
> > > > > > >View this message in context:
> > > > > > >http://old.nabble.com/pdftohtml-does-not-preserve-fonts-tp32569116p32712084.html
> > > > > > >Sent from the Free Desktop - poppler mailing list archive at Nabble.com.
> > > > > >
> > > > > >
> > > > >  
> > > >  
> > >  
> >  
>  
