[poppler] Extracting Soft Hyphens from PDF

Tue Jan 19 14:54:50 PST 2010

On Tuesday, 19 January 2010, Mike Tonks wrote:
> Hi,
> 
> Does poppler support extraction / removal of soft hyphens (Unicode
> U+00AD, decimal 173) from PDF documents?
> 
> I am working on converting PDF documents to Ebook formats, and we need
> to extract the text and formatting information to try to reflow the
> document and create basic layout.
> 
> I find that pdftohtml, for example, inserts normal hyphens into the
> text. The soft hyphen merely indicates that the word was broken at a
> suitable place, so it should not appear in the text / html version of
> the document.
> 
> Currently the only program I can find that extracts the text correctly
> without hyphens is Adobe Acrobat Pro.

I don't think any of our tools can do that at the moment, but it would
probably not be too difficult to implement. On the other hand, we are badly
understaffed, so if this is important to you, we will be very happy to
review any patch you might have that adds the feature to the poppler tools.
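
In the meantime, one possible stopgap is to post-process the extracted text
outside poppler. The sketch below is only an illustration and not part of
poppler: the function name strip_soft_hyphens is hypothetical, and it assumes
the extracted text still contains the real soft hyphen character (U+00AD). If
the tool has already turned it into a normal hyphen, as described for
pdftohtml, this approach cannot tell the two apart.

```python
import re

SOFT_HYPHEN = "\u00ad"  # Unicode code point 173 (U+00AD)


def strip_soft_hyphens(text: str) -> str:
    """Remove soft hyphens from extracted text (hypothetical helper).

    Assumes the extractor preserved U+00AD rather than mapping it
    to an ordinary '-' character.
    """
    # A soft hyphen directly before a line break marks a word that was
    # split across lines: drop the hyphen and the break to rejoin the word.
    text = re.sub(SOFT_HYPHEN + r"[ \t]*\r?\n[ \t]*", "", text)
    # Any remaining soft hyphens are invisible break hints; drop them too.
    return text.replace(SOFT_HYPHEN, "")
```

Ordinary hyphens, as in "well-known", are left untouched, so this only helps
when the soft hyphen survives extraction.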

Albert

> 
> 
> Thanks for any assistance,
> 
> Mike Tonks