[poppler] Extracting Soft Hyphens from PDF

Mike Tonks fluffymike at googlemail.com
Tue Jan 19 02:03:58 PST 2010


Hi,

Does poppler support extraction / removal of soft hyphens (Unicode
U+00AD, decimal 173) from PDF documents?

I am working on converting PDF documents to ebook formats, and we need
to extract the text and formatting information so that we can reflow
the document and create a basic layout.

I find that pdftohtml, for example, inserts normal hyphens into the
text where the soft hyphen merely indicates that the word was broken at
a suitable place and should not appear in the text / HTML version of
the document.
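
To illustrate the desired result, here is a minimal Python sketch of
such a cleanup step. It assumes the extracted text still contains the
soft hyphen character (U+00AD); it cannot help once the extractor has
already replaced that character with a normal hyphen. The function name
strip_soft_hyphens is hypothetical and is not part of poppler.

```python
import re

SOFT_HYPHEN = "\u00ad"  # Unicode code point 173 (assumed to survive extraction)


def strip_soft_hyphens(text: str) -> str:
    """Remove soft hyphens and rejoin words that were split at a line break."""
    # A soft hyphen followed by a line break marks a word broken for layout:
    # drop the hyphen, the break, and any surrounding indentation.
    text = re.sub(SOFT_HYPHEN + r"[ \t]*\r?\n[ \t]*", "", text)
    # Any remaining soft hyphens are invisible break hints; drop them too.
    return text.replace(SOFT_HYPHEN, "")


if __name__ == "__main__":
    print(strip_soft_hyphens("docu\u00ad\nment"))  # prints "document"
```

Ordinary hyphens (U+002D), as in "well-known", are left untouched, since
only U+00AD is treated as a layout hint.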

Currently the only program I can find that extracts the text correctly
without hyphens is Adobe Acrobat Pro.


Thanks for any assistance,

Mike Tonks

