[poppler] Extracting Soft Hyphens from PDF
Mike Tonks
fluffymike at googlemail.com
Tue Jan 19 02:03:58 PST 2010
Hi,
Does poppler support extracting or removing soft hyphens (Unicode
U+00AD, decimal 173) from PDF documents?
I am working on converting PDF documents to e-book formats, and we need
to extract the text and formatting information so that we can reflow the
document and create a basic layout.
I find that pdftohtml, for example, inserts normal hyphens into the text
at these positions. A soft hyphen merely indicates that the word was
broken at a suitable place, so it should not appear in the text / html
version of the document.
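To illustrate the behaviour I am after, here is a minimal Python sketch of the post-processing step. It assumes the extracted text still contains the original U+00AD characters, which is not the case once they have been converted to normal hyphens. The function name is hypothetical and nothing here is part of poppler.

```python
import re

SOFT_HYPHEN = "\u00ad"  # Unicode soft hyphen, decimal 173


def strip_soft_hyphens(text: str) -> str:
    """Remove soft hyphens so that broken words are rejoined (illustrative helper)."""
    # A soft hyphen followed by a line break marks a word split across
    # lines: drop the hyphen, the break, and any surrounding indentation.
    text = re.sub(SOFT_HYPHEN + r"[ \t]*\r?\n[ \t]*", "", text)
    # Any remaining soft hyphens are unused break hints: drop them too.
    return text.replace(SOFT_HYPHEN, "")
```

Normal hyphens (U+002D) are left untouched, so words such as "well-known" keep their hyphen.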
Currently the only program I can find that extracts the text correctly
without hyphens is Adobe Acrobat Pro.
Thanks for any assistance,
Mike Tonks