[poppler] text extraction does not work
Axel Strübing
axel.struebing at freenet.de
Thu Jan 27 00:13:47 PST 2011
Dear popplers
I have a file from which pdftotext is unable to extract meaningful text. As
far as I can see, the reason is as follows:
The fonts used are CIDType0 fonts with nonstandard CMaps and explicit
ToUnicode entries.
In GfxCIDFont::getNextChar bytes from the content stream are converted to a
CID by means of
*code = (CharCode)(cid = cMap->getCID(s, len, &n));
and the CID is afterwards used unconditionally for the Unicode mapping in
*uLen = ctu->mapToUnicode(cid, u);
The Adobe spec suggests that a ToUnicode entry should be used with highest
priority for text extraction. Therefore I tentatively concluded that the
char codes from the page content stream, not the CIDs, should be used for the
lookup here, and wrote a little patch.
The mapping table (ctu) can originate from:
- readToUnicodeCMap -> use the CharCode from the content stream
- getCIDToUnicode -> maps from CID to Unicode
- getUnicodeToUnicode -> I am not sure about this case and found no example
So I introduced a flag in GfxFont.(h,cc) and use it in
GfxCIDFont::getNextChar.
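
To illustrate the idea, here is a minimal standalone sketch. It is not the
attached patch and not poppler code: ToyCIDFont, CtuSource, ctuUsesCharCode
and mapCode are made-up names, and the real getNextChar does considerably
more. It only shows how such a flag could select the lookup key.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-ins for illustration only, not poppler's real types.
using CharCode = std::uint32_t;
using CID = std::uint32_t;
using Unicode = std::uint32_t;

// Where the ctu table came from (the three cases listed above).
enum class CtuSource { ToUnicodeCMap, CIDToUnicode, UnicodeToUnicode };

struct ToyCIDFont {
    std::map<CharCode, CID> cmap;         // font CMap: content-stream code -> CID
    std::map<std::uint32_t, Unicode> ctu; // keyed by char code or by CID
    CtuSource source = CtuSource::CIDToUnicode;

    // The flag: true only when ctu was read from an explicit ToUnicode CMap.
    // Assumption: the other two sources keep the existing CID-keyed lookup.
    bool ctuUsesCharCode() const { return source == CtuSource::ToUnicodeCMap; }

    // Map one content-stream code to Unicode; returns 0 if there is no entry.
    Unicode mapCode(CharCode code) const {
        auto c = cmap.find(code);
        CID cid = (c != cmap.end()) ? c->second : 0;
        std::uint32_t key = ctuUsesCharCode() ? code : cid;
        auto u = ctu.find(key);
        return (u != ctu.end()) ? u->second : 0;
    }
};

// Usage: a nonstandard CMap sends code 0x0102 to CID 7, while the
// ToUnicode table is keyed by the code itself.
inline void usageExample() {
    ToyCIDFont font;
    font.cmap[0x0102] = 7;
    font.ctu[0x0102] = 0x00E4;
    font.source = CtuSource::ToUnicodeCMap;
    assert(font.mapCode(0x0102) == 0x00E4); // lookup by char code succeeds

    font.source = CtuSource::CIDToUnicode;  // same table looked up by CID
    assert(font.mapCode(0x0102) == 0);      // CID 7 has no entry in the table
}
```

With an identity CMap the char code and the CID coincide, which would explain
why most files extract identically with and without the flag.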
The patched poppler extracts the text from the problematic pdf fine. Testing
text extraction on a small collection of files yields identical results,
except in my problematic case.
I attached the pdf and the patch and ask if someone more knowledgeable than me
could check for possible implications.
Feel free to change anything. Any thoughts are welcome.
regards
Axel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: report.pdf
Type: application/pdf
Size: 173409 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20110127/6db38278/attachment-0001.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: textextraction.patch
Type: text/x-patch
Size: 1900 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20110127/6db38278/attachment-0001.bin>