[poppler] text extraction does not work

Axel Strübing axel.struebing at freenet.de
Thu Jan 27 00:13:47 PST 2011


Dear popplers

I have a file from which pdftotext is unable to extract meaningful text. As 
far as I can see, the reason is as follows:

The fonts used are CIDType0 fonts with nonstandard CMaps and explicit 
ToUnicode entries.

In GfxCIDFont::getNextChar, bytes from the content stream are converted to a 
CID by means of 
*code = (CharCode)(cid = cMap->getCID(s, len, &n));
and that CID is afterwards unconditionally used for the Unicode mapping in
*uLen = ctu->mapToUnicode(cid, u);

The Adobe specification suggests that a ToUnicode entry should be used with 
highest priority for text extraction. A ToUnicode CMap maps character codes, 
not CIDs, to Unicode. Therefore I tentatively concluded that the charcodes 
from the page content stream should be used here and wrote a little patch.
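
To illustrate the mismatch, here is a toy model. It is not poppler code: 
ToyFont, lookup and makeExampleFont are made-up names, and the code and CID 
values are invented for the example. It only assumes that the font's CMap is 
not an identity mapping and that the ToUnicode table is keyed by character 
code.

```cpp
#include <cstdint>
#include <map>

// Toy model only; none of these names exist in poppler.
using CharCode = std::uint32_t;
using CID = std::uint32_t;

struct ToyFont {
    std::map<CharCode, CID> cmap;            // content-stream code -> CID
    std::map<CharCode, char32_t> toUnicode;  // ToUnicode, keyed by character code
};

// Looks up the ToUnicode table with the given key.
// Returns U+FFFD (replacement character) when the key is not present.
char32_t lookup(const ToyFont &font, std::uint32_t key) {
    auto it = font.toUnicode.find(key);
    return it != font.toUnicode.end() ? it->second : U'\uFFFD';
}

// Invented example: codes 1 and 2 map to CIDs 0x20 and 0x21,
// while ToUnicode assigns 'A' and 'B' to the codes themselves.
ToyFont makeExampleFont() {
    ToyFont f;
    f.cmap = {{0x01, 0x20}, {0x02, 0x21}};
    f.toUnicode = {{0x01, U'A'}, {0x02, U'B'}};
    return f;
}
```

With this font, lookup(f, 0x01) finds 'A', whereas lookup(f, f.cmap.at(0x01)) 
uses the CID 0x20 as key and finds nothing. With an identity CMap both keys 
coincide, which may be why the problem does not show up with most files.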

The mapping table (ctu) can originate from:
- readToUnicodeCMap -> use the CharCode from the content stream
- getCIDToUnicode -> map from CID to Unicode
- getUnicodeToUnicode -> I am not sure about this case and found no example

So I introduced a flag in GfxFont.(h,cc) and use it in 
GfxCIDFont::getNextChar.
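
The idea behind the flag can be sketched roughly as follows. This is not the 
attached patch and not poppler's real classes: CtuOrigin, SketchCIDFont and 
ctuUsesCharCode are hypothetical names. The UnicodeToUnicode case is simply 
left at the old CID-keyed behaviour here, which is an assumption, since I 
found no example for it.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical sketch; these types do not exist in poppler.
enum class CtuOrigin { ToUnicodeCMap, CIDToUnicode, UnicodeToUnicode };

struct SketchCIDFont {
    std::unordered_map<std::uint32_t, std::uint32_t> cmap;  // char code -> CID
    std::unordered_map<std::uint32_t, char32_t> ctu;        // key -> Unicode
    bool ctuUsesCharCode;  // flag: true only if ctu came from a ToUnicode CMap

    explicit SketchCIDFont(CtuOrigin origin)
        : ctuUsesCharCode(origin == CtuOrigin::ToUnicodeCMap) {}

    // Maps one content-stream code to Unicode; returns 0 if nothing matches.
    char32_t mapToUnicode(std::uint32_t code) const {
        auto c = cmap.find(code);
        std::uint32_t cid = (c != cmap.end()) ? c->second : 0;
        // Choose the lookup key according to where the table came from.
        std::uint32_t key = ctuUsesCharCode ? code : cid;
        auto u = ctu.find(key);
        return (u != ctu.end()) ? u->second : 0;
    }
};
```

The flag is set once when the table is built, so the per-character path only 
gains a single branch.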

The patched poppler extracts the text of the problematic PDF fine. Testing 
text extraction on a small collection of files yields identical results 
except for my problematic case.

I have attached the PDF and the patch, and would ask someone more 
knowledgeable than me to check for possible implications.

Feel free to change anything. Any thoughts are welcome.

regards
Axel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: report.pdf
Type: application/pdf
Size: 173409 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20110127/6db38278/attachment-0001.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: textextraction.patch
Type: text/x-patch
Size: 1900 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20110127/6db38278/attachment-0001.bin>

