[poppler] text extraction does not work

Fri Jan 28 00:56:46 PST 2011

On Thursday, 27 January 2011, Axel Strübing wrote:
> Dear popplers
> 
> I have a file from which pdftotext is unable to extract meaningful text.
> As far as I can see the reason is as follows:
> 
> The fonts used are CIDType0 fonts with nonstandard CMaps and explicit
> ToUnicode entries.
> 
> In GfxCIDFont::getNextChar bytes from the content stream are converted to a
> CID by means of
> *code = (CharCode)(cid = cMap->getCID(s, len, &n));
> and afterwards unconditionally used for unicode mapping in
> *uLen = ctu->mapToUnicode(cid, u);
> 
> The Adobe specs suggest a ToUnicode entry should be used with highest
> priority for text extraction. Therefore I tentatively concluded that the
> charcodes from the page content stream should be used here and wrote a
> little patch.
> 
> The mapping table (ctu) can originate from:
> - readToUnicodeCMap -> use CharCode from content stream
> - could be generated from getCIDToUnicode -> map from cid to unicode
> - could be generated from getUnicodeToUnicode -> I am not sure about this
> case and found no example
> 
> So I introduced a flag in GfxFont.(h,cc) and use it in
> GfxCIDFont::getNextChar.
> 
> The patched poppler extracts the problematic pdf fine. Testing text
> extraction on a small collection of files yields identical results
> except in my problematic case.
> 
> I attached the pdf and the patch and ask if someone more knowledgeable than
> me could check for possible implications.
> 
> Feel free to change anything. Any thoughts are welcome.

I'll run this on all the PDFs I have lying around and see whether there are
any regressions. Thanks for the patch.

Albert

> 
> regards
> Axel
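
For readers without the attachment, the idea in the quoted message can be
sketched roughly as follows. This is not the actual patch and not poppler's
API: ToyCMap, ToyToUnicode, ToyCIDFont and the flag name ctuKeyedByCharCode
are hypothetical stand-ins, and a fixed two-byte code length is assumed for
brevity. The sketch only shows the choice of lookup key: the character code
from the content stream when the table came from an explicit ToUnicode CMap,
the CID otherwise.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical, simplified stand-ins; these are NOT poppler's real classes.
using CharCode = std::uint32_t;
using CID = std::uint32_t;
using Unicode = std::uint32_t;

// Toy CMap: maps a character code from the content stream to a CID.
struct ToyCMap {
    std::map<CharCode, CID> codeToCid;

    CID getCID(CharCode code) const {
        auto it = codeToCid.find(code);
        return it == codeToCid.end() ? 0 : it->second;
    }
};

// Toy lookup table standing in for the "ctu" mapping table.
struct ToyToUnicode {
    std::map<std::uint32_t, std::vector<Unicode>> table;

    std::vector<Unicode> map(std::uint32_t key) const {
        auto it = table.find(key);
        return it == table.end() ? std::vector<Unicode>{} : it->second;
    }
};

struct ToyCIDFont {
    ToyCMap cmap;
    ToyToUnicode ctu;
    // Hypothetical flag: true when ctu was read from an explicit ToUnicode
    // CMap and is therefore keyed by content-stream character codes.
    bool ctuKeyedByCharCode = false;

    // Reads one two-byte code from s (assumption: fixed two-byte codes).
    // Returns the number of bytes consumed.
    int getNextChar(const unsigned char *s, int len,
                    CharCode *code, std::vector<Unicode> *u) const {
        assert(s != nullptr && len >= 2);
        CharCode raw = (static_cast<CharCode>(s[0]) << 8) | s[1];
        CID cid = cmap.getCID(raw);
        *code = cid;  // as in the quoted snippet, *code carries the CID
        // The point of the proposal: pick the lookup key by table origin.
        *u = ctu.map(ctuKeyedByCharCode ? raw : cid);
        return 2;
    }
};
```

With a nonstandard CMap that sends code 0x0102 to CID 7 and a table entry
keyed by 0x0102, the lookup only succeeds when the flag is set; keyed by the
CID it finds nothing, which matches the symptom described above.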