[poppler] text extraction does not work
Albert Astals Cid
aacid at kde.org
Fri Jan 28 11:22:13 PST 2011
On Friday, 28 January 2011, Albert Astals Cid wrote:
> On Thursday, 27 January 2011, Axel Strübing wrote:
> > Dear popplers
> > I have a file from which pdftotext is unable to extract meaningful text.
> > As far as I can see the reason is as follows:
> > The fonts used are CIDType0 fonts with nonstandard CMaps and explicit
> > ToUnicode entries.
> > In GfxCIDFont::getNextChar bytes from the content stream are converted to
> > a CID by means of
> > *code = (CharCode)(cid = cMap->getCID(s, len, &n));
> > and the CID is afterwards unconditionally used for Unicode mapping in
> > *uLen = ctu->mapToUnicode(cid, u);
> > The Adobe spec suggests that a ToUnicode entry should be used with highest
> > priority for text extraction. Therefore I tentatively concluded that the
> > charcodes from the page content stream should be used here and wrote a
> > little patch.
> > The mapping table (ctu) can originate from:
> > - readToUnicodeCMap -> use CharCode from content stream
> > - could be generated from getCIDToUnicode -> map from cid to unicode
> > - could be generated from getUnicodeToUnicode -> I am not sure about this
> > one and found no example
> > So I introduced a flag in GfxFont.(h,cc) and use it in
> > GfxCIDFont::getNextChar.
> > The patched poppler extracts the problematic PDF fine. Testing text
> > extraction on a small collection of files yields identical results
> > except in my problematic case.
> > I attached the PDF and the patch and ask if someone more knowledgeable
> > than me could check for possible implications.
> > Feel free to change anything. Any thoughts are welcome.
> I'll run this against all the PDFs I have lying around and see whether
> there are any regressions. Thanks for the patch.
The regression test passed, so I've committed a somewhat cleaned-up version
of your patch to git; it will be part of poppler 0.16.2.
If you have any other patches, keep them coming!
> > regards
> > Axel
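
For readers following the thread, the idea in the quoted message can be
sketched roughly as below. This is a toy model, not poppler's code and not
the committed patch: ToyCMap, ToyFont, ctuFromToUnicode and makeExampleFont
are hypothetical names invented for illustration. It only shows the choice
of lookup key: when the table came from a ToUnicode CMap, the character code
read from the content stream is used; otherwise the CID is used.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-ins for illustration; these are not poppler's types.
using CharCode = std::uint32_t;
using CID = std::uint32_t;
using Unicode = std::uint32_t;

// Toy CMap: fixed-width character codes mapped to CIDs.
struct ToyCMap {
    std::map<CharCode, CID> codeToCid;
    int bytesPerCode = 2;

    // Reads one character code from s, reports it in *code and the number
    // of bytes consumed in *nUsed, and returns its CID (0 if unmapped).
    CID getCID(const unsigned char *s, int len, int *nUsed,
               CharCode *code) const {
        assert(s != nullptr && len > 0);
        int n = len < bytesPerCode ? len : bytesPerCode;
        CharCode c = 0;
        for (int i = 0; i < n; ++i) {
            c = (c << 8) | s[i];
        }
        *nUsed = n;
        *code = c;
        auto it = codeToCid.find(c);
        return it == codeToCid.end() ? 0 : it->second;
    }
};

// Toy font holding one mapping table (ctu) plus a flag recording its origin.
struct ToyFont {
    ToyCMap cmap;
    std::map<std::uint32_t, Unicode> ctu;
    // Assumed flag: true if ctu was read from a ToUnicode CMap (keyed by
    // character code), false if it is keyed by CID.
    bool ctuFromToUnicode = false;

    // Returns the number of bytes consumed; stores the mapped Unicode value
    // in *u, or 0 if the table has no entry for the chosen key.
    int getNextChar(const unsigned char *s, int len, Unicode *u) const {
        int n = 0;
        CharCode code = 0;
        CID cid = cmap.getCID(s, len, &n, &code);
        std::uint32_t key = ctuFromToUnicode ? code : cid;
        auto it = ctu.find(key);
        *u = (it == ctu.end()) ? 0 : it->second;
        return n;
    }
};

// Builds a font whose nonstandard CMap sends code 0x0041 to CID 7, and whose
// ToUnicode-style table maps code 0x0041 to U+0041.
ToyFont makeExampleFont() {
    ToyFont font;
    font.cmap.codeToCid[0x0041] = 7;
    font.ctu[0x0041] = 0x0041;
    font.ctuFromToUnicode = true;
    return font;
}
```

With the flag set, the two bytes 00 41 are looked up by character code and
yield U+0041. With the flag cleared, the same bytes are looked up by CID 7
and find no entry, which is the kind of failure described in the report.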