[poppler] Regression in text extraction

Albert Astals Cid aacid at kde.org
Sun Jun 29 06:49:32 PDT 2008


On Sunday 29 June 2008, Adrian Johnson wrote:
> The following commit introduced a regression in text extraction from PDF
> files that use ActualText:
>
>     commit 2da15db4751d3cb93d40b48e348dbc51f6e7a29f
>     Author: Carlos Garcia Campos <carlosgc at gnome.org>
>     Date:   Fri Jun 20 11:39:08 2008 +0200
>
>         Do not create an OCGs object if there isn't an OCProperties
>         dictionary in the Catalog
>
> The problem is the code added to Gfx::opBeginMarkedContent() that exits
> the function before beginMarkedContent() in the TextOutputDev is called.
> Gfx::opEndMarkedContent() also has the same problem.

Right, the attached patch should fix the problem. Can you test it?
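
For readers without the attachment, here is a minimal, self-contained sketch
of the kind of early return described in the report. It is not the poppler
code and not the attached patch: RecordingDev, MiniGfx, hasOCGs and the
Buggy/Fixed method names are all hypothetical stand-ins, and the real Gfx and
TextOutputDev classes have different signatures. It only shows why returning
before the output device is notified hides marked content from it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an output device (e.g. a text extractor).
// It just records which marked-content callbacks it received.
struct RecordingDev {
    std::vector<std::string> events;
    void beginMarkedContent(const std::string &name) { events.push_back("begin:" + name); }
    void endMarkedContent() { events.push_back("end"); }
};

// Hypothetical stand-in for the content-stream interpreter.
// hasOCGs models "the document has optional-content data" (an assumption
// made for this sketch, not a poppler member).
struct MiniGfx {
    RecordingDev *out;
    bool hasOCGs;

    // Problematic shape: the early return also skips the device callback,
    // so documents without optional content never report marked content.
    void beginMarkedContentBuggy(const std::string &name) {
        if (!hasOCGs) {
            return;
        }
        // ... optional-content handling would go here ...
        out->beginMarkedContent(name);
    }

    // One possible shape of a fix: guard only the optional-content work
    // and always notify the device.
    void beginMarkedContentFixed(const std::string &name) {
        if (hasOCGs) {
            // ... optional-content handling would go here ...
        }
        out->beginMarkedContent(name);
    }

    // The matching end operator needs the same treatment.
    void endMarkedContentFixed() {
        if (hasOCGs) {
            // ... optional-content handling would go here ...
        }
        out->endMarkedContent();
    }
};

// Small self-check: with no optional-content data, the buggy path records
// nothing while the fixed path records both callbacks.
inline void selfCheck() {
    RecordingDev buggyDev;
    MiniGfx buggy{&buggyDev, false};
    buggy.beginMarkedContentBuggy("Span");
    assert(buggyDev.events.empty());

    RecordingDev fixedDev;
    MiniGfx fixed{&fixedDev, false};
    fixed.beginMarkedContentFixed("Span");
    fixed.endMarkedContentFixed();
    assert(fixedDev.events.size() == 2);
}
```

Calling selfCheck() from a test program exercises both paths. The point of
the sketch is only the ordering: the device callback must not sit behind the
optional-content check.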

Also, can you please send a URL to a PDF where ActualText gives different
output than "classical" text extraction?

Albert



-------------- next part --------------
A non-text attachment was scrubbed...
Name: markedContent.patch
Type: text/x-diff
Size: 1230 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20080629/5a319b02/attachment.patch 

