[poppler] Multicolumn select

Baz brian.ewins at gmail.com
Mon Nov 23 01:37:42 PST 2009


2009/11/18 Albert Astals Cid <aacid at kde.org>:
> On Monday, 16 November 2009, Baz wrote:
>> I've checked now... yes pdftotext with no flags will hit the new
>> reading order code.
>
> And that is good or bad? :D
>

It turns out, good.

These are the results of comparing the size of the diff against Acrobat
output for poppler before and after the patch. The diff is done on word
order only, to try to pick up paragraphs that have been misplaced.
The filenames refer to the bug trackers where I found these files:
freedesktop, GNOME, Ubuntu Launchpad, and KDE.

status  filename          unpatched  patched  difference
SAME    fdo-18531-1.pdf        1215     1218          0%
SAME    gno-333967-1.pdf        971      971          0%
PASS    gno-360722-1.pdf        553      431         22%
PASS    gno-481825-1.pdf       2413     1582         34%
PASS    gno-494078-1.pdf       7494     5462         27%
PASS    gno-500352-1.pdf      11904    11204          5%
FAIL    gno-588476-1.pdf       1192     1277         -7%
FAIL    hig-2.0.pdf            3908     5057        -29%
SAME    kde-184399-1.pdf        159      159          0%
SAME    ubu-181737-1.pdf      18709    18724          0%
FAIL    ubu-251412-1.pdf        528      551         -4%
PASS    ubu-33288-2.pdf        2535      154         93%
SAME    ubu-346403-1.pdf        437      439          0%
PASS    ubu-367770-1.pdf       2955     2408         18%
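
For anyone checking the last column by hand: it is the integer
percentage by which the diff against Acrobat shrank, and the status
follows from it. A small sketch of that arithmetic, using the same
thresholds as the test script further down; `classify` is a made-up
helper name used only for this illustration.

```shell
# classify UNPATCHED PATCHED: print the status and the percentage change.
# (hypothetical helper, not part of the test script)
classify() {
    # Integer division truncates toward zero; negative means the diff grew.
    pct=$(( 100 * ($1 - $2) / $1 ))
    status=SAME
    if [ "$pct" -gt 1 ]; then status=PASS; fi
    if [ "$pct" -lt -1 ]; then status=FAIL; fi
    echo "$status ${pct}%"
}

classify 553 431     # gno-360722-1.pdf -> PASS 22%
classify 1192 1277   # gno-588476-1.pdf -> FAIL -7%
classify 1215 1218   # fdo-18531-1.pdf  -> SAME 0%
```

Changes of 1% or less in either direction count as SAME.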

The 3 failures were largely due to numbered footnotes or tables; the
body text was fine. So for the most part the patched version improves
reading order detection. If I can get the bullet points and numbers to
be part of the correct block, those failures should go away.

Here is the test script, in case you want to try this on your own
corpus. I ran it in a directory of PDFs with a subdirectory 'acrobat'
holding my ground truth. I ignored non-ASCII characters because the
Acrobat output was in win-1252.

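#!/bin/sh
# Takes one PDF filename; run from the directory holding the PDFs.
PDF=$1
TXT=${PDF%.pdf}.txt
# first = Acrobat ground truth, second = unpatched pdftotext from PATH,
# third = patched pdftotext built in ~/poppler
cp "acrobat/$TXT" first
pdftotext "$PDF" second
~/poppler/utils/pdftotext "$PDF" third
# Reduce each file to one ASCII word per line, so only word order is compared
perl -i.bak -ne 'for(/([A-Za-z0-9.,;:]+)/g){print "$_\n";}' first second third
# Diff sizes, in lines of diff output
DIFF12=$(diff -udwb first second | wc -l)
DIFF13=$(diff -udwb first third | wc -l)
# unpatched vs patched; not used in the report below
DIFF23=$(diff -udwb second third | wc -l)
# Percentage by which the patch shrinks the diff against Acrobat
DIFF=$(expr \( 100 \* \( $DIFF12 - $DIFF13 \) \) / $DIFF12 )
STATUS=SAME
if [ "$DIFF" -gt 1 ]
then
    STATUS=PASS
fi
if [ "$DIFF" -lt -1 ]
then
    STATUS=FAIL
fi

echo "$STATUS" "$PDF" $DIFF12 $DIFF13 "${DIFF}%"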
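
To run this over a whole directory, something along these lines should
do. It is only a sketch: `run_corpus` is a made-up name, and it assumes
the script above was saved to a file (for example compare.sh, also a
made-up name) and that ground truth lives in 'acrobat' as described.

```shell
# run_corpus SCRIPT: run SCRIPT once for every PDF in the current
# directory that has a matching ground-truth file in acrobat/.
# (hypothetical wrapper, not part of the original test script)
run_corpus() {
    script=$1
    for pdf in *.pdf
    do
        # Skip PDFs without Acrobat output. This also covers an empty
        # directory, where the glob stays as the literal string '*.pdf'.
        [ -f "acrobat/${pdf%.pdf}.txt" ] || continue
        sh "$script" "$pdf"
    done
}

# Usage, from the directory holding the PDFs:
#   run_corpus ./compare.sh
```

Each PDF produces one result line, in the order the shell expands the
glob.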

