[poppler] Multicolumn select

Mon Dec 7 15:25:23 PST 2009

Sorry for the late reply, I've moved job and country in the meantime :D

On Monday 23 November 2009 09:37:42, Baz wrote:
> 2009/11/18 Albert Astals Cid <aacid at kde.org>:
> > On Monday, 16 November 2009, Baz wrote:
> >> I've checked now... yes pdftotext with no flags will hit the new
> >> reading order code.
> >
> > And that is good or bad? :D
> 
> It turns out, good.
> 
> These are the results of comparing the sizes of diffs to acrobat
> output for poppler before and after the patch. The diff is just done
> on word order, to try to pick up paragraphs that have been misplaced.
> The filenames refer to the bugzillas where I found these: freedesktop,
> gnome, ubuntu launchpad, and kde.
> 
> (status, filename, unpatched, patched, difference)
> SAME fdo-18531-1.pdf 1215 1218 0%
> SAME gno-333967-1.pdf 971 971 0%
> PASS gno-360722-1.pdf 553 431 22%
> PASS gno-481825-1.pdf 2413 1582 34%
> PASS gno-494078-1.pdf 7494 5462 27%
> PASS gno-500352-1.pdf 11904 11204 5%
> FAIL gno-588476-1.pdf 1192 1277 -7%
> FAIL hig-2.0.pdf 3908 5057 -29%
> SAME kde-184399-1.pdf 159 159 0%
> SAME ubu-181737-1.pdf 18709 18724 0%
> FAIL ubu-251412-1.pdf 528 551 -4%
> PASS ubu-33288-2.pdf 2535 154 93%
> SAME ubu-346403-1.pdf 437 439 0%
> PASS ubu-367770-1.pdf 2955 2408 18%

Not sure I understand the numbers. Do you mean that there are 6 documents that 
improve, 5 that are the same and 3 that are worse?
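
My reading of the script at the bottom of your mail is that the last column is
100 * (unpatched - patched) / unpatched in integer arithmetic, so a positive
value means the patched output has a smaller diff against acrobat. A small
sketch of that reading, in case I got it wrong (the function names
`improvement` and `status` are mine, not from the script):

```shell
# Percentage change in diff size, as I understand the quoted script:
# 100 * (unpatched - patched) / unpatched, truncated to an integer.
improvement() {
    unpatched=$1
    patched=$2
    echo $(( 100 * (unpatched - patched) / unpatched ))
}

# Same thresholds as the script: more than 1% either way counts.
status() {
    d=$(improvement "$1" "$2")
    if [ "$d" -gt 1 ]; then
        echo PASS
    elif [ "$d" -lt -1 ]; then
        echo FAIL
    else
        echo SAME
    fi
}

improvement 2535 154    # the ubu-33288-2.pdf row
status 1192 1277        # the gno-588476-1.pdf row
```

Note that this would divide by zero for a file where the unpatched output
already matches acrobat exactly.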

> 
> The 3 failures were largely due to numbered footnotes or tables; the
> body text was fine. So mostly the patched version is an improvement to
> reading order detection. If I can get the bullet points and numbers to
> be part of the correct block, those failures would go away.

Do you think you'll be able to get that done?

> 
> The test script, in case you want to try this on your corpus; I was
> running this in a directory of pdfs with a subdirectory 'acrobat' for
> my ground truth. I ignored non-ascii characters because the acrobat
> output was in win-1252.

That means having to run acrobat by hand, right? That makes running that script 
on my pdf files unmanageable. On the other hand, I can run a script that 
compares old and new pdftotext output and, where they differ, manually check 
whether I think it's an improvement or not, hoping that there are not MANY 
files that are different :D
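
Roughly what I have in mind is something like the following. This is only a
sketch: `OLD` and `NEW` are assumed names for the installed pdftotext and the
patched build (I reused the path from your script), `changed_pdfs` is a
made-up function name, and it relies on `mktemp` being available.

```shell
#!/bin/sh
# Sketch: list the PDFs whose text output differs between two pdftotext builds.
# OLD and NEW are assumptions; override them from the environment if needed.
OLD=${OLD:-pdftotext}
NEW=${NEW:-$HOME/poppler/utils/pdftotext}

# Print the name of every given PDF for which the two outputs differ.
changed_pdfs() {
    for pdf in "$@"; do
        [ -f "$pdf" ] || continue          # skips an unmatched *.pdf glob too
        old_txt=$(mktemp) || return 1
        new_txt=$(mktemp) || return 1
        "$OLD" "$pdf" "$old_txt"
        "$NEW" "$pdf" "$new_txt"
        # cmp -s prints nothing and exits non-zero when the files differ
        if ! cmp -s "$old_txt" "$new_txt"; then
            echo "$pdf"
        fi
        rm -f "$old_txt" "$new_txt"
    done
}

changed_pdfs *.pdf
```

That only tells me which files changed, not whether the change is for the
better, so the manual check is still needed for every file it prints.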

So do you want me to try that, or are you working on an improved patch?

Albert

> 
> PDF=$1
> TXT=${PDF%%.pdf}.txt
> cp acrobat/$TXT first
> pdftotext $PDF second
> ~/poppler/utils/pdftotext $PDF third
> perl -i.bak -ne 'for(/([A-Za-z0-9.,;:]+)/g){print "$_\n";}' first second third
> DIFF12=$(diff -udwb first second | wc -l)
> DIFF13=$(diff -udwb first third | wc -l)
> DIFF23=$(diff -udwb second third | wc -l)
> DIFF=$(expr \( 100 \* \( $DIFF12 - $DIFF13 \) \) / $DIFF12 )
> STATUS=SAME
> if [ $DIFF -gt 1 ]
> then
>     STATUS=PASS
> fi
> if [ $DIFF -lt -1 ]
> then
>     STATUS=FAIL
> fi
> 
> echo $STATUS $PDF $DIFF12 $DIFF13 ${DIFF}%
>