[poppler] Multicolumn select

Mon Dec 7 18:11:50 PST 2009

2009/12/7 Albert Astals Cid <aacid at kde.org>:
> Sorry for the late reply i've moved job and country meanwhile :D
>
> On Monday 23 November 2009 09:37:42, Baz wrote:
>> 2009/11/18 Albert Astals Cid <aacid at kde.org>:
>> > On Monday, 16 November 2009, Baz wrote:
>> >> I've checked now... yes pdftotext with no flags will hit the new
>> >> reading order code.
>> >
>> > And that is good or bad? :D
>>
>> It turns out, good.
>>
>> These are the results of comparing the sizes of diffs to acrobat
>> output for poppler before and after the patch. The diff is just done
>> on word order, to try to pick up paragraphs that have been misplaced.
>> The filenames refer to the bugzillas where I found these: freedesktop,
>> gnome, ubuntu launchpad, and kde.
>>
>> (status, filename, unpatched, patched, difference)
>> SAME fdo-18531-1.pdf 1215 1218 0%
>> SAME gno-333967-1.pdf 971 971 0%
>> PASS gno-360722-1.pdf 553 431 22%
>> PASS gno-481825-1.pdf 2413 1582 34%
>> PASS gno-494078-1.pdf 7494 5462 27%
>> PASS gno-500352-1.pdf 11904 11204 5%
>> FAIL gno-588476-1.pdf 1192 1277 -7%
>> FAIL hig-2.0.pdf 3908 5057 -29%
>> SAME kde-184399-1.pdf 159 159 0%
>> SAME ubu-181737-1.pdf 18709 18724 0%
>> FAIL ubu-251412-1.pdf 528 551 -4%
>> PASS ubu-33288-2.pdf 2535 154 93%
>> SAME ubu-346403-1.pdf 437 439 0%
>> PASS ubu-367770-1.pdf 2955 2408 18%
>
> Not sure i understand the numbers, do you mean that there are 6 documents that
> improve, 5 that are the same and 3 that are worse?

Yes. Although the size of the differences is also interesting: the
ones that are better are significantly better. The only document
that's significantly worse is the hig document, but looking at that in
detail, this document is a dog's breakfast layout-wise, and goes wrong
when selecting bullet-pointed sections. If I were using that for copy
& paste, it wouldn't bother me much as I could just delete the
out-of-order bullet points. The numbering in the footnotes would be
more of a problem as they would sound bad in a screenreader.
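
For reference, the status and percentage columns boil down to the
arithmetic below. This is only a sketch of what the test script does:
the function name classify is made up for illustration, and the guard
for an empty baseline diff is an addition that the script itself does
not have.

```shell
# classify UNPATCHED PATCHED
# Prints "STATUS PERCENT%" using the same truncating integer arithmetic
# and the same +/-1 thresholds as the test script.
classify() {
    unpatched=$1
    patched=$2
    # Assumption, not in the original script: avoid dividing by zero
    # when the unpatched output already matches the ground truth.
    if [ "$unpatched" -eq 0 ]; then
        if [ "$patched" -eq 0 ]; then
            echo "SAME 0%"
        else
            echo "FAIL n/a"
        fi
        return
    fi
    pct=$(( 100 * (unpatched - patched) / unpatched ))
    status=SAME
    if [ "$pct" -gt 1 ]; then status=PASS; fi
    if [ "$pct" -lt -1 ]; then status=FAIL; fi
    echo "$status ${pct}%"
}
```

With the numbers from the table, classify 2535 154 prints "PASS 93%",
classify 3908 5057 prints "FAIL -29%", and classify 1215 1218 prints
"SAME 0%" because a change of less than 1% truncates to zero.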

>>
>> The 3 failures were largely due to numbered footnotes or tables; the
>> body text was fine. So mostly the patched version is an improvement to
>> reading order detection. If I can get the bullet points and numbers to
>> be part of the correct block, those failures would go away.
>
> Do you think you'll be able to get that done?

Not in the short term. What time I've had to look at this I've been
scratching my head over how to fit bidi selection into the code
without breaking it too much. I was trying to get this:
http://www.w3.org/TR/charmod/#sec-LogicalOrder
My hope was that I could implement bidi selection, then move on
to improving reading order further. However, bidi depends not just on
the start and end position of the selection, and primaryLR, but also
the writing direction in the words where the selection endpoints lie,
/and/ requires words to be in reading order, which needs a bit of a
rewrite. Which got me looking at the larger rewrite that would be
needed to accommodate tagged PDF structures. None of that looks doable
without changing some method signatures, which I've been avoiding up
till now; I've got no idea what depends on TextOutputDev.h.

>
>>
>> The test script, in case you want to try this on your corpus; I was
>> running this in a directory of pdfs with a subdirectory 'acrobat' for
>> my ground truth. I ignored non-ascii characters because the acrobat
>> output was in win-1252.
>
> That means having to run acrobat by hand right? That means running that script
> on my pdf files is unmanageable, on the other hand i can run a script that
> compares old and new pdftotext output, if it's different i manually check if i
> think that's an improvement or not, hoping that there are not MANY files that
> are different :D

Now you see why I was hoping you had a test suite somewhere :)

I was thinking more along the lines of dumping text from a sample of
documents. Also, you will almost certainly find that /most/ files are
different; it's the size of the difference that matters. The ones that
seem to have greatly improved or worsened are the ones I'd look at
manually.
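
Something along these lines would do for an old-vs-new comparison with
no Acrobat ground truth. It is a rough sketch: the function names are
invented here, it assumes both extractors take "input output" arguments
the way pdftotext does, and it reuses the word-splitting character
class from the test script.

```shell
# words FILE: print one token per line, using the same character
# class as the perl one-liner in the test script.
words() {
    grep -oE '[A-Za-z0-9.,;:]+' "$1"
}

# diff_size A B: number of unified-diff lines between the word lists
# of two text files (0 means the word order is identical).
diff_size() {
    a=$(mktemp) && b=$(mktemp) || return 1
    words "$1" > "$a"
    words "$2" > "$b"
    diff -u "$a" "$b" | wc -l | tr -d ' '
    rm -f "$a" "$b"
}

# rank_changes OLD_EXTRACTOR NEW_EXTRACTOR FILE...
# Prints "SIZE FILE" lines, biggest change first, so only the top of
# the list needs checking by hand.
rank_changes() {
    old=$1; new=$2; shift 2
    for pdf in "$@"; do
        o=$(mktemp); n=$(mktemp)
        "$old" "$pdf" "$o"
        "$new" "$pdf" "$n"
        echo "$(diff_size "$o" "$n") $pdf"
        rm -f "$o" "$n"
    done | sort -rn
}
```

Usage would be something like
rank_changes pdftotext ~/poppler/utils/pdftotext *.pdf | head
which lists the documents whose extracted word order changed the most.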

> So do you want me to try that or you are working on an improved patch?

At the end of the day it's up to you how you decide if this series is
acceptable; for pdfs that aren't tagged there is no 'correct' reading
order, only a best guess. I can point out the places where the current
algorithm fails, but whether these errors are serious or not is a
matter of opinion; any automated testing would have to work something
like the perceptual-diff tests in cairo. I have no idea what your
acceptance criteria are, so I can't say whether running a bunch of
comparisons with acrobat would be useful to you. How did you decide
that the /current/ text extraction code was ok?

As for an improved patch... well, like I said, I was looking at other
issues. The only completed fix I have that isn't uploaded is for RTL
docs, where selections across page boundaries were wrong (evince passes
the tl/br corners to signal start/end of page). Everything else is
looking like bigger chunks of code, which I don't want to do too much
with until I've an idea whether these patches are getting closer to
what you want.

I'll take a look at the doc you mentioned in the other mail.

-Baz

>
> Albert
>
>>
>> PDF=$1
>> TXT=${PDF%%.pdf}.txt
>> cp acrobat/$TXT first
>> pdftotext $PDF second
>> ~/poppler/utils/pdftotext $PDF third
>> perl -i.bak -ne 'for(/([A-Za-z0-9.,;:]+)/g){print "$_\n";}' first second third
>> DIFF12=$(diff -udwb first second | wc -l)
>> DIFF13=$(diff -udwb first third | wc -l)
>> DIFF23=$(diff -udwb second third | wc -l)
>> DIFF=$(expr \( 100 \* \( $DIFF12 - $DIFF13 \) \) / $DIFF12 )
>> STATUS=SAME
>> if [ $DIFF -gt 1 ]
>> then
>>     STATUS=PASS
>> fi
>> if [ $DIFF -lt -1 ]
>> then
>>     STATUS=FAIL
>> fi
>>
>> echo $STATUS $PDF $DIFF12 $DIFF13 ${DIFF}%
>>
>