[poppler] Multicolumn select

Tue Dec 8 15:38:09 PST 2009

On Tuesday 08 December 2009 02:11:50, Baz wrote:
> 2009/12/7 Albert Astals Cid <aacid at kde.org>:
> > Sorry for the late reply, I've changed jobs and moved country meanwhile :D
> >
> > On Monday 23 November 2009 09:37:42, Baz wrote:
> >> 2009/11/18 Albert Astals Cid <aacid at kde.org>:
> >> > On Monday, 16 November 2009, Baz wrote:
> >> >> I've checked now... yes pdftotext with no flags will hit the new
> >> >> reading order code.
> >> >
> >> > And that is good or bad? :D
> >>
> >> It turns out, good.
> >>
> >> These are the results of comparing the sizes of the diffs against
> >> Acrobat's output for poppler before and after the patch. The diff is
> >> done on word order only, to try to pick up paragraphs that have been
> >> misplaced. The filenames refer to the bugzillas where I found these:
> >> freedesktop, gnome, ubuntu launchpad, and kde.
> >>
> >> status  filename          unpatched  patched  difference
> >> SAME    fdo-18531-1.pdf        1215     1218          0%
> >> SAME    gno-333967-1.pdf        971      971          0%
> >> PASS    gno-360722-1.pdf        553      431         22%
> >> PASS    gno-481825-1.pdf       2413     1582         34%
> >> PASS    gno-494078-1.pdf       7494     5462         27%
> >> PASS    gno-500352-1.pdf      11904    11204          5%
> >> FAIL    gno-588476-1.pdf       1192     1277         -7%
> >> FAIL    hig-2.0.pdf            3908     5057        -29%
> >> SAME    kde-184399-1.pdf        159      159          0%
> >> SAME    ubu-181737-1.pdf      18709    18724          0%
> >> FAIL    ubu-251412-1.pdf        528      551         -4%
> >> PASS    ubu-33288-2.pdf        2535      154         93%
> >> SAME    ubu-346403-1.pdf        437      439          0%
> >> PASS    ubu-367770-1.pdf       2955     2408         18%
> >
> > Not sure I understand the numbers. Do you mean that there are 6 documents
> > that improve, 5 that are the same and 3 that are worse?
> 
> Yes. Although the size of the differences is also interesting: the
> ones that are better are significantly better. The only document
> that's significantly worse is the hig document, but looking at it in
> detail, that document is a dog's breakfast layout-wise, and goes wrong
> when selecting bullet-pointed sections. If I were using it for copy
> & paste, it wouldn't bother me much, as I could just delete the
> out-of-order bullet points. The numbering in the footnotes would be
> more of a problem, as it would sound bad in a screenreader.
> 
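
For anyone else reading the table: the status column can be reproduced
from the two diff sizes alone. A minimal sketch, assuming the same formula
and the same 1% thresholds as the test script quoted at the end of this
mail; the function name classify is made up for illustration and is not
part of poppler.

```shell
#!/bin/sh
# classify UNPATCHED PATCHED
# UNPATCHED and PATCHED are the line counts of the word-order diffs
# against the Acrobat output. "classify" is a hypothetical name.
classify() {
    unpatched=$1
    patched=$2
    # Integer percentage by which the patch shrinks the diff
    # (negative when the diff grows); same formula as the test script.
    pct=$(( 100 * (unpatched - patched) / unpatched ))
    status=SAME
    if [ "$pct" -gt 1 ]; then
        status=PASS
    fi
    if [ "$pct" -lt -1 ]; then
        status=FAIL
    fi
    echo "$status ${pct}%"
}

classify 553 431     # gno-360722-1.pdf
classify 3908 5057   # hig-2.0.pdf
```

The two calls print "PASS 22%" and "FAIL -29%", matching the rows above.
The division is integer division, so a small change such as 1215 -> 1218
rounds to 0% and counts as SAME. An unpatched diff size of 0 would divide
by zero and is not handled here.
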
> >> The 3 failures were largely due to numbered footnotes or tables; the
> >> body text was fine. So mostly the patched version is an improvement to
> >> reading order detection. If I can get the bullet points and numbers to
> >> be part of the correct block, those failures would go away.
> >
> > Do you think you'll be able to get that done?
> 
> Not in the short term. What time I've had to look at this I've been
> scratching my head over how to fit bidi selection into the code
> without breaking it too much. I was trying to get this:
> http://www.w3.org/TR/charmod/#sec-LogicalOrder
> I was hoping that I could implement bidi selection, then move on
> to improving reading order further. However, bidi depends not just on
> the start and end position of the selection, and primaryLR, but also
> the writing direction in the words where the selection endpoints lie,
> /and/ requires words to be in reading order, which needs a bit of a
> rewrite. Which got me looking at the larger rewrite that would be
> needed to accommodate tagged pdf structures. None of that looks doable
> without changing some method signatures, which I've been avoiding up
> till now; I've got no idea what depends on TextOutputDev.h.

Here's the guideline: do not hesitate to change things, but don't change
them just for the fun of changing :D

> 
> >> The test script, in case you want to try this on your corpus; I was
> >> running this in a directory of pdfs with a subdirectory 'acrobat' for
> >> my ground truth. I ignored non-ascii characters because the acrobat
> >> output was in win-1252.
> >
> > That means having to run Acrobat by hand, right? That makes running that
> > script on my pdf files unmanageable. On the other hand, I can run a
> > script that compares old and new pdftotext output and, where it differs,
> > manually check whether I think that's an improvement or not, hoping that
> > there are not MANY files that are different :D
> 
> Now you see why I was hoping you had a test suite somewhere :)
> 
> I was thinking more along the lines of dumping text from a sample of
> documents. Also, you will almost certainly find that /most/ files are
> different; it's the size of the difference that matters. The ones that
> seem to have greatly improved or worsened are the ones I'd look at
> manually.
> 
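
The old-versus-new comparison could be sketched roughly as below. This
assumes the text has already been dumped into two directories, old/ and
new/ (for example with the installed pdftotext and the patched one), one
.txt per pdf. The directory layout and the function names words, diff_size
and rank_changes are assumptions made for illustration. It counts changed
words rather than the lines of a unified diff, so its numbers are not
comparable with the table earlier in this thread.

```shell
#!/bin/sh
# Sketch only: the helper names and the old/ new/ layout are assumptions.

# words FILE: print one ASCII word per line, using the same character
# class as the perl one-liner in the test script.
words() {
    LC_ALL=C tr -cs 'A-Za-z0-9.,;:' '\n' < "$1" | grep -v '^$' || true
}

# diff_size A B: number of words removed or added between two text files.
diff_size() {
    a="${TMPDIR:-/tmp}/words.$$.a"
    b="${TMPDIR:-/tmp}/words.$$.b"
    words "$1" > "$a"
    words "$2" > "$b"
    # In diff's default output only '<' and '>' lines carry words.
    diff "$a" "$b" | grep -c '^[<>]' || true
    rm -f "$a" "$b"
}

# rank_changes OLD_DIR NEW_DIR: print "count filename" for every .txt
# present in both directories, largest change first.
rank_changes() {
    for old in "$1"/*.txt; do
        name=$(basename "$old")
        [ -f "$2/$name" ] || continue
        echo "$(diff_size "$old" "$2/$name") $name"
    done | sort -rn
}
```

Running "rank_changes old new" then gives a list where only the top
entries need checking by hand, which is the point made above about the
size of the difference mattering more than whether a file changed at all.
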
> > So do you want me to try that, or are you working on an improved patch?
> 
> At the end of the day it's up to you how you decide if this series is
> acceptable; for pdfs that aren't tagged there is no 'correct' reading
> order, only a best guess. I can point out the places where the current
> algorithm fails, but whether these errors are serious or not is a
> matter of opinion; any automated testing would have to work something
> like the perceptual-diff tests in cairo. I have no idea what your
> acceptance criteria are, so I can't say whether running a bunch of
> comparisons with acrobat would be useful to you. How did you decide
> that the /current/ text extraction code was ok?

It's the one that came with xpdf, so we (the poppler project) haven't touched
that code at all, I think (maybe we added some code for actual text, but not
much).

> 
> As for an improved patch...well like I said I was looking at other
> issues. The only completed fix I have that isn't uploaded was that in
> RTL docs, selections across page boundaries were wrong (evince passes
> the tl/br corners to signal start/end of page). Everything else is
> looking like bigger chunks of code, which I don't want to do too much
> with until I've an idea whether these patches are getting closer to
> what you want.

What we want is something that makes text extraction/selection better; the
definition of better is the problem here :D

> I'll take a look at the doc you mentioned in the other mail.

Thanks :-)

Albert

> 
> -Baz
> 
> > Albert
> >
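> >> # Run from a directory of pdfs, with Acrobat's text output in ./acrobat
> >> PDF=$1
> >> TXT=${PDF%%.pdf}.txt
> >> cp "acrobat/$TXT" first                  # ground truth from Acrobat
> >> pdftotext "$PDF" second                  # unpatched (installed) pdftotext
> >> ~/poppler/utils/pdftotext "$PDF" third   # patched build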
> >> # Reduce each file to one ASCII word per line so the diff sees word order only
> >> perl -i.bak -ne 'for(/([A-Za-z0-9.,;:]+)/g){print "$_\n";}' first second third
> >> DIFF12=$(diff -udwb first second | wc -l)
> >> DIFF13=$(diff -udwb first third | wc -l)
> >> DIFF23=$(diff -udwb second third | wc -l)
> >> # Percentage by which the patch shrinks the diff against Acrobat (negative = worse)
> >> DIFF=$(expr \( 100 \* \( $DIFF12 - $DIFF13 \) \) / $DIFF12 )
> >> STATUS=SAME
> >> if [ $DIFF -gt 1 ]
> >> then
> >>     STATUS=PASS
> >> fi
> >> if [ $DIFF -lt -1 ]
> >> then
> >>     STATUS=FAIL
> >> fi
> >>
> >> echo $STATUS $PDF $DIFF12 $DIFF13 ${DIFF}%