[poppler] Testing Re: Multicolumn select

Albert Astals Cid aacid at kde.org
Wed Dec 23 13:58:43 PST 2009


On Monday 14 December 2009 01:21:54, Baz wrote:
> 2009/12/12 Albert Astals Cid <aacid at kde.org>:
> > On Wednesday 09 December 2009 23:22:09, Baz wrote:
> >> 2009/12/9 Albert Astals Cid <aacid at kde.org>:
> >> > On Wednesday 09 December 2009 14:51:59, Baz wrote:
> >> >> 2009/12/8 Albert Astals Cid <aacid at kde.org>:
> >> >> > What we want is something that makes text extraction/selection
> >> >> > better; the definition of better is the problem here :D
> >> >>
> >> >> Ok. So it sounds like it would be worth adding tests in, so we can be
> >> >> explicit about what we want text extraction to do.
> >> >>
> >> >> I could do this in two ways:
> >> >> - write a test harness that calls the APIs directly (following the
> >> >> example of cairo). This has the advantage that more APIs could be
> >> >> tested later, but complicates writing the tests; and in any case most
> >> >> other tests will be about rendering, not text extraction. Since this
> >> >> would be a unit test, it's also fragile to API changes.
> >> >> - extend pdftotext to allow me to specify start and end points for
> >> >> text extraction (page,x,y). This would make writing tests easy - just
> >> >> simple shell scripts along the lines of the git test suite. This
> >> >> feature could be useful to end users too, I guess.
> >> >>
> >> >> I like the second plan better, since it supports building ad-hoc
> >> >> tests with PDFs attached to bugs. Since we already have -f and -l
> >> >> (and -x, -y do something unrelated to the selection), I'm thinking of
> >> >> int args -fx, -fy, -lx, -ly, which default to (0,0) and (pageWidth,
> >> >> pageHeight).
> >> >
> >> > Why isn't x,y,W,H enough? AFAIR they define which area gets extracted.
> >>
> >> It's not the same area. That mechanism crops every page from start to
> >> finish to the same x,y,W,H box before dumping the text. It's useful for
> >> removing header/footer sections in a whole-document dump. It also
> >> doesn't hit the text selection code at all.
> >
> > I'm lost now, you originally said pdftotext was using your new code and
> > now you say it doesn't?
> 
> I am talking about how pdftotext works, whether or not you have my changes.
> 
> pdftotext does not use xyWH for *selection* (the way it would work in
> evince); it uses it to *crop*.
> 
> However, the text that it does output is in the same order as it would
> be if you selected *all the text on the cropped page*.
> 
> Ok, it was misleading to say 'It also doesn't hit the text selection
> code at all'. I should have said 'it never passes text selection
> coordinates other than those that would select *everything*'. So it
> tests reading order but not the selection points in any meaningful
> way.
> 
> Does this make it clearer?

Not really (sorry). What's the difference between getting all the text and
then cropping the part we want, and selecting just the part we want?
Shouldn't the result be the same?

Albert

> 
> > Albert
> >
> >> By contrast, a reading-order selection, even on a single page, may
> >> include text that lies outside the rectangle from the startpoint to
> >> the endpoint. Also, the xyWH mechanism applies the start/end points to
> >> every page, instead of only the start/end page as you would with a
> >> selection.
> >>
> >> -Baz
> >>
> >> > Albert
> >> >
> >> >> Does this sound useful to you?
> >> >>
> >> >> -Baz
> >> >> _______________________________________________
> >> >> poppler mailing list
> >> >> poppler at lists.freedesktop.org
> >> >> http://lists.freedesktop.org/mailman/listinfo/poppler
> 
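To make the crop-versus-selection distinction discussed in this thread concrete, here is a minimal sketch in Python. It does not use poppler at all: the `Word` type, the two-column `PAGE`, and the `crop`, `nearest` and `select` helpers are all hypothetical stand-ins, and the "nearest word to a point" rule is an assumption chosen for brevity, not how poppler picks selection anchors. It only illustrates the point quoted above, that a reading-order selection may include text lying outside the rectangle between the start point and the end point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # left edge
    y: float  # top edge; y grows downward


# Hypothetical two-column page, listed in reading order:
# left column top to bottom, then right column top to bottom.
PAGE = [
    Word("L1", 10, 10), Word("L2", 10, 20), Word("L3", 10, 30),
    Word("R1", 60, 10), Word("R2", 60, 20), Word("R3", 60, 30),
]


def crop(words, x, y, w, h):
    """Model of an x,y,W,H crop: keep words whose origin is inside the box."""
    return [wd.text for wd in words
            if x <= wd.x <= x + w and y <= wd.y <= y + h]


def nearest(words, px, py):
    """Index of the word whose origin is closest to the point (px, py)."""
    return min(range(len(words)),
               key=lambda i: (words[i].x - px) ** 2 + (words[i].y - py) ** 2)


def select(words, start, end):
    """Model of a reading-order selection: every word between the two anchors."""
    i, j = sorted((nearest(words, *start), nearest(words, *end)))
    return [wd.text for wd in words[i:j + 1]]


# A box that just encloses L2 and R2 ...
cropped = crop(PAGE, 5, 15, 60, 10)
# ... versus a selection dragged from L2 to R2.
selected = select(PAGE, (10, 20), (60, 20))

print(cropped)
print(selected)
```

In this toy model the crop keeps only the two words inside the box, while the selection also picks up the rest of the left column and the top of the right column, because those words fall between the two anchors in reading order.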

