[poppler] Testing Re: Multicolumn select

Albert Astals Cid aacid at kde.org
Fri Dec 11 21:44:47 PST 2009


On Wednesday 09 December 2009 23:22:09, Baz wrote:
> 2009/12/9 Albert Astals Cid <aacid at kde.org>:
> > On Wednesday 09 December 2009 14:51:59, Baz wrote:
> >> 2009/12/8 Albert Astals Cid <aacid at kde.org>:
> >> > What we want is something that makes text extraction/selection better,
> >> > the definition of better is the problem here :D
> >>
> >> Ok. So it sounds like it would be worth adding tests in, so we can be
> >> explicit about what we want text extraction to do.
> >>
> >> I could do this in two ways:
> >> - write a test harness that calls the APIs directly (following the
> >> example of cairo). This has the advantage that more APIs could be
> >> tested later, but complicates writing the tests; and in any case most
> >> other tests will be about rendering, not text extraction. Since this
> >> would be a unit test, it's also fragile to API changes.
> >> - extend pdftotext to allow me to specify start and end points for
> >> text extraction (page,x,y). This would make writing tests easy - just
> >> simple shell scripts along the lines of the git test suite. This
> >> feature could be useful to end users too, I guess.
> >>
> >> I like the second plan better, since it supports building ad-hoc tests
> >> with PDFs attached to bugs. Since we already have -f and -l (and -x,
> >> -y do something unrelated to the selection), I'm thinking of int args
> >> -fx, -fy, -lx, -ly, which default to (0,0) and (pageWidth, pageHeight).
> >
> > Why isn't x,y,W,H enough? AFAIR they define which area gets extracted.
> 
> It's not the same area. That mechanism crops every page from start to
> finish to the same x,y,W,H box before dumping the text. It's useful for
> removing header/footer sections in a whole-document dump. It also
> doesn't hit the text selection code at all.

I'm lost now: you originally said pdftotext was using your new code, and
now you say it isn't?

Albert

> 
> By contrast, a reading-order selection, even on a single page, may
> include text that lies outside the rectangle from the startpoint to
> the endpoint. Also, the xyWH mechanism applies the start/end points to
> every page, instead of only the start/end page as you would with a
> selection.
> 
> -Baz
> 
> > Albert
> >
> >> Does this sound useful to you?
> >>
> >> -Baz
> >> _______________________________________________
> >> poppler mailing list
> >> poppler at lists.freedesktop.org
> >> http://lists.freedesktop.org/mailman/listinfo/poppler
> 
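
The difference Baz describes between the x,y,W,H crop and a reading-order
selection can be illustrated with a small sketch. This is not poppler code:
the Word type, the PAGE data and both function names are hypothetical, and
the page is assumed to be already sorted into reading order (left column top
to bottom, then right column).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float
    y: float  # y grows downward, as in page coordinates


# Hypothetical two-column page, listed in reading order.
PAGE = [
    Word("L1", 50, 100),
    Word("L2", 50, 200),
    Word("L3", 50, 300),
    Word("R1", 350, 100),
    Word("R2", 350, 200),
    Word("R3", 350, 300),
]


def crop_extract(words, x, y, w, h):
    """Keep only words whose position falls inside the x,y,W,H box."""
    return [wd.text for wd in words
            if x <= wd.x < x + w and y <= wd.y < y + h]


def nearest_index(words, px, py):
    """Index of the word closest to the point (px, py)."""
    return min(range(len(words)),
               key=lambda i: (words[i].x - px) ** 2 + (words[i].y - py) ** 2)


def select_reading_order(words, start, end):
    """Return every word between the start and end points in reading order."""
    lo, hi = sorted((nearest_index(words, *start),
                     nearest_index(words, *end)))
    return [wd.text for wd in words[lo:hi + 1]]


# Same two corner points, two different results.
print(crop_extract(PAGE, 40, 190, 320, 20))                # ['L2', 'R2']
print(select_reading_order(PAGE, (40, 190), (360, 210)))   # ['L2', 'L3', 'R1', 'R2']
```

The selection picks up L3 and R1, which lie outside the rectangle spanned by
the two points, while the crop only returns what sits inside the box.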

