[poppler] Testing Re: Multicolumn select

Sun Dec 13 17:21:54 PST 2009

2009/12/12 Albert Astals Cid <aacid at kde.org>:
> On Wednesday 09 December 2009 23:22:09, Baz wrote:
>> 2009/12/9 Albert Astals Cid <aacid at kde.org>:
>> > On Wednesday 09 December 2009 14:51:59, Baz wrote:
>> >> 2009/12/8 Albert Astals Cid <aacid at kde.org>:
>> >> > What we want is something that makes text extraction/selection better,
>> >> > the definition of better is the problem here :D
>> >>
>> >> Ok. So it sounds like it would be worth adding tests in, so we can be
>> >> explicit about what we want text extraction to do.
>> >>
>> >> I could do this in two ways:
>> >> - write a test harness that calls the apis directly (following the
>> >> example of cairo). This has the advantage that more apis could be
>> >> tested later, but complicates writing the tests; and in any case most
>> >> other tests will be about rendering not text extraction. Since this
>> >> would be a unit test, it's also fragile to API changes.
>> >> - extend pdftotext to allow me to specify start and end points for
>> >> text extraction (page,x,y). This would make writing tests easy - just
>> >> simple shell scripts along the lines of the git test suite. This
>> >> feature could be useful to end users too, I guess.
>> >>
>> >> I like the second plan better, since it supports building ad-hoc tests
>> >> with pdfs attached to bugs. Since we already have -f and -l, (and -x,
>> >> -y do something unrelated to the selection) I'm thinking of int args
>> >> -fx, -fy, -lx, -ly, which default to (0,0) and (pageWidth, pageHeight).
>> >
>> > Why isn't x,y,W,H enough? AFAIR they define which area gets extracted.
>>
>> It's not the same area. That mechanism crops every page from start to
>> finish to the same x,y,W,H box before dumping the text. It's useful for
>> removing header/footer sections in a whole-document dump. It also
>> doesn't hit the text selection code at all.
>
> I'm lost now, you originally said pdftotext was using your new code and now
> you say it doesn't?

I am talking about how pdftotext works, whether or not you have my changes.

pdftotext does not use xyWH for *selection* (the way it would work in
evince); it uses it to *crop*.

However, the text that it does output is in the same order as it would
be if you selected *all the text on the cropped page*.

OK, it was misleading to say 'It also doesn't hit the text selection
code at all'. I should have said 'it never passes text selection
coordinates other than those that would select *everything*'. So it
tests reading order but not the selection points in any meaningful
way.
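
To illustrate the difference, here is a toy sketch in Python. It is not
poppler code: the Word type, the PAGE data and the crop/select helpers
are all made up for illustration, and the real selection logic is more
involved. It assumes the words are already sorted into reading order.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float
    y: float


# Hypothetical two-column page, already in reading order:
# left column top to bottom, then right column top to bottom.
PAGE = [
    Word("a1", 10, 10), Word("a2", 10, 20), Word("a3", 10, 30),
    Word("b1", 60, 10), Word("b2", 60, 20), Word("b3", 60, 30),
]


def crop(words, x, y, w, h):
    """Crop: keep only words whose origin lies inside the x,y,W,H box."""
    return [wd.text for wd in words
            if x <= wd.x <= x + w and y <= wd.y <= y + h]


def nearest(words, px, py):
    """Index of the word whose origin is closest to the point (px, py)."""
    return min(range(len(words)),
               key=lambda i: (words[i].x - px) ** 2 + (words[i].y - py) ** 2)


def select(words, start, end):
    """Selection: every word in reading order between the two points."""
    first, last = nearest(words, *start), nearest(words, *end)
    if first > last:
        first, last = last, first
    return [wd.text for wd in words[first:last + 1]]


# Start point on "a2", end point on "b2".
cropped = crop(PAGE, 10, 20, 50, 0)          # box spanned by the two points
selected = select(PAGE, (10, 20), (60, 20))  # reading-order selection
print(cropped)   # ['a2', 'b2']
print(selected)  # ['a2', 'a3', 'b1', 'b2']
```

In this toy model the selection picks up "a3" and "b1", which lie outside
the box between the two points, while the crop drops them. Cropping to
the whole page and selecting from corner to corner give the same words
in the same order, which is the only case pdftotext exercises today.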

Does this make it clearer?

>
> Albert
>
>>
>> By contrast, a reading-order selection, even on a single page, may
>> include text that lies outside the rectangle from the startpoint to
>> the endpoint. Also, the xyWH mechanism applies the start/end points to
>> every page, instead of only the start/end page as you would with a
>> selection.
>>
>> -Baz
>>
>> > Albert
>> >
>> >> Does this sound useful to you?
>> >>
>> >> -Baz
>> >> _______________________________________________
>> >> poppler mailing list
>> >> poppler at lists.freedesktop.org
>> >> http://lists.freedesktop.org/mailman/listinfo/poppler