[poppler] Testing Re: Multicolumn select

Baz brian.ewins at gmail.com
Wed Dec 9 15:22:09 PST 2009


2009/12/9 Albert Astals Cid <aacid at kde.org>:
> On Wednesday 09 December 2009 14:51:59, Baz wrote:
>> 2009/12/8 Albert Astals Cid <aacid at kde.org>:
>> > What we want is something that makes text extraction/selection better,
>> > the definition of better is the problem here :D
>>
>> Ok. So it sounds like it would be worth adding tests in, so we can be
>> explicit about what we want text extraction to do.
>>
>> I could do this in two ways:
>> - write a test harness that calls the APIs directly (following the
>> example of cairo). This has the advantage that more APIs could be
>> tested later, but complicates writing the tests; and in any case most
>> other tests will be about rendering, not text extraction. Since this
>> would be a unit test, it's also fragile to API changes.
>> - extend pdftotext to allow me to specify start and end points for
>> text extraction (page,x,y). This would make writing tests easy - just
>> simple shell scripts along the lines of the git test suite. This
>> feature could be useful to end users too, I guess.
>>
>> I like the second plan better, since it supports building ad-hoc tests
>> with pdfs attached to bugs. Since we already have -f and -l, (and -x,
>> -y do something unrelated to the selection) I'm thinking of int args
>> -fx, -fy, -lx, -ly, which default to (0,0) (pageWidth, pageHeight).
>
> Why isn't x,y,W,H enough? AFAIR they define which area gets extracted.

It's not the same area. That mechanism crops every page from start to
finish to the same x,y,W,H box before dumping the text. It's useful for
removing header/footer sections in a whole-document dump. It also
doesn't hit the text selection code at all.
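
For reference, a small sketch of how a script would drive the existing
crop mechanism. The flags (-f, -l, -x, -y, -W, -H) are the ones
pdftotext already has; the helper name crop_command is made up for
illustration, and nothing here touches the selection code.

```python
def crop_command(pdf, x, y, w, h, first=None, last=None):
    """Build a pdftotext argv that crops every page in range to one box.

    crop_command is a hypothetical helper; only the flags are pdftotext's.
    """
    argv = ["pdftotext",
            "-x", str(x), "-y", str(y),
            "-W", str(w), "-H", str(h)]
    if first is not None:
        argv += ["-f", str(first)]   # first page to convert
    if last is not None:
        argv += ["-l", str(last)]    # last page to convert
    # "-" as the output file sends the text to stdout.
    return argv + [pdf, "-"]
```

Note that the box is a single set of numbers: there is nowhere to say
"start here on the first page, stop there on the last page".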

By contrast, a reading-order selection, even on a single page, may
include text that lies outside the rectangle from the start point to
the end point. Also, the x,y,W,H mechanism applies the same box to
every page, instead of applying the start point only to the first page
and the end point only to the last page, as you would with a selection.
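
To make the difference concrete, here is a toy model. It is only a
sketch, not poppler's code: Word, crop and select are made-up names,
the word positions are invented, and reading order is assumed to be a
plain single-column top-to-bottom, left-to-right sort.

```python
from typing import NamedTuple


class Word(NamedTuple):
    page: int
    x: float
    y: float
    text: str


def crop(words, x, y, w, h):
    """Keep words whose origin falls inside the same box on every page."""
    return [wd.text for wd in words
            if x <= wd.x < x + w and y <= wd.y < y + h]


def select(words, start, end):
    """Keep words between two (page, x, y) points in reading order.

    Assumes a single column: order is by page, then y, then x.
    """
    def key(page, x, y):
        return (page, y, x)

    lo, hi = key(*start), key(*end)
    ordered = sorted(words, key=lambda wd: key(wd.page, wd.x, wd.y))
    return [wd.text for wd in ordered
            if lo <= key(wd.page, wd.x, wd.y) <= hi]


# Invented layout: two words per line, three lines on page 1, one on page 2.
WORDS = [
    Word(1, 10, 10, "alpha"), Word(1, 60, 10, "beta"),
    Word(1, 10, 20, "gamma"), Word(1, 60, 20, "delta"),
    Word(1, 10, 30, "epsilon"), Word(1, 60, 30, "zeta"),
    Word(2, 10, 10, "eta"), Word(2, 60, 10, "theta"),
]

# Crop box x in [50, 70), y in [10, 21): applied to page 2 as well.
cropped = crop(WORDS, 50, 10, 20, 11)

# Selection from (page 1, x=50, y=10) to (page 1, x=70, y=20).
selected = select(WORDS, (1, 50, 10), (1, 70, 20))
```

With this data, cropped is beta, delta, theta (the box also catches
page 2), while selected is beta, gamma, delta: gamma sits at x=10, well
outside the rectangle between the two points, but it is between them in
reading order.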

-Baz

>
> Albert
>
>>
>> Does this sound useful to you?
>>
>> -Baz
>> _______________________________________________
>> poppler mailing list
>> poppler at lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/poppler
>>

