[poppler] Testing Re: Multicolumn select

Sat Dec 26 02:57:09 PST 2009

2009/12/23 Albert Astals Cid <aacid at kde.org>:
> On Monday 14 December 2009 01:21:54, Baz wrote:
>> 2009/12/12 Albert Astals Cid <aacid at kde.org>:
>> > On Wednesday 09 December 2009 23:22:09, Baz wrote:
>> >> 2009/12/9 Albert Astals Cid <aacid at kde.org>:
>> >> > On Wednesday 09 December 2009 14:51:59, Baz wrote:
>> >> >> 2009/12/8 Albert Astals Cid <aacid at kde.org>:
>> >> >> > What we want is something that makes text extraction/selection
>> >> >> > better, the definition of better is the problem here :D
>> >> >>
>> >> >> Ok. So it sounds like it would be worth adding tests in, so we can be
>> >> >> explicit about what we want text extraction to do.
>> >> >>
>> >> >> I could do this in two ways:
>> >> >> - write a test harness that calls the apis directly (following the
>> >> >> example of cairo). This has the advantage that more apis could be
>> >> >> tested later, but complicates writing the tests; and in any case most
>> >> >> other tests will be about rendering not text extraction. Since this
>> >> >> would be a unit test, it's also fragile to API changes.
>> >> >> - extend pdftotext to allow me to specify start and end points for
>> >> >> text extraction (page,x,y). This would make writing tests easy - just
>> >> >> simple shell scripts along the lines of the git test suite. This
>> >> >> feature could be useful to end users too, I guess.
>> >> >>
>> >> >> I like the second plan better, since it supports building ad-hoc
>> >> >> tests with pdfs attached to bugs. Since we already have -f and -l,
>> >> >> (and -x, -y do something unrelated to the selection) I'm thinking of
>> >> >> int args -fx, -fy, -lx, -ly, which default to (0,0) and (pageWidth,
>> >> >> pageHeight).
>> >> >
>> >> > Why isn't x,y,W,H enough? AFAIR they define which area gets extracted.
>> >>
>> >> It's not the same area. That mechanism crops every page from start to
>> >> finish to the same x,y,W,H box before dumping the text. It's useful for
>> >> removing header/footer sections in a whole-document dump. It also
>> >> doesn't hit the text selection code at all.
>> >
>> > I'm lost now, you originally said pdftotext was using your new code and
>> > now you say it doesn't?
>>
>> I am talking about how pdftotext works, whether or not you have my changes.
>>
>> pdftotext does not use xyWH for *selection* (the way it would work in
>> evince) it uses it to *crop*.
>>
>> However the text that it does output is in the same order as it would
>> be if you selected *all the text on the cropped page*.
>>
>> OK, it was misleading to say 'It also doesn't hit the text selection
>> code at all'; I should have said 'it never passes text selection
>> coordinates other than those that would select *everything*'. So it
>> tests reading order but not the selection points in any meaningful
>> way.
>>
>> Does this make it clearer?
>
> Not really (sorry), what's the difference between getting all the text and then
> cropping the part we want and selecting just the part we want? Shouldn't the
> result be the same?

No. To take a simple example, suppose we have bidi text that displays
like this (lowercase characters are LTR, uppercase characters are
RTL):

abc IHG FED jkl

Dumping this as text should result in a string of characters with
Unicode direction marks (U+200F RIGHT-TO-LEFT MARK, U+200E LEFT-TO-RIGHT
MARK) indicating where the text reverses:

abc <0x200F>DEF GHI<0x200E> jkl

Now let's crop this to the rectangle containing 'bc IH'. Dumping this
as text will result in:

bc <0x200F>HI<0x200E>

Note that the RTL chunk has had its start cut off and doesn't make
sense. If I make a reading-order selection with the same rectangle,
'bc' and 'G FED' should be highlighted, and the copied text should be:

bc <0x200F>DEF G<0x200E>
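
For anyone who wants to play with this, here is a minimal Python sketch of
the single-line example. It is only an illustration, not poppler code: the
function names are made up, the RTL run is hard-coded rather than detected,
and I'm assuming the selection runs from the left edge of 'b' to the right
edge of 'H' (which, inside an RTL run, is the logical *start* of 'H').

```python
LOGICAL = "abc DEF GHI jkl"   # reading order; uppercase stands for RTL text
RTL_RUN = (4, 11)             # assumption: logical slice [4, 11) is one RTL run


def visual_to_logical(n, run):
    """Return a list mapping each visual column to its logical index."""
    start, end = run
    return (list(range(start))
            + list(range(end - 1, start - 1, -1))   # the RTL run is drawn reversed
            + list(range(end, n)))


def mark(text):
    """Wrap the uppercase (RTL) span in the direction marks used in the example."""
    idx = [i for i, ch in enumerate(text) if ch.isupper()]
    if not idx:
        return text
    first, last = idx[0], idx[-1] + 1
    return text[:first] + "<0x200F>" + text[first:last] + "<0x200E>" + text[last:]


def crop(left, right):
    """Keep the characters in visual columns left..right, emitted in logical order."""
    v2l = visual_to_logical(len(LOGICAL), RTL_RUN)
    kept = sorted(v2l[left:right + 1])
    return mark("".join(LOGICAL[i] for i in kept))


def caret(col, edge):
    """Logical offset of the 'left' or 'right' edge of a visual column."""
    i = visual_to_logical(len(LOGICAL), RTL_RUN)[col]
    rtl = RTL_RUN[0] <= i < RTL_RUN[1]
    # In an RTL run the left edge of a glyph is its logical end.
    return i + 1 if (edge == "right") != rtl else i


def select(left, right):
    """Reading-order selection from the left edge of one column to the right edge of another."""
    a, b = sorted((caret(left, "left"), caret(right, "right")))
    return mark(LOGICAL[a:b])


print(crop(1, 5))     # bc <0x200F>HI<0x200E>
print(select(1, 5))   # bc <0x200F>DEF G<0x200E>
```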

That shows how reading-order selection and cropping differ for a
single line of bidirectional text. They also differ for unidirectional
text when the text spans more than one line, e.g. given this display
(in a fixed-width font):

abcdef
ghijkl

Crop this to the rectangle with 'b' at the top left and 'k' at the
bottom right. The dumped text will be 'bcde hijk'. Now copy and paste
a reading-order selection; the result will be 'bcdef ghijk'.
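
The same two-line case as a rough Python sketch, assuming a fixed-width
grid where every character occupies one cell. The names crop and select
are purely illustrative and are not poppler or pdftotext APIs.

```python
def crop(lines, top, left, bottom, right):
    """Keep only the characters inside the rectangle, row by row."""
    return " ".join(line[left:right + 1] for line in lines[top:bottom + 1])


def select(lines, start, end):
    """Reading-order selection from start (row, col) to end (row, col), inclusive."""
    (r0, c0), (r1, c1) = start, end
    if r0 == r1:
        return lines[r0][c0:c1 + 1]
    parts = [lines[r0][c0:]]          # rest of the first line
    parts.extend(lines[r0 + 1:r1])    # whole lines in between
    parts.append(lines[r1][:c1 + 1])  # start of the last line
    return " ".join(parts)


page = ["abcdef", "ghijkl"]
print(crop(page, 0, 1, 1, 4))        # bcde hijk
print(select(page, (0, 1), (1, 4)))  # bcdef ghijk
```

Both calls get the same two corner points ('b' and 'k'); only the
interpretation of those points differs.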

So even on a single page, with one or two lines, cropping and
selection give different answers.

----

BTW I'm reworking things again. Carlos thought this mechanism should
get its own API, and looking around at the way text selection works in
other software, it would make sense to build something closer to
at-spi's Accessibility::Text interface. Supporting that interface also
turns out to be a longstanding evince bug:

http://www.gnome.org/~billh/at-spi-idl/html/interfaceAccessibility_1_1Text.html
http://projects.gnome.org/outreach/a11y/tasks/evince/

At the moment I'm trying to see if I can build this as a separate
'AccessibleOutputDev' which supports a subset of those operations, and
a command line utility for testing. Since this frees me from the
internal structures of TextOutputDev I'm going to build this for
tagged pdf first; that's the simple case where all the reading order
is already marked up. Then I'd put back in the code to guess the
layout for other pdfs.

Not going to progress quickly on this though as the family Christmas
is going on all around!

Merry Christmas
-Baz

>
> Albert
>
>>
>> > Albert
>> >
>> >> By contrast, a reading-order selection, even on a single page, may
>> >> include text that lies outside the rectangle from the startpoint to
>> >> the endpoint. Also, the xyWH mechanism applies the start/end points to
>> >> every page, instead of only the start/end page as you would with a
>> >> selection.
>> >>
>> >> -Baz
>> >>
>> >> > Albert
>> >> >
>> >> >> Does this sound useful to you?
>> >> >>
>> >> >> -Baz
>> >> >> _______________________________________________
>> >> >> poppler mailing list
>> >> >> poppler at lists.freedesktop.org
>> >> >> http://lists.freedesktop.org/mailman/listinfo/poppler