[Pixman] Benchmarked: [PATCH 1/4] Change conditions for setting FAST_PATH_SAMPLES_COVER_CLIP flags

Wed Sep 16 07:30:44 PDT 2015

On Wed, 16 Sep 2015 12:25:59 +0100, Pekka Paalanen <ppaalanen at gmail.com> wrote:
> I tried both "image" and "image16" on an x86_64 (Sandybridge), and got
> no performance differences in the trimmed-cairo-traces set in either
> baseline/cleanup or cleanup/tight.
>
> I also tried with PIXMAN_DISABLE=ssse3 and still got no difference. I
> verified I am really running what I think I am by editing Pixman and
> seeing the effect in the benchmark.
>
> Am I missing something?

Well, this change has a long and tangled history by now. I'll admit my
twin motivations were code cleanup and making it possible to measure the
speed of bilinear operations with a benchmarker, but along the way it has
also spawned affine-bench and cover-test, which isn't a bad thing.

Just by thinking things through, I realised that we would regularly fail
to hit COVER paths if Pixman's caller set the scale factors such that the
centre of the outermost destination pixels aligned with the centre of the
outermost source pixels. There has been some argument about whether this
is representative of how Pixman should be used. I happen to think this is
a perfectly reasonable thing to expect Pixman to support, but there are
other models you can follow, notably setting the scale factors such that
the outer edges of the outermost destination pixels align to the outer
edges of the outermost source pixels. If the Cairo traces use this
latter model, then it's understandable that you rarely hit the edge case
I'm concerned about.
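To make the two models concrete, here is a small sketch in C. It uses 16.16 fixed point (the format of Pixman's `pixman_fixed_t`), places source pixel centres at k + 0.5, and assumes bilinear filtering reads pixels floor(x - 0.5) and floor(x - 0.5) + 1, with the second weighted by the fractional part. The helper names are hypothetical, and each coordinate is computed directly rather than by accumulating a rounded increment, so treat it as a model of the geometry rather than Pixman's own code.

```c
#include <assert.h>
#include <stdint.h>

#define FIXED_1    ((int32_t) 1 << 16)   /* 1.0 in 16.16 fixed point */
#define FIXED_HALF ((int32_t) 1 << 15)   /* 0.5 in 16.16 fixed point */

/* Round a 16.16 value down to an integer (toward minus infinity). */
static int32_t fixed_floor (int32_t v)
{
    int32_t q = v / FIXED_1;    /* C division truncates toward zero */
    if (v % FIXED_1 < 0)
        q -= 1;                 /* negative non-integer: step down */
    return q;
}

/* Model 1: the centres of the outermost destination pixels map onto the
 * centres of the outermost source pixels (0.5 and n_src - 0.5). */
static int32_t centre_aligned_x (int n_src, int n_dst, int i)
{
    assert (n_src > 1 && n_dst > 1 && i >= 0 && i < n_dst);
    return FIXED_HALF +
           (int32_t) ((int64_t) i * (n_src - 1) * FIXED_1 / (n_dst - 1));
}

/* Model 2: the outer edges of the outermost destination pixels map onto
 * the outer edges of the source (0 and n_src), so destination centre
 * i + 0.5 is scaled by n_src / n_dst. Integer division truncates. */
static int32_t edge_aligned_x (int n_src, int n_dst, int i)
{
    assert (n_src > 0 && n_dst > 0 && i >= 0 && i < n_dst);
    return (int32_t) ((int64_t) (2 * i + 1) * n_src * FIXED_1 / (2 * n_dst));
}

/* Bilinear taps for a sample at x: pixels `left` and `right` = left + 1,
 * with `right` weighted by right_weight / FIXED_1 and `left` by the rest. */
static void bilinear_taps (int32_t x, int32_t *left, int32_t *right,
                           int32_t *right_weight)
{
    int32_t t = x - FIXED_HALF;
    *left = fixed_floor (t);
    *right = *left + 1;
    *right_weight = t - *left * FIXED_1;   /* always in [0, FIXED_1) */
}
```

With a 720-pixel source and a 640-pixel destination, the centre-aligned model puts the last sample at 719.5, so its second tap is pixel 720 with zero weight; the edge-aligned model puts it at 719.4375, whose taps are 718 and 719. For an edge-aligned upscale such as 4 to 7 pixels, the first sample's left tap is pixel -1 with nonzero weight, so that case genuinely needs a repeat path.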

The useful thing about the scaled-in-one-axis-only example is that on the
axis that isn't scaled, there is no dispute about how you set the scale
factors, so it sidesteps this argument.

> Should I run the same on rpi2? Or is the best effect on the fast paths
> we haven't merged yet?

To get a sizeable speed difference, you need two things:
1) a plot geometry that aligns the centres of the high-coordinate pixels
of the source and destination
2) a reasonable speed difference between the COVER and repeat versions
of the fast paths or iterators
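Condition 1) matters because of how the clip test treats the zero-weighted tap. The sketch below contrasts two per-axis tests over the range of sample positions: a conservative one that insists both bilinear taps of every sample lie inside the image, and a relaxed one that ignores a tap whose weight is exactly zero. Both are hypothetical models for illustration, not the code in the patch, and the relaxed version is only safe if whatever consumes the flag never reads a zero-weighted pixel, which is an assumption here.

```c
#include <assert.h>
#include <stdint.h>

#define FIXED_1    ((int32_t) 1 << 16)   /* 1.0 in 16.16 fixed point */
#define FIXED_HALF ((int32_t) 1 << 15)   /* 0.5 in 16.16 fixed point */

/* Round a 16.16 value toward minus infinity. */
static int32_t fixed_floor (int32_t v)
{
    int32_t q = v / FIXED_1;    /* C division truncates toward zero */
    if (v % FIXED_1 < 0)
        q -= 1;
    return q;
}

/* Round a 16.16 value toward plus infinity. */
static int32_t fixed_ceil (int32_t v)
{
    return -fixed_floor (-v);
}

/* Hypothetical conservative test: every sample in [x_first, x_last] must
 * have BOTH taps, floor(x - 0.5) and floor(x - 0.5) + 1, inside [0, width).
 * Assumes a positive scale, so checking the two end samples is enough. */
static int covers_conservative (int32_t x_first, int32_t x_last, int width)
{
    assert (x_first <= x_last && width > 0);
    return fixed_floor (x_first - FIXED_HALF) >= 0 &&
           fixed_floor (x_last - FIXED_HALF) + 1 <= width - 1;
}

/* Hypothetical relaxed test: the right tap is ignored when its weight is
 * exactly zero (x - 0.5 is an integer). The left tap always has nonzero
 * weight, so the lower bound is unchanged; the highest pixel that actually
 * contributes is ceil(x - 0.5). */
static int covers_relaxed (int32_t x_first, int32_t x_last, int width)
{
    assert (x_first <= x_last && width > 0);
    return fixed_floor (x_first - FIXED_HALF) >= 0 &&
           fixed_ceil (x_last - FIXED_HALF) <= width - 1;
}
```

For the centre-aligned 720 to 640 case (samples from 0.5 to 719.5), `covers_conservative` returns 0 and `covers_relaxed` returns 1; moving the last sample even one 16.16 unit past 719.5 gives pixel 720 a nonzero weight, and then both tests fail.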

At present, the speed difference will be most marked when you are using
the bilinear iterators from pixman-fast-path.c or pixman-ssse3.c, which
are currently both restricted to COVER plots of a8r8g8b8 source images.
If you test on a Raspberry Pi 2, you also need to allow for the fact
that the ARMv7 implementation has bilinear fast paths for src_8888_8888,
src_8888_0565, over_8888_8888 and add_8888_8888, covering most image
repeat types as well as COVER, so these narrow the iterator's
applicability still further.

In my patch series from last year, I added ARMv6 bilinear iterators for
COVER plots of a8r8g8b8, x8r8g8b8, r5g6b5 and a8 images. Of course, you
won't see their effect because they haven't been accepted yet, but once
they are (as I hope they will be), having the new COVER_CLIP definition
in place should help demonstrate their effectiveness.
Basically, these are all just pieces of one big jigsaw puzzle.

> Or maybe our test set is not enough? I recall having some problems with
> that in the past.

Yes, I know I have a number of fast paths queued up which are not
exercised at all by the Cairo perf traces, but which were being hammered
by the Epiphany browser on the Raspberry Pi. I'm not sure how we can
justify their inclusion if the Cairo perf traces are the only criterion
allowed. I hope there will be some flexibility in this respect.

If you want a more real-world example of when you might encounter
bilinear scaling in one axis only, here's one that I think is plausible.
Consider screen grabs taken of an NTSC SD video at ITU656 sample
positions (720x480), or equally the output of a codec for a video at
that resolution. Now try to display it with bilinear scaling at the
correct aspect ratio on a display with square pixels: the destination
rectangle will likely be chosen to be 640x480, with a vertical pixel
increment of 1 and a horizontal pixel increment of 1.125 (720/640).
There's also a reasonable chance that you'll want to re-plot this 30
times per second, for obvious reasons. You'd hope that this could be
achieved using a COVER fast path or iterator, but with the current flag
definitions they can't be used: on the unscaled vertical axis, the
bottom destination row samples exactly the centre of the bottom source
row, which is precisely the aligned-centres case described above.
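The arithmetic for this example can be checked with a short sketch. It assumes the edge-aligned model with no offset, so destination pixel i samples the source at (i + 0.5) times the increment, expressed in 16.16 fixed point: 1.125 becomes 0x12000 and 1 becomes 0x10000 (Pixman's `pixman_double_to_fixed` performs the same conversion). `scan_axis` and `axis_report` are hypothetical names; the function reports, for the last sample on an axis, the highest source pixel touched by either bilinear tap and the highest one that actually carries weight.

```c
#include <assert.h>
#include <stdint.h>

#define FIXED_1    ((int32_t) 1 << 16)   /* 1.0 in 16.16 fixed point */
#define FIXED_HALF ((int32_t) 1 << 15)   /* 0.5 in 16.16 fixed point */

/* Hypothetical summary of the last bilinear sample along one axis. */
typedef struct
{
    int32_t last_x;            /* 16.16 source coordinate of the last sample */
    int32_t max_tap;           /* highest source pixel either tap touches */
    int32_t max_weighted_tap;  /* highest source pixel with nonzero weight */
} axis_report;

/* Edge-aligned model, no offset: destination pixel i samples the source at
 * (i + 0.5) * increment. Restricted to increment >= 1.0 (unscaled or
 * downscaled axes) so every value is non-negative and integer division
 * acts as floor. */
static axis_report scan_axis (int32_t increment, int n_dst)
{
    assert (increment >= FIXED_1 && n_dst > 0);
    axis_report r;
    r.last_x = (int32_t) ((int64_t) (2 * n_dst - 1) * increment / 2);

    int32_t t = r.last_x - FIXED_HALF;  /* taps sit either side of x - 0.5 */
    int32_t left = t / FIXED_1;         /* floor, since t >= 0 */
    int32_t frac = t % FIXED_1;         /* weight of the right-hand tap */
    r.max_tap = left + 1;
    r.max_weighted_tap = frac != 0 ? left + 1 : left;
    return r;
}
```

Horizontally (increment 0x12000 over 640 pixels), both values come out as 719, inside the 720-pixel-wide source. Vertically (increment 0x10000 over 480 rows), the highest tap is row 480, one past the end of a 480-row image, but its weight is zero, so the highest row that contributes is 479.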

Ben