[Pixman] [PATCH 5/5] test: Add a new benchmarker targeting affine operations

Pekka Paalanen ppaalanen at gmail.com
Tue Apr 14 03:28:42 PDT 2015


On Wed, 08 Apr 2015 02:27:48 +0100
"Ben Avison" <bavison at riscosopen.org> wrote:

> On Tue, 17 Mar 2015 11:12:53 -0000, Pekka Paalanen <ppaalanen at gmail.com> wrote:

> > If there is no transform, why not return the original extents?
> > These have been reduced by a half unit in all four directions.
> 
> It's more obviously relevant in the bilinear scaling case, but a pixel's
> colour is considered to apply to the midpoint of each pixel. Thus an
> image of width N is specifying colours at ordinates 0.5, 1.5, 2.5 ...
> (N-1.5), (N-0.5). Therefore, when you're working out which source pixels
> are required to generate a range of output pixels, it's 0.5 and (N-0.5)
> that you need to feed through any transformation matrix. If the no
> transform case were special-cased, you'd then need to special-case it
> again in the calling function because the 0.5 offset applies (albeit at
> different scaling factors) to both source and destination images, so the
> caller will be adding the 0.5 pixel border back on again.

Ok, so the extents going in to compute_transformed_extents() have a
different meaning than the ones coming out. The ones going in define
the edges of the image, while the ones coming out are reduced to the
corner pixel centers.

Which means that it is incorrect to, say, compute extents for two
sequential transforms one by one but you must combine the transforms
first and then call compute_transformed_extents() just once.

I was just surprised by this convention.
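For my own clarity, the convention can be sketched like this (a
hypothetical illustration in Pixman-style 16.16 fixed point, not the
patch's actual code):

```c
#include <assert.h>
#include <stdint.h>

/* The input box describes the image edges; before transforming, it is
 * reduced to the corner pixel *centers*, i.e. pulled in by half a
 * pixel (0x8000 in 16.16 fixed point) on all four sides. */
typedef struct { int32_t x1, y1, x2, y2; } box_16_16_t;

static void
reduce_to_pixel_centers (box_16_16_t *b)
{
    b->x1 += 0x8000;
    b->y1 += 0x8000;
    b->x2 -= 0x8000;
    b->y2 -= 0x8000;
}
```

This is also why the reduction must happen only once, after all
transforms have been combined.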

> In affine-bench, I calculate image sizes based on bilinear scaling for
> simplicity, since a source or mask image big enough to satisfy COVER_CLIP
> for bilinear scaling will also do so for nearest scaling.
> 
> [compute_transformed_extents]
> >> +    tx1 = ty1 = INT64_MAX;
> >> +    tx2 = ty2 = INT64_MIN;
> >> +
> >> +    for (i = 0; i < 4; ++i)
> >> +    {
> >> +        pixman_fixed_48_16_t tx, ty;
> >> +        pixman_vector_t v;
> >> +
> >> +        v.vector[0] = (i & 0x01)? x1 : x2;
> >> +        v.vector[1] = (i & 0x02)? y1 : y2;
> >> +        v.vector[2] = pixman_fixed_1;
> >> +
> >> +        if (!pixman_transform_point (transform, &v))
> >> +            return FALSE;
> >> +
> >> +        tx = (pixman_fixed_48_16_t)v.vector[0];
> >> +        ty = (pixman_fixed_48_16_t)v.vector[1];
> >
> > This works only for affine. This is called affine-bench, so that is
> > natural, yet might be nice to add an assert or something to guarantee
> > no-one will copy this code to something else that uses projective
> > transforms.
> 
> I might be missing something, but it looks to me like
> pixman_transform_point (through its subroutine
> pixman_transform_point_31_16) does handle projective transforms, and what
> this code segment is doing is iterating over the four corners of the
> destination rectangle and finding the maximum and minimum source x/y by
> considering the transformed position of each corner independently. Surely
> the whole set of destination pixels, once passed through even a
> projective matrix, must still lie within this bounding box?

Ahh, another case where I had wrong assumptions.
pixman_transform_point() indeed does guarantee that v.vector[2] is
1.0. I'm used to assuming it's not guaranteed, because the guarantee
leads to a division by zero for points at infinity. But Pixman cannot
deal with points at infinity anyway, so it sort of makes sense in
relation to pixel images.

> [flush_cache]
> >> +    volatile const char *x = clean_space;
> [...]
> >> +        (void) *x;
> >
> > Are you sure this actually does the read? That the compiler is not
> > allowed to discard the read? That's something I'm never sure of.
> 
> That's the point of the volatile qualifier - the compiler isn't allowed
> to throw away accesses to an object which is declared volatile, it has to
> perform exactly as many accesses as you ask for and in the same order.
> From the C standard:
> 
>     "A volatile declaration may be used to describe an object
>     corresponding to a memory-mapped input/output port or an object
>     accessed by an asynchronously interrupting function. Actions on
>     objects so declared shall not be 'optimized out' by an
>     implementation or reordered except as permitted by the rules for
>     evaluating expressions."
> 
> What you do need to be careful of these days is that this says nothing
> about the number of or ordering of accesses as viewed by any part of the
> system beyond the L1 cache, including any other cores - that's where
> memory barriers come in. But that's not relevant for the purposes of
> cleaning the cache.

Very good.

> >> +        x += 32;
> >
> > Why 32? Is this CACHELINE_LENGTH from lowlevel-blt-bench.c?
> 
> Yes, that's the origin. You might argue for reading the cacheline length
> at runtime, as with cache and page size, but it's less critical here. If
> your cacheline length was actually 64 bytes, then reading every 32nd byte
> would still pull in every cacheline. Reading every single byte would have
> the same effect but be a bit of a waste of time; the increment only needs
> to be as short as the shortest cacheline you're likely to run with. I
> guess I didn't want to call it CACHELINE_LENGTH so as not to worry people
> who know their cachelines are different, and since the constant is only
> used once in a function called flush_cache I hoped it would be obvious
> what it was for.

Your revised patch addresses any concerns I might have had here.

> > It might also be worth considering adding support for clock_gettime,
> > because that allows you to use CLOCK_MONOTONIC or even
> > CLOCK_MONOTONIC_RAW. Though, maybe that's not necessary since below you
> > use a 5 second run time.
> 
> In other places I've had more luck removing noise from benchmarking
> results by using getrusage, though that doesn't have very good
> granularity for short tests like this one. It seemed least likely to
> raise objections if I stuck to the existing mechanisms that Pixman
> already uses though...

Agreed. I also missed the point that clock() returns the processor time
used by the process; I was only looking at gettimeofday(), which uses
the wall clock... hrmm, that's a mismatch.

Yes, there might be ways to improve this, but it's not a matter for
this patch. Let's stick to the old conventions for now.

> 
> [bench]
> >> +        t->matrix[0][2] = pixman_int_to_fixed (src_x + x);
> >> +        t->matrix[1][2] = pixman_int_to_fixed (src_y);
> >> +        pixman_image_set_transform (src_image, t);
> >
> > Hmm, what's the idea here with the transform?
> 
> It's another trick borrowed from lowlevel-blt-bench, and ensures we don't
> unfairly test any one relation of starting pixel to cacheline boundary.
> If I remember right, sticking the source offset in the transform's
> translation column was to ensure it's applied after rather than before
> any scaling factors (since we're concerned with cacheline alignment
> rather than pixel alignment).

Yes. So the matrices are indexed as matrix[row][column] in Pixman?
Looking at pixman_transform_init_translate(), that seems to be the case.
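To make the layout concrete, a small sketch mirroring the
matrix[row][column] convention in plain 16.16 fixed point (illustrative
names, not Pixman's own types):

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_16_16_t;
#define FIXED_1 (1 << 16)

/* matrix[row][column]: the translation vector lives in column 2, so
 * it is applied after the scale factors in columns 0 and 1 -- which is
 * why the bench puts its cacheline offsets there. */
static void
init_translate (fixed_16_16_t m[3][3],
                fixed_16_16_t tx, fixed_16_16_t ty)
{
    m[0][0] = FIXED_1; m[0][1] = 0;       m[0][2] = tx;
    m[1][0] = 0;       m[1][1] = FIXED_1; m[1][2] = ty;
    m[2][0] = 0;       m[2][1] = 0;       m[2][2] = FIXED_1;
}
```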

> The fact that we loop over values of x from 0 to 63 is why we over-
> allocate the source, mask and destination images by 64 pixels, which
> you asked about in a few other places. Once again I took my cue
> from lowlevel-blt-bench in the selection of the number 64, using
> literal 63 and 64 constants in the code rather than a symbolic name.
> I'm guessing the number was originally chosen to suit 8bpp images and
> 64-byte cachelines, or 4bpp images and 32-byte cachelines.

Ok.

A shortish comment somewhere would be nice, explaining why you do this
offsetting and that the offsetting is the reason for allocating a
margin.

> [main]
> > All this nesting should be refactored away by using a helper function
> > with an early return for each else-branch.
> 
> I'm not sure the helper function looks a whole lot neater, since it needs
> lots of arguments to return things through, but I've had a go.

It looks much better now. It's not indenting out of hand, the code
flow is easier to read, and you can easily see what happens in the
else-cases. The code structure matches the idea: a flat sequence of
parsing operations.

> > I don't see invalid arguments detected at all?
> 
> Taking the lead from lowlevel-blt-bench again. All the Pixman tests seem
> to assume you know more-or-less what you're doing :) I've added some
> checks anyway.
> 
> > Should we be able to give, say, src type without giving all the
> > matrix elements? That doesn't work atm.
> 
> In practice, affine plots are usually achieved by a combination of scaled
> fetcher, combiner and writeback functions, and you're most likely to be
> only benchmarking one of those at a time. I was working on the basis that
> "1 0 0 1" wasn't too much to have to type if you're testing different
> combiners, and if you're testing different pixel formats for the fetcher
> or writeback functions, you only need "src" for the combine type. The
> argument that you're by far the most likely to need to try with lots of
> different settings is the X scale, and so I quite deliberately put that
> one first!

As it makes sense to you, suits me. :-)

> >> +    xmin = pixman_fixed_to_int (transformed.x1 - 8 * pixman_fixed_e - pixman_fixed_1 / 2);
> >
> > I was wondering what is going on here with the 8*e, grepped for
> > pixman_fixed_e, and noticed that this code comes from pixman.c. I think
> > you should copy also the comments, not only code.
> 
> Back in my original 32-patch series, this gets changed by a later patch.
> However, in order to refactor this to the start of the series, I had to
> back-port this fudge factor to match the one in pixman.c - otherwise you
> end up getting an enormous speed hit because the operations we generate
> fail to match the conditions for the COVER_CLIP flag being set.

I see you added a short comment, that's good.

> It's perhaps not the best place to discuss it here, but my arguments for
> removing the fudge factor stem from the fact that pixels are sampled at
> half-integer coordinates, as I described earlier. This means that if you
> use the most obvious 2x enlargement transformation (a matrix with
> diagonal coefficients of 0.5) then you end up with this pattern, where
> s[] is an array of source pixels and d[] is an array of destination
> pixels:
> 
> d[0] = 0.25 * s[-1] + 0.75 * s[0]
> d[1] =                0.75 * s[0] + 0.25 * s[1]
> d[2] =                0.25 * s[0] + 0.75 * s[1]
> d[3] =                              0.75 * s[1] + 0.25 * s[2]
> 
> Of note is the fact that if your starting X coordinate is 0 (which is
> likely if you're plotting an entire image) then you end up with a small
> contribution in the first destination pixel from outside the source
> image, which requires COVER_CLIP to be unset and the image repeat type to
> be considered, which means we can't use a simple scaled fetcher, meaning
> that the whole plot is slowed down. In the case of ARMv6, this is a large
> factor, because we really don't have enough spare registers to be
> checking for repeats inside the inner loop of the scaled fetchers.
> 
> One simple way round this is to apply a 0.5 pixel translation in the
> transformation matrix, so that the pattern becomes:
> 
> d[0] =                1.00 * s[0]
> d[1] =                0.50 * s[0] + 0.50 * s[1]
> d[2] =                              1.00 * s[1]
> d[3] =                              0.50 * s[1] + 0.50 * s[2]
> 
> but this is thwarted by the 8/65536ths of a pixel fudge factor. I can't
> see why the fudge factor is needed at all, at least not in the affine
> case; it's described as being to compensate for rounding errors, but
> there is no rounding involved in fixed-point affine transforms.

Maybe the rounding refers to the pixman_fixed_to_int() calls? I could
imagine it is to guarantee that an xmin=0.5 gets really rounded to 0,
and xmax=0.5 gets rounded to 1.

To be honest, both cases you describe sound strange to me. Surely if I
use an affine transform matrix { 0.5 0 0; 0 1 0 }, I'd expect
d[0] = 0.5 * s[0] + 0.5 * s[1]
when assuming the box filter (if I got my terminology right)...

Well, this is complicated and not something we can change either, so
never mind. As long as the bench program hits the code paths you want,
it's fine.

> >> +        flush_cache ();
> >
> > I wonder if flush_cache here has any meaning... bench is doing multiple
> > rounds, so flush_cache affects only the first round. Is that intended?
> 
> I think my hope was that since we're processing images of size 1920 *
> 1080 pixels, you wouldn't have any particular pixels persisting in the
> cache from the previous iteration (unless perhaps if you're testing a
> high-factor enlargement on a platform that has caches that don't allocate
> on write).

Right, so is there any need to flush_cache() at all?

> >> +        bench (op, &t, src_image, mask_image, dest_image, src_x, src_y, UINT32_MAX, 5000000, &n, &t1, pixman_image_composite_wrapper);
> >> +        bench (op, &t, src_image, mask_image, dest_image, src_x, src_y, n, UINT32_MAX, NULL, &t2, pixman_image_composite_empty);
> >> +        /* The result indicates the output rate in megapixels/second */
> >> +        printf ("%6.2f\n", (double) n * WIDTH * HEIGHT / (t1 - t2));
> >> +    }
> >
> > I find it a little odd to have a block like this. Maybe other code
> > should be moved into functions instead? Or this code?
> 
> It's deliberately doing "nothing" just as many times as it did
> "something" for the first 5 seconds - it's so it can cancel out the time
> spent doing operation dispatch inside Pixman. It's similar to the
> technique used by lowlevel-blt-bench - except that it used memcpy to
> estimate the relative speed of the platform, with the results that some
> operations could be very slow to benchmark.

I think that part I understood. My comment was about having a {} block
without any flow control statements for it. You are just using it to
scope some variables, which suggests that the block should actually be
a separate function.

There would be lots of arguments to that function, yes, which makes me
wonder why lowlevel-blt-bench doesn't use a struct to pass the test
setup parameters around.
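Such a struct might look something like this (purely hypothetical
field names; nothing here is from the actual code):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical parameter block for passing one benchmark setup around
 * instead of a long argument list. */
typedef struct
{
    int      op;            /* composite operator              */
    int      src_x, src_y;  /* source offset for this run      */
    uint32_t max_n;         /* iteration cap                   */
    uint32_t max_time_us;   /* time cap in microseconds        */
} bench_params_t;
```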

> > Did you check how many rounds a Raspberry Pi 1 can do? Does it do more
> > than two in 5 seconds with an affine transform that is nearly identity
> > but not quite?
> 
> The Pi 1 manages about 90 (nearest) or 14 (bilinear).
> The Pi 2 manages about 250 (nearest) or 80 (bilinear).
> Sounds like plenty to me.

Yeah, I think that should be enough.


Thanks,
pq

