[Pixman] [PATCH 5/5] test: Add a new benchmarker targeting affine operations

Ben Avison bavison at riscosopen.org
Tue Apr 7 18:27:48 PDT 2015

On Tue, 17 Mar 2015 11:12:53 -0000, Pekka Paalanen <ppaalanen at gmail.com> wrote:
> affine-bench should be added to .gitignore too.


>> +#define L2CACHE_SIZE (128 * 1024)
> I see lowlevel-blt-bench.c also uses L2CACHE_SIZE, but where does it
> come from? If it's really an arch/platform-independent constant, maybe
> some test header could have it?

As you will no doubt have noticed, affine-bench is heavily inspired by
lowlevel-blt-bench. Admittedly it's a bit of a cheat to assume one L2
cache size, but in lowlevel-blt-bench, it's only used for sizing the
images used for the L2 test. For those purposes, it's only important that
it's a number big enough that the images are flushed out of the L1 cache
but small enough that they aren't flushed out to memory, so as long as
it's bigger than the largest L1 cache you're likely to encounter and no
smaller than the smallest L2 cache, then it'll do, and saves lots of
messy OS-specific code to determine the cache sizes.

Admittedly, now I look back on the use of the same constant in
affine-bench, I've taken the point of view that it's too difficult in the
presence of significant transformation coefficients to calculate image
sizes that sit within a particular cache, so I'm only bothering to test
the memory-constrained case. To do that, I'm using L2CACHE_SIZE to size
an array used to flush the L2 cache - but that needs the *largest*
typical L2 cache size, not the smallest, so 128K won't cut it here.

While I have easy access to the cache sizes on CPUs that I'm actively
targeting, a general-purpose benchmarker should probably be suitable for
all architectures. Should we read it from the OS somehow? (Anyone know
how best to do that for, say, Linux, Windows, OS X and iOS, in that case?)

> Looks like Pixman style is using separate statements for struct
> definition and typedefs.

Not exclusively - I think the bits I've looked at usually do, but now
that I count them, I grant you the most common form is separate typedefs.
This particular example was cut-and-pasted from pixman.c but can easily
be changed.

>> +static pixman_bool_t
>> +compute_transformed_extents (pixman_transform_t *transform,
>> +                             const pixman_box32_t *extents,
>> +                             box_48_16_t *transformed)
> CODING_STYLE is asking for alignment here for the arg names.

This was also a cut-and-paste, but good point.

> If there is no transform, why not return the original extents?
> These have been reduced by a half unit in all four directions.

It's more obviously relevant in the bilinear scaling case, but a pixel's
colour is considered to apply to the midpoint of each pixel. Thus an
image of width N is specifying colours at ordinates 0.5, 1.5, 2.5 ...
(N-1.5), (N-0.5). Therefore, when you're working out which source pixels
are required to generate a range of output pixels, it's 0.5 and (N-0.5)
that you need to feed through any transformation matrix. If the no
transform case were special-cased, you'd then need to special-case it
again in the calling function because the 0.5 offset applies (albeit at
different scaling factors) to both source and destination images, so the
caller will be adding the 0.5 pixel border back on again.

In affine-bench, I calculate image sizes based on bilinear scaling for
simplicity, since a source or mask image big enough to satisfy COVER_CLIP
for bilinear scaling will also do so for nearest scaling.

>> +    tx1 = ty1 = INT64_MAX;
>> +    tx2 = ty2 = INT64_MIN;
>> +
>> +    for (i = 0; i < 4; ++i)
>> +    {
>> +        pixman_fixed_48_16_t tx, ty;
>> +        pixman_vector_t v;
>> +
>> +        v.vector[0] = (i & 0x01)? x1 : x2;
>> +        v.vector[1] = (i & 0x02)? y1 : y2;
>> +        v.vector[2] = pixman_fixed_1;
>> +
>> +        if (!pixman_transform_point (transform, &v))
>> +            return FALSE;
>> +
>> +        tx = (pixman_fixed_48_16_t)v.vector[0];
>> +        ty = (pixman_fixed_48_16_t)v.vector[1];
> This works only for affine. This is called affine-bench, so that is
> natural, yet might be nice to add an assert or something to guarantee
> no-one will copy this code to something else that uses projective
> transforms.

I might be missing something, but it looks to me like
pixman_transform_point (through its subroutine
pixman_transform_point_31_16) does handle projective transforms, and what
this code segment is doing is iterating over the four corners of the
destination rectangle and finding the maximum and minimum source x/y by
considering the transformed position of each corner independently. Surely
the whole set of destination pixels, once passed through even a
projective matrix, must still lie within this bounding box?

> Too long line, needs breaking up.


[various places]
> Add empty line.


>> +    *bits = aligned_malloc (4096, stride * height);
> Is there a #define to get the 4096 from? I assume it's page size? What
> if the platform has big pages?

I assume the same thing. It's quite a common page size as I understand
it. I know it's not much excuse, but I was copying lowlevel-blt-bench
here, which also uses an explicit 4096 in its aligned_malloc call.

This kind of comes under the same category as the L2 cache size. What are
people's opinions on adding OS-specific code to discover this sort of
thing?

>> +    volatile const char *x = clean_space;
>> +        (void) *x;
> Are you sure this actually does the read? That the compiler is not
> allowed to discard the read? That's something I'm never sure of.

That's the point of the volatile qualifier - the compiler isn't allowed
to throw away accesses to an object which is declared volatile; it has to
perform exactly as many accesses as you ask for, and in the same order.
From the C standard:

    "A volatile declaration may be used to describe an object
    corresponding to a memory-mapped input/output port or an object
    accessed by an asynchronously interrupting function. Actions on
    objects so declared shall not be 'optimized out' by an
    implementation or reordered except as permitted by the rules for
    evaluating expressions."

What you do need to be careful of these days is that this says nothing
about the number of or ordering of accesses as viewed by any part of the
system beyond the L1 cache, including any other cores - that's where
memory barriers come in. But that's not relevant for the purposes of
cleaning the cache.

>> +        x += 32;
> Why 32? Is this CACHELINE_LENGTH from lowlevel-blt-bench.c?

Yes, that's the origin. You might argue for reading the cacheline length
at runtime, as with cache and page size, but it's less critical here. If
your cacheline length was actually 64 bytes, then reading every 32nd byte
would still pull in every cacheline. Reading every single byte would have
the same effect but be a bit of a waste of time; the increment only needs
to be as short as the shortest cacheline you're likely to run with. I
guess I didn't want to call it CACHELINE_LENGTH so as not to worry people
who know their cachelines are different, and since the constant is only
used once in a function called flush_cache I hoped it would be obvious
what it was for.
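Putting the two points above together, the flush helper is essentially
this (a standalone sketch in the spirit of the patch, not the exact
affine-bench code; the 8 MB buffer size is an assumption standing in
for "larger than any L2 cache we expect"):

```c
#include <stddef.h>

#define FLUSH_SIZE    (8 * 1024 * 1024) /* assumed > largest L2 cache */
#define MIN_CACHELINE 32                /* shortest cacheline expected */

static char clean_space[FLUSH_SIZE];

static void
flush_cache (void)
{
    /* volatile obliges the compiler to perform every read */
    volatile const char *x = clean_space;
    const char *clean_end = clean_space + sizeof clean_space;

    while (x < clean_end)
    {
        /* one read per (assumed minimum) cacheline pulls the whole
           line in, evicting whatever was cached before */
        (void) *x;
        x += MIN_CACHELINE;
    }
}
```

Reading every 32nd byte still touches every line even on a 64-byte
cacheline machine, as described above - it just does twice the reads
strictly needed.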

>> +/* Obtain current time in microseconds modulo 2^32 */
>> +uint32_t
>> +gettimei (void)
>> +{
>> +    struct timeval tv;
>> +
>> +    gettimeofday (&tv, NULL);
>> +    return tv.tv_sec * 1000000 + tv.tv_usec;
>> +#else
>> +    return (double) clock () / (double) CLOCKS_PER_SEC;
> This returns seconds; copy-pasta from utils.c?

Oops, yes, you caught me. Since the final results are in megapixels per
second (= pixels per microsecond) it seemed pointless to divide the time
by 1000000, then divide by 1000000 again after the time had been used in
the denominator of the speed calculation, especially since this avoids
having to go via floating-point intermediate values.
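With the fallback branch corrected to the same microsecond unit, the
function would read something like this (HAVE_GETTIMEOFDAY is assumed
to come from the build system, as elsewhere in Pixman):

```c
#include <stdint.h>
#ifdef HAVE_GETTIMEOFDAY
#include <sys/time.h>
#else
#include <time.h>
#endif

/* Obtain current time in microseconds modulo 2^32 */
uint32_t
gettimei (void)
{
#ifdef HAVE_GETTIMEOFDAY
    struct timeval tv;

    gettimeofday (&tv, NULL);
    return tv.tv_sec * 1000000 + tv.tv_usec;
#else
    /* clock () ticks CLOCKS_PER_SEC times per second; rescale to
       microseconds rather than returning seconds */
    return (uint64_t) clock () * 1000000 / CLOCKS_PER_SEC;
#endif
}
```

The deliberate modulo-2^32 truncation is harmless because only
differences between nearby timestamps are ever used.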

> It might also be worth considering adding support for clock_gettime,
> because that allows you to use CLOCK_MONOTONIC or even
> CLOCK_MONOTONIC_RAW. Though, maybe that's not necessary since below you
> use a 5 second run time.

In other places I've had more luck removing noise from benchmarking
results by using getrusage, though that doesn't have very good
granularity for short tests like this one. It seemed least likely to
raise objections if I stuck to the existing mechanisms that Pixman
already uses though...

>> +        t->matrix[0][2] = pixman_int_to_fixed (src_x + x);
>> +        t->matrix[1][2] = pixman_int_to_fixed (src_y);
>> +        pixman_image_set_transform (src_image, t);
> Hmm, what's the idea here with the transform?

It's another trick borrowed from lowlevel-blt-bench, and ensures we don't
unfairly test any one relation of starting pixel to cacheline boundary.
If I remember right, sticking the source offset in the transform's
translation column was to ensure it's applied after rather than before
any scaling factors (since we're concerned with cacheline alignment
rather than pixel alignment).

The fact that we loop over values of x from 0 to 63 is why we over-
allocate the source, mask and destination images by 64 pixels, which you
asked about in a few other places. Once again I was taking my cue from
lowlevel-blt-bench in the selection of the number 64, and in using
literal 63 and 64 constants in the code rather than a symbolic name. I'm
guessing 64 was originally chosen to suit 8bpp images and 64-byte
cachelines, or 4bpp images and 32-byte cachelines.

> All this nesting should be refactored away by using a helper function
> with an early return for each else-branch.

I'm not sure the helper function looks a whole lot neater, since it needs
lots of arguments to return things through, but I've had a go.

> I don't see invalid arguments detected at all?

Taking the lead from lowlevel-blt-bench again. All the Pixman tests seem
to assume you know more-or-less what you're doing :) I've added some
checks anyway.

> Should we be able to give, say, src type without giving all the
> matrix elements? That doesn't work atm.

In practice, affine plots are usually achieved by a combination of scaled
fetcher, combiner and writeback functions, and you're most likely to be
only benchmarking one of those at a time. I was working on the basis that
"1 0 0 1" wasn't too much to have to type if you're testing different
combiners, and if you're testing different pixel formats for the fetcher
or writeback functions, you only need "src" for the combine type. The
argument that you're by far the most likely to need to try with lots of
different settings is the X scale, and so I quite deliberately put that
one first!

>> +    xmin = pixman_fixed_to_int (transformed.x1 - 8 * pixman_fixed_e - pixman_fixed_1 / 2);
> I was wondering what is going on here with the 8*e, grepped for
> pixman_fixed_e, and noticed that this code comes from pixman.c. I think
> you should copy also the comments, not only code.

Back in my original 32-patch series, this gets changed by a later patch.
However, in order to refactor this to the start of the series, I had to
back-port this fudge factor to match the one in pixman.c - otherwise you
end up getting an enormous speed hit because the operations we generate
fail to match the conditions for the COVER_CLIP flag being set.

It's perhaps not the best place to discuss it here, but my arguments for
removing the fudge factor stem from the fact that pixels are sampled at
half-integer coordinates, as I described earlier. This means that if you
use the most obvious 2x enlargement transformation (a matrix with
diagonal coefficients of 0.5) then you end up with this pattern, where
s[] is an array of source pixels and d[] is an array of destination
pixels:

d[0] = 0.25 * s[-1] + 0.75 * s[0]
d[1] =                0.75 * s[0] + 0.25 * s[1]
d[2] =                0.25 * s[0] + 0.75 * s[1]
d[3] =                              0.75 * s[1] + 0.25 * s[2]

Of note is the fact that if your starting X coordinate is 0 (which is
likely if you're plotting an entire image) then you end up with a small
contribution in the first destination pixel from outside the source
image, which requires COVER_CLIP to be unset and the image repeat type to
be considered, which means we can't use a simple scaled fetcher, meaning
that the whole plot is slowed down. In the case of ARMv6, this is a large
factor, because we really don't have enough spare registers to be
checking for repeats inside the inner loop of the scaled fetchers.

One simple way round this is to apply a 0.5 pixel translation in the
transformation matrix, so that the pattern becomes:

d[0] =                1.00 * s[0]
d[1] =                0.50 * s[0] + 0.50 * s[1]
d[2] =                              1.00 * s[1]
d[3] =                              0.50 * s[1] + 0.50 * s[2]

but this is thwarted by the 8/65536ths of a pixel fudge factor. I can't
see why the fudge factor is needed at all, at least not in the affine
case; it's described as being to compensate for rounding errors, but
there is no rounding involved in fixed-point affine transforms.

>> +        flush_cache ();
> I wonder if flush_cache here has any meaning... bench is doing multiple
> rounds, so flush_cache affects only the first round. Is that intended?

I think my hope was that since we're processing images of size 1920 *
1080 pixels, you wouldn't have any particular pixels persisting in the
cache from the previous iteration (unless perhaps if you're testing a
high-factor enlargement on a platform that has caches that don't allocate
on write).

>> +        bench (op, &t, src_image, mask_image, dest_image, src_x, src_y, UINT32_MAX, 5000000, &n, &t1, pixman_image_composite_wrapper);
>> +        bench (op, &t, src_image, mask_image, dest_image, src_x, src_y, n, UINT32_MAX, NULL, &t2, pixman_image_composite_empty);
>> +        /* The result indicates the output rate in megapixels/second */
>> +        printf ("%6.2f\n", (double) n * WIDTH * HEIGHT / (t1 - t2));
>> +    }
> I find it a little odd to have a block like this. Maybe other code
> should be moved into functions instead? Or this code?

It's deliberately doing "nothing" just as many times as it did
"something" for the first 5 seconds - it's so it can cancel out the time
spent doing operation dispatch inside Pixman. It's similar to the
technique used by lowlevel-blt-bench - except that it used memcpy to
estimate the relative speed of the platform, with the result that some
operations could be very slow to benchmark.
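Stripped of the Pixman specifics, the calibration pattern is: run the
real operation for a fixed wall-clock budget and count iterations n,
then run an "empty" function (everything except the actual compositing)
exactly n times and subtract its time. A generic standalone
illustration of that pattern, not the affine-bench code itself:

```c
#include <stdint.h>
#include <time.h>

typedef void (*op_fn) (void);

static void noop (void) {}

/* Current CPU time in microseconds modulo 2^32 */
static uint32_t
now_us (void)
{
    return (uint32_t) ((uint64_t) clock () * 1000000 / CLOCKS_PER_SEC);
}

/* Run fn up to max_n times or until budget_us elapses; report the
   iteration count (if wanted) and the elapsed time. */
static void
bench (op_fn fn, uint32_t max_n, uint32_t budget_us,
       uint32_t *n_out, uint32_t *t_out)
{
    uint32_t start = now_us ();
    uint32_t n = 0;

    while (n < max_n && now_us () - start < budget_us)
    {
        fn ();
        n++;
    }
    *t_out = now_us () - start;
    if (n_out)
        *n_out = n;
}
```

The first call uses (UINT32_MAX, 5000000) to time-limit the real
operation, the second uses (n, UINT32_MAX) so the empty run does the
same number of dispatches, and the rate is then
n * WIDTH * HEIGHT / (t1 - t2) megapixels per second.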

> Did you check how many rounds a Raspberry Pi 1 can do? Does it do more
> than two in 5 seconds with an affine transform that is nearly identity
> but not quite?

The Pi 1 manages about 90 (nearest) or 14 (bilinear).
The Pi 2 manages about 250 (nearest) or 80 (bilinear).
Sounds like plenty to me.

