[Pixman] [PATCH 3/5] ARMv6: New blit routines

Wed Jan 23 07:16:49 PST 2013

On Wed, 23 Jan 2013 13:22:19 -0000, Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:

> The L1/L2 numbers are bogus, because ARM11 (and Cortex-A8 in default
> configuration) does not allocate data into cache on write misses.
> There is some code here to take read-allocate cache into account:
>     http://cgit.freedesktop.org/pixman/tree/test/lowlevel-blt-bench.c?id=pixman-0.28.2#n178
> But it uses 64 byte steps (which are bigger than 32 bytes cache line
> in ARM11), and also it does such cache "prefetch" only for the first
> scanline.

Those are some very good points. I've been seeing terrible variability in
L1 and L2 profile results in particular - hence all the statistical post-
processing I've been doing on the results to try to make sense of them -
and this could help explain that.

I had seen the code that attempts to ensure that the source and
destination buffers are in the cache (note that it should really be
fetching the mask buffer as well, which is another fault with it).
However, I had overlooked the fact that "dst" and "src" are
"uint32_t *"s, so the index increment of 16 is indeed 64 bytes, as you
say, and therefore misses out half of the ARM11 L1 cache lines. The
failure to pull in subsequent scan lines in the L2 cache test also seems
like a major oversight, and one which will have distorted profile results
on *all* platforms, not just ARM.
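
For illustration, here is a minimal sketch of the sort of loop I have in
mind. It is not the code currently in lowlevel-blt-bench.c:
touch_buffer() and CACHE_LINE_BYTES are names made up for this example,
and the value 32 is just the ARM11 line length taken as an assumption.
It reads one word per assumed cache line on every scanline, rather than
only on the first one.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: shortest cache line we care about, in bytes (ARM11 = 32). */
#define CACHE_LINE_BYTES 32

/*
 * Hypothetical helper: read one word from each assumed cache line of
 * every scanline, so that a read-allocate cache pulls the buffer in.
 * width_bytes and stride_bytes are in bytes and are assumed to be
 * multiples of 4. The sum is returned only so that the compiler
 * cannot discard the loads.
 */
static uint32_t
touch_buffer (const uint32_t *buf, size_t width_bytes,
              size_t height, size_t stride_bytes)
{
    const volatile uint8_t *row = (const volatile uint8_t *) buf;
    uint32_t sum = 0;
    size_t x, y;

    for (y = 0; y < height; y++)
    {
        for (x = 0; x < width_bytes; x += CACHE_LINE_BYTES)
            sum += *(const volatile uint32_t *) (row + x);
        row += stride_bytes;
    }
    return sum;
}
```

The same helper would be called for the source, destination and mask
buffers. If a buffer does not start on a cache line boundary, the final
line of each row can still be missed, so a real fix would probably want
to touch the last word of each row as well.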

What would people's view be on fixing these bugs in lowlevel-blt-bench?
Obviously it would affect the numbers people get when running it on all
platforms.

It would be nice to be able to adapt the stride in that bit of code
inside bench_L() to suit the cache line length of the platform in use,
but I fear that's unlikely to be possible in a platform- and OS-
independent way. In which case, a lowest-common-denominator approach
would seem to be the best bet; does anyone have a feeling for what the
shortest cache lines are likely to be across all platforms supported by
pixman?
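
To make the trade-off concrete, here is a rough sketch of one
possibility, assuming a glibc-style sysconf() that knows about
_SC_LEVEL1_DCACHE_LINESIZE. That name is a glibc extension rather than
POSIX, and it can report 0 or -1 where the C library does not know the
answer, so the sketch falls back to a fixed guess. cache_line_bytes()
and FALLBACK_LINE_BYTES are hypothetical names, and 32 is only an
assumed lowest common denominator.

```c
#include <stddef.h>
#if defined(__unix__) || defined(__linux__)
#include <unistd.h>
#endif

/* Assumption: a conservative guess for when the platform won't say. */
#define FALLBACK_LINE_BYTES 32

/* Hypothetical helper: best-effort L1 data cache line length in bytes. */
static size_t
cache_line_bytes (void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    /* glibc extension; may return 0 or -1 if the value is unknown. */
    long n = sysconf (_SC_LEVEL1_DCACHE_LINESIZE);

    if (n > 0)
        return (size_t) n;
#endif
    return FALLBACK_LINE_BYTES;
}
```

Even with something like this, the fallback still has to be chosen by
hand, which is why a fixed lowest-common-denominator stride may turn out
to be simpler.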

> Still the numbers for M test are interesting and demonstrate that using
> LDM with ORR arithmetic is better for unaligned reads.

I can't help thinking there must be more to it than that. I expected that
when using preloads, the data would effectively already be in the L1
cache by the time the LDR/LDM instructions acted on it, and from
benchmarking of a simple test case I wrote (which was deliberately
confined to the L1 cache) I saw the opposite result, with unaligned LDRs
coming out ahead. Is there something wrong with my understanding of how
preloads operate then?

Of course, using unaligned LDRs also had the nice benefit of not needing
an additional register to hold the data prior to doing the shifting, so
changing to LDM and shifts would make a mess of my register allocation
strategy :(
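
For anyone following along, this is roughly what I understand by the
"LDM with ORR" approach, written as a C sketch rather than assembly.
combine_unaligned() is a made-up name, and the sketch assumes a
little-endian target: two aligned words are loaded, then shifted and
ORed together to rebuild the unaligned word.

```c
#include <stdint.h>

/*
 * Hypothetical helper: rebuild the 32-bit value that starts byte_off
 * bytes into the aligned word 'lo', using the following aligned word
 * 'hi'. Assumes little-endian byte order and byte_off in 1..3; a
 * byte_off of 0 needs no combining and would make the left shift
 * below undefined.
 */
static uint32_t
combine_unaligned (uint32_t lo, uint32_t hi, unsigned byte_off)
{
    unsigned shift = byte_off * 8;

    return (lo >> shift) | (hi << (32 - shift));
}
```

The 'hi' word here is the extra register I was referring to: in a loop
it has to be kept live so that it can become the 'lo' word of the next
iteration.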

Ben