[Pixman] [PATCH 3/5] ARMv6: New blit routines

Siarhei Siamashka siarhei.siamashka at gmail.com
Wed Jan 23 09:13:23 PST 2013


On Wed, 23 Jan 2013 15:16:49 -0000
"Ben Avison" <bavison at riscosopen.org> wrote:

> On Wed, 23 Jan 2013 13:22:19 -0000, Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:
> 
> > The L1/L2 numbers are bogus, because ARM11 (and Cortex-A8 in default
> > configuration) does not allocate data into cache on write misses.
> > There is some code here to take read-allocate cache into account:
> >     http://cgit.freedesktop.org/pixman/tree/test/lowlevel-blt-bench.c?id=pixman-0.28.2#n178
> > But it uses 64 byte steps (which are bigger than 32 bytes cache line
> > in ARM11), and also it does such cache "prefetch" only for the first
> > scanline.
> 
> Those are some very good points. I've been seeing terrible variability in
> L1 and L2 profile results in particular - hence all the statistical post-
> processing I've been doing on the results to try to make sense of them -
> and this could help explain that.
> 
> I had seen the code that attempts to ensure that the source and
> destination buffers are in the cache (notice, it should really be
> fetching the mask buffer as well - there's another fault with it).

In fact, the fetch for the source buffer is redundant there, kind of
    http://en.wikipedia.org/wiki/Cargo_cult_programming
We only need to keep fetching the destination buffer into cache, as the
comment in the code says.
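As a rough illustration, a fixed warm-up loop could read every cache
line of the destination on every scanline (the current code steps by
64 bytes and only touches the first scanline). This is just a sketch,
not the actual lowlevel-blt-bench code; the function name and the
hardcoded 32-byte line size are my own choices:

```c
#include <stdint.h>

/* Hypothetical warm-up for a read-allocate cache: touch each 32-byte
 * cache line of the destination with a read, on every scanline.
 * 32 bytes covers ARM11's short cache lines.  The reads are summed so
 * the compiler cannot optimize them away. */
#define CACHELINE_BYTES 32

static uint32_t
prefetch_dst (const uint32_t *dst, int width, int height, int stride_words)
{
    uint32_t sum = 0;

    for (int y = 0; y < height; y++)
    {
        const uint32_t *line = dst + y * stride_words;
        for (int x = 0; x < width; x += CACHELINE_BYTES / 4)
            sum += line[x];
    }
    return sum;
}
```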

The problem with the distorted benchmark statistics happens because:
1. We start with both source and destination buffers in CPU cache
2. The benchmark happily runs for N iterations with both source and
   destination buffers still in cache
3. We are preempted by some other process, which performs some very
   short activity, but thrashes the CPU caches
4. The next iteration of our benchmark will pull the source buffer back
   into cache (because cache lines are allocated on read misses), but
   will keep missing cache for the destination buffer
5. The benchmark runs for the rest of the iterations with the destination
   buffer no longer in the CPU cache (the source buffer is fine)

Depending on the difference between the time spent in (2) and the time
spent in (5), the results may vary by a large margin.

> However, I had overlooked the fact that "dst" and "src" are
> "uint32_t *"s, so the index increment of 16 is indeed 64 bytes, as you
> say, and therefore misses out half of the ARM11 L1 cache lines. The
> failure to pull in subsequent scan lines in the L2 cache test also seems
> like a major oversight, and one which will have distorted profile results
> on *all* platforms, not just ARM.

All the other platforms (even MIPS32) normally use a write-allocate
cache. So the only affected processors are ARM processors up to
Cortex-A8. Actually the Cortex-A8 can also be configured to use a
write-allocate cache, but overall performance just gets worse, so nobody
uses it. As far as I know, newer ARM processors have to use
write-allocate caches because this is needed for SMP.

> What would people's view be on fixing these bugs in lowlevel-blt-bench?
> Obviously it would affect the numbers people get when running it on all
> platforms.

The fixes are very much welcome :)

There are also other problems with lowlevel-blt-bench, such as:
1) Very erratic time needed to run the test
2) Missing stddev estimation
3) The size of the buffer used for the tests (1920x1080x32bpp) is not
   so large; modern x86 processors have huge L3 caches nowadays.
4) The hardcoded tables of compositing operations, while we could
   just parse the provided string and extract most of the needed
   information from it (compositing operator, source, destination
   and mask format, ...)
 
> While it would be nice to be able to adapt the stride in that bit of code
> inside bench_L() to suit the cache line length of the platform in use,
> but I fear that's unlikely to be possible in a platform- and OS-
> independent way. In which case, a lowest-common-denominator approach
> would seem to be the best bet; does anyone have a feeling for what the
> shortest cache lines are likely to be across all platforms supported by
> pixman?

32 seems to be a reasonable choice. For some processors/platforms it is
even possible to get information about the cache size/properties at
runtime (with the CPUID instruction or something else).
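On Linux/glibc, for example, sysconf can report the L1 data cache line
size at runtime; this is just one possible approach (other platforms
would need CPUID or sysctl), falling back to the 32-byte lowest common
denominator when the OS does not report anything:

```c
#include <unistd.h>

/* Query the L1 data cache line size at runtime where the OS exposes
 * it; fall back to 32 bytes, a safe lower bound for the platforms
 * pixman supports. */
static long
detect_cacheline_bytes (void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long sz = sysconf (_SC_LEVEL1_DCACHE_LINESIZE);
    if (sz > 0)
        return sz;
#endif
    return 32;
}
```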

> > Still the numbers for M test are interesting and demonstrate that using
> > LDM with ORR arithmetic is better for unaligned reads.
> 
> I can't help thinking there must be more to it than that. I expected that
> when using preloads, the data would effectively already be in the L1
> cache by the time the LDR/LDM instructions acted on it, and from
> benchmarking of a simple test case I wrote (which was deliberately
> confined to the L1 cache) I saw the opposite result. Is there something
> wrong with my understanding of how preloads operate then?

We can't know anything for sure. But in any case, I guess unaligned LDR
instructions may just be occupying some resources of the L1 cache and
starving the L2->L1 cache line fills done concurrently by PLD.
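For reference, the LDM with ORR arithmetic technique expressed in C
looks roughly like this: read only aligned words and combine the two
halves with shifts (little-endian, as on the ARM targets discussed
here). This is a sketch of the general idea, not code from the patch;
note that, like a trailing LDM in assembly, the second aligned word may
lie just past the end of the logical buffer:

```c
#include <stdint.h>

/* Load a possibly misaligned 32-bit little-endian value using only
 * aligned word accesses, the C equivalent of LDM + shifts + ORR. */
static uint32_t
load_misaligned_le (const uint8_t *p)
{
    uintptr_t       addr  = (uintptr_t) p;
    const uint32_t *word  = (const uint32_t *) (addr & ~(uintptr_t) 3);
    unsigned        shift = (unsigned) (addr & 3) * 8;

    if (shift == 0)
        return word[0];

    /* combine the low part of the first word with the high part of
     * the next one */
    return (word[0] >> shift) | (word[1] << (32 - shift));
}
```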

> Of course, using unaligned LDRs also had the nice benefit of not needing
> an additional register to hold the data prior to doing the shifting, so
> changing to LDM and shifts would make a mess of my register allocation
> strategy :(

We are usually either CPU performance limited or memory performance
limited. The unaligned reads are not a problem for 32bpp compositing
operations, so they may only affect 16bpp fast paths. And I guess any
16bpp fast path other than the SRC operator is likely to be just CPU
limited, so memory bandwidth is less important.

The 16bpp SRC fast paths can always be implemented separately, even if
they do not fit your framework nicely.

-- 
Best regards,
Siarhei Siamashka

