[Pixman] [PATCH 2/2] ARM: Add 'neon_composite_over_n_8888_0565' fast path

Siarhei Siamashka siarhei.siamashka at gmail.com
Wed Apr 13 13:55:35 PDT 2011


On Wed, Apr 13, 2011 at 4:39 AM, Soeren Sandmann <sandmann at cs.au.dk> wrote:
> Hi,
>
>> I'm also a bit new to NEON, however I hope this brief review can be
>> helpful to you.  So let's start.
>
> Thanks for the comments. I'm including a new version shortly that
> incorporates some of the changes, but unfortunately, I couldn't reliably
> measure much of an improvement. On my system, the L1 value from
> lowlevel-blt-bench seems to fluctuate between 85 and 95 pretty much no
> matter what I do.

The tricky part is that the erratic behavior may have several
different causes. On recent ARM Cortex-A8 cores, NEON memory accesses
do not allocate data into the L1 cache, but work with L2 directly.
Another problem shows up when write-allocate is not used: any writes
that miss the cache bypass it and go to memory through a write
combining buffer. The practical effect for benchmarking is that in a
multitasking environment some other process may get control for a
tiny fraction of time, thrash the caches and cause the data set used
by the benchmark to be evicted from cache (never to be allocated
again until the end of the benchmark). This can seriously distort the
results unless something is done to negate these effects. The
'bench_L' function already tries to forcefully pull the source and
destination data into the L1 cache on each iteration, but it looks
like it forgets about the mask, which might be the cause. Something
similar to work around memory/cache effects is also done in a
standalone test program attached at:
http://lists.freedesktop.org/archives/pixman/2011-April/001188.html
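
For reference, below is a minimal sketch of what pulling all three
buffers into cache before the timed loop could look like. The
function name, the 64-byte cache line size and the buffer layout are
my assumptions for illustration, not the actual lowlevel-blt-bench
code:

#include <stdint.h>
#include <stddef.h>

/* Sketch only: touch one byte per assumed 64-byte cache line in all
 * three buffers before the timed loop, so that src, mask and dst are
 * all resident in cache.  The volatile sink keeps the compiler from
 * removing the loads as dead code. */
static void
pretouch_buffers (const uint8_t *src, const uint8_t *mask,
                  uint8_t *dst, size_t size)
{
    volatile uint8_t sink = 0;
    size_t i;

    for (i = 0; i < size; i += 64)
    {
        sink ^= src[i];
        if (mask)
            sink ^= mask[i];
        sink ^= dst[i];
    }
    (void) sink;
}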

BTW, another tweak there is the call overhead compensation (the
EXCLUDE_OVERHEAD define): when using pixman in the normal way, this
overhead is so high that it seriously distorts benchmark results for
very small images (such as performance measurements on L1 cached
data).
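
The principle is simple (the function below only illustrates the
idea, it is not the actual EXCLUDE_OVERHEAD code): time the same
number of composite calls on a trivially small area, and subtract
that from the measurement on the real working set, so that only the
per-pixel work is attributed to the fast path.

/* Sketch of the compensation: 't_work' is the time measured for N
 * calls on the real buffer, 't_calls' the time for the same N calls
 * on a trivially small area; the difference approximates the time
 * spent on the pixels themselves. */
static double
exclude_call_overhead (double t_work, double t_calls)
{
    double t = t_work - t_calls;
    return t > 0.0 ? t : 0.0;
}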

> I think the main lesson is that if we want to be serious about this
> level of microoptimization, we need a more robust way of measuring
> performance than lowlevel-blt-bench provides. In particular more
> iterations and tracking of standard deviations would be useful.

More iterations or a standard deviation may not be very applicable
if we don't have a normal distribution, but rather random numbers
scattered somewhere between the real L1 and the real L2 performance.
That said, a standard deviation is surely a good idea once the source
of randomness is eliminated, or even just for the M benchmark.
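
To illustrate (hypothetical helper, not part of lowlevel-blt-bench):
with a bimodal distribution the mean and standard deviation mostly
describe the mix of L1 and L2 runs, so reporting the minimum (the
least disturbed run) alongside them would at least make the
distortion visible.

#include <math.h>
#include <stddef.h>

typedef struct { double mean, stddev, min; } bench_stats_t;

static bench_stats_t
summarize_samples (const double *samples, size_t n)
{
    bench_stats_t s = { 0.0, 0.0, samples[0] };
    size_t i;

    for (i = 0; i < n; i++)
    {
        s.mean += samples[i];
        if (samples[i] < s.min)
            s.min = samples[i];
    }
    s.mean /= (double) n;

    for (i = 0; i < n; i++)
        s.stddev += (samples[i] - s.mean) * (samples[i] - s.mean);
    s.stddev = sqrt (s.stddev / (double) n);

    return s;
}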

> Also, I noticed that lowlevel-blt-bench copies the destination back into
> the source in various places:
>
>    memcpy (src, dst, BUFSIZE);
>    memcpy (dst, src, BUFSIZE);
>
> This could affect performance if dst ends up containing mostly zeros. In
> the cases where the source is supposed to be solid, it could mean
> operations like over_n_8_8888() will turn into noops completely. (This
> happened to me in this case due to a bug in the over_n_8888_0565_ca()
> code).

This lowlevel-blt-bench program was primarily written for checking
the NEON optimizations, which at the moment do not handle 0 or FF
values in any special way. It has problems for the other
implementations and could surely be improved to benchmark common use
cases separately, such as some sort of 'translucent', 'transparent'
and 'opaque' workloads.
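
A hypothetical helper for such separate cases could look like the
following (names and constants are mine, not existing
lowlevel-blt-bench code); an implementation that shortcuts
alpha == 0x00 or alpha == 0xff would show very different numbers for
the three fills, while the current NEON code should be largely
insensitive to which one is used:

#include <stdint.h>
#include <stddef.h>

/* Fill an a8r8g8b8 buffer with a 'transparent', 'opaque' or
 * 'translucent' test pattern (premultiplied-valid colors). */
static void
fill_test_pattern (uint32_t *buf, size_t n_pixels, int kind)
{
    size_t i;

    for (i = 0; i < n_pixels; i++)
    {
        if (kind == 0)       /* 'transparent': alpha 0 everywhere */
            buf[i] = 0x00000000u;
        else if (kind == 1)  /* 'opaque': alpha 0xff, arbitrary color */
            buf[i] = 0xff000000u | (((uint32_t) i * 0x01010101u) & 0x00ffffffu);
        else                 /* 'translucent': alpha 0x7f, channels <= 0x7f */
            buf[i] = 0x7f000000u | (((uint32_t) i * 0x9e3779b9u) & 0x007f7f7fu);
    }
}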

> I think the order of those memcpy() calls should at a minimum be
> reversed, but I'm not completely clear on why they are there in the
> first place. To put the sources and destinations into cache perhaps?

Yes, this was done to bring the CPU caches into a more or less
consistent state between benchmarks, even though it's a bit
questionable whether this actually helps anything.


Anyway, my current belief is that L1 performance, and to some extent
L2 performance, is quite difficult to benchmark even in a synthetic
test like this. That is the basis for the empirical rule of mostly
caring about whether the CPU can fully saturate the available memory
bandwidth in every standard non-scaled fast path. Also, for very
small images the overhead in the other parts of pixman, even before
the NEON code is reached, plays quite a significant role.
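
As a rough back-of-the-envelope (the per-pixel byte count here is my
assumption based on the usual fast path naming, not a measured
number): over_n_8888_0565 reads 4 bytes of a8r8g8b8 mask plus reads
and writes 2 bytes of r5g6b5 destination per pixel, i.e. roughly
8 bytes of memory traffic per pixel.

/* Illustration only: minimum bandwidth the memory system has to
 * sustain at a given pixel rate.  Mpix/s * bytes/pixel = MB/s,
 * divided by 1000 to get GB/s. */
static double
memory_traffic_gb_per_s (double mpix_per_s, double bytes_per_pixel)
{
    return mpix_per_s * bytes_per_pixel / 1000.0;
}

/* e.g. memory_traffic_gb_per_s (300.0, 8.0) == 2.4 (GB/s) */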

-- 
Best regards,
Siarhei Siamashka

