[Pixman] [PATCH 0/2] Better instruction scheduling of solid source operator OVER

Thu Aug 18 01:29:04 PDT 2011

On Tue, Aug 16, 2011 at 4:34 PM, Taekyun Kim <podain77 at gmail.com> wrote:
> I tried to improve performance of solid geometry filling.
> Pipeline stalls were eliminated as far as I could find.
> Performance of L1 and L2 was improved but memory bench results are still the same.

Thanks, that's good. By the way, do you have any less synthetic
benchmarks for the geometry filling workload that could show a
measurable performance improvement?

Either way, performance optimizations for ARM NEON are still very much welcome.

> One interesting thing was dual issue of ARM(mainly cache preload) and NEON on cortex-a9.
> It doesn't seem to be that effective as on cortex-a8.

That's because Cortex-A8 has a rather large queue for NEON
instructions. The instruction decoder handles 2 instructions per cycle
and if it sees NEON instructions, it just moves them to the NEON
queue. The NEON queue has capacity for ~20 instructions and the NEON
unit can execute only a bit more than 1 NEON instruction per cycle on
average. So starting with a completely full NEON queue, the NEON unit can
be kept busy in the background for a while even if you insert a large
block of ARM instructions before feeding it with the next block of
NEON instructions again. Independent ARM and NEON instructions can be
mixed freely with very few limitations. This design is very good for
performance, but has one caveat: if you need to read back the result
from NEON to ARM, you naturally get a huge stall of ~20 cycles, simply
because NEON instructions are executed with a significant delay
relative to the ARM pipeline.
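
To put rough numbers on this, here is a back-of-the-envelope model in C.
It uses only the approximate figures mentioned above (queue depth ~20,
decoder width 2, NEON throughput rounded down to 1 per cycle) and it
assumes that ARM instructions always pair in the decoder and that no new
NEON instructions arrive during the ARM block, which real code will not
satisfy. The function names are made up for illustration; treat this as a
sketch, not as a description of the actual hardware.

```c
#include <assert.h>

/* Rough figures from the text above, not exact hardware parameters. */
#define NEON_QUEUE_DEPTH 20 /* "~20 instructions" */
#define DECODE_WIDTH      2 /* instructions decoded per cycle */
#define NEON_IPC          1 /* "a bit more than 1", rounded down */

/*
 * Cycles the NEON unit would sit idle while a block of n_arm ARM-only
 * instructions is decoded, starting from a completely full NEON queue.
 */
int neon_idle_cycles(int n_arm)
{
    assert(n_arm >= 0);
    int arm_cycles  = (n_arm + DECODE_WIDTH - 1) / DECODE_WIDTH;
    int busy_cycles = NEON_QUEUE_DEPTH / NEON_IPC;
    return arm_cycles > busy_cycles ? arm_cycles - busy_cycles : 0;
}

/*
 * Approximate wait when ARM code needs a value produced by the last of
 * `queued` NEON instructions that are still waiting in the queue.
 */
int readback_stall_cycles(int queued)
{
    assert(queued >= 0 && queued <= NEON_QUEUE_DEPTH);
    return queued / NEON_IPC;
}
```

In this toy model neon_idle_cycles() stays at zero until the ARM block
takes longer to decode than the queue takes to drain, and
readback_stall_cycles(NEON_QUEUE_DEPTH) gives the ~20 cycles mentioned
above.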

This has been changed in ARM Cortex-A9 and there is no large detached
NEON queue anymore. Cortex-A9 does not have a huge penalty for reading
back the results, but mixing ARM and NEON instructions efficiently is
a lot more difficult now.

Regarding scheduling NEON instructions for Cortex-A9, it is mostly
identical to Cortex-A8, with a few notable differences:
1. It can't dual issue pairs of NEON instructions
2. Mixing ARM and NEON instructions efficiently is more difficult
3. NEON load/store instructions are much slower

One especially important thing when optimizing code for Cortex-A9 is
that any memory load suffers a rather big stall if it is issued
immediately after a NEON store (and the more data that NEON
instruction wrote, the bigger the stall). So software pipelining is
still very relevant for Cortex-A9, because the following
straightforward non-pipelined loop hits exactly this problem:

1:
    /* a serious pipeline stall here on Cortex-A9, because we are
       loading immediately after the store from the previous iteration */
    vld1.8 {d0, d1, d2, d3}, [SRC, :128]! /* load input data */
    /* calculate something */
    vst1.8 {d4, d5, d6, d7}, [DST, :128]! /* store output data */
    subs LOOP_COUNTER, LOOP_COUNTER, #1
    bne  1b

It does not matter whether you read memory with ARM or NEON
instructions after the NEON store, or which memory addresses are
accessed. Pipelining lets us swap the order of the "calculate
something" and "store output data" parts, so that there are enough
non-load/store instructions after the NEON store to avoid this stall.
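
To make the reordering concrete, here is a minimal sketch in plain C. It
only documents the order of operations: a C compiler is free to
reschedule it, so the real thing has to be done by hand in the assembly.
The function names and the calculate() placeholder are hypothetical and
stand in for "calculate something" from the loop above; BLOCK is 32
bytes to match the four d registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 32 /* bytes per iteration: four 64-bit d registers */

/* Hypothetical stand-in for "calculate something". */
static void calculate(uint8_t out[BLOCK], const uint8_t in[BLOCK])
{
    for (size_t i = 0; i < BLOCK; i++)
        out[i] = (uint8_t)(255 - in[i]);
}

/* Same order as the loop above: load, calculate, store, load again. */
void process_simple(uint8_t *dst, const uint8_t *src, size_t nblocks)
{
    uint8_t in[BLOCK], out[BLOCK];
    for (size_t n = 0; n < nblocks; n++) {
        memcpy(in, src + n * BLOCK, BLOCK);  /* load input data */
        calculate(out, in);
        memcpy(dst + n * BLOCK, out, BLOCK); /* store output data */
    }
}

/* Pipelined order: the calculation now sits between a store and the
   next load, so a load never directly follows a store. */
void process_pipelined(uint8_t *dst, const uint8_t *src, size_t nblocks)
{
    uint8_t in[BLOCK], out[BLOCK];
    assert(dst != NULL && src != NULL);
    if (nblocks == 0)
        return;
    memcpy(in, src, BLOCK);                        /* prologue: load block 0 */
    calculate(out, in);                            /* prologue: result 0 */
    for (size_t n = 1; n < nblocks; n++) {
        memcpy(in, src + n * BLOCK, BLOCK);        /* load block n */
        memcpy(dst + (n - 1) * BLOCK, out, BLOCK); /* store result n-1 */
        calculate(out, in);                        /* work after the store */
    }
    memcpy(dst + (nblocks - 1) * BLOCK, out, BLOCK); /* epilogue: last store */
}
```

Both functions produce the same output; the pipelined one just pays for
it with a prologue and an epilogue around the loop.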

Also, back-to-back NEON load/store instructions are slow on Cortex-A9
and interleaving them with some other instructions is important. For
Cortex-A8 we would do that to get dual issue benefits. For Cortex-A9
this is needed to avoid suffering from unnecessary penalties. Either
way, it's good that the same trick helps on both processors and they
don't need separate NEON code paths yet :)

> I need further investigation on it.

You can also try running the benchmarks with different prefetch types
in 'pixman-arm-neon-asm.S' by changing:
    .set PREFETCH_TYPE_DEFAULT, PREFETCH_TYPE_ADVANCED
to either of
    .set PREFETCH_TYPE_DEFAULT, PREFETCH_TYPE_NONE
    .set PREFETCH_TYPE_DEFAULT, PREFETCH_TYPE_SIMPLE

Cortex-A9 has an automatic hardware prefetcher. And if it is enabled
(there are some hardware bugs in early Cortex-A9 revisions, so it may
be disabled for a good reason), then PREFETCH_TYPE_NONE may be a good
choice.

> Below is the results of lowlevel-blt-bench
>
> << over_n_8_8888 >>
> - cortex a8 -
> before : L1: 201.51  L2: 195.59  M:105.15
> after  : L1: 251.56  L2: 257.06  M:107.23
>
> - cortex a9 -
> before : L1: 131.80  L2: 132.91  M:125.56
> after  : L1: 180.52  L2: 182.45  M:131.73

This definitely looks good. Cortex-A8 clearly had (and still has) a
lot of headroom with its NEON unit being much faster than memory, but
Cortex-A9 performance was just barely matching memory bandwidth.

> << over_n_8888 >>
> - cortex a8 -
> before : L1: 372.71  L2: 391.26  M:114.49
> after  : L1: 490.81  L2: 484.38  M:114.27
>
> - cortex a9 -
> before : L1: 268.69  L2: 277.31  M:151.69
> after  : L1: 302.58  L2: 309.04  M:151.56

Both Cortex-A8 and Cortex-A9 seem to have a lot of headroom here, but
having this fast path optimized even better is still nice.
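
For reference, the relative gains can be computed from the numbers
quoted above with a few lines of C. The struct and function names are
just for illustration; the data is copied verbatim from the
lowlevel-blt-bench results in this thread.

```c
#include <assert.h>
#include <stdio.h>

struct bench {
    const char *op, *cpu, *level;
    double before, after;
};

/* Results quoted earlier in this mail. */
static const struct bench results[] = {
    { "over_n_8_8888", "cortex a8", "L1", 201.51, 251.56 },
    { "over_n_8_8888", "cortex a8", "L2", 195.59, 257.06 },
    { "over_n_8_8888", "cortex a8", "M",  105.15, 107.23 },
    { "over_n_8_8888", "cortex a9", "L1", 131.80, 180.52 },
    { "over_n_8_8888", "cortex a9", "L2", 132.91, 182.45 },
    { "over_n_8_8888", "cortex a9", "M",  125.56, 131.73 },
    { "over_n_8888",   "cortex a8", "L1", 372.71, 490.81 },
    { "over_n_8888",   "cortex a8", "L2", 391.26, 484.38 },
    { "over_n_8888",   "cortex a8", "M",  114.49, 114.27 },
    { "over_n_8888",   "cortex a9", "L1", 268.69, 302.58 },
    { "over_n_8888",   "cortex a9", "L2", 277.31, 309.04 },
    { "over_n_8888",   "cortex a9", "M",  151.69, 151.56 },
};

/* Relative change in percent; positive means "after" is faster. */
double gain_percent(double before, double after)
{
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}

/* Print one line per benchmark result with its relative gain. */
void print_gains(void)
{
    size_t n = sizeof results / sizeof results[0];
    for (size_t i = 0; i < n; i++)
        printf("%-14s %-10s %-2s %+6.1f%%\n",
               results[i].op, results[i].cpu, results[i].level,
               gain_percent(results[i].before, results[i].after));
}
```

The L1 and L2 gains come out between roughly 11% and 37%, while the
memory results move by less than 5% in either direction.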

In any case, now things are a little bit more interesting because with
the latest ARM hardware we got a downgraded NEON unit in Cortex-A9
(or speaking positively, should we say "balanced"?) and somewhat
better memory bandwidth.

-- 
Best regards,
Siarhei Siamashka