[Pixman] [PATCH 2/2] ARM: Add 'neon_composite_over_n_8888_0565' fast path

Fri Apr 15 18:39:16 PDT 2011

Taekyun Kim <podain77 at gmail.com> writes:

> I marked bubbles that I could find.
> Here we can make step 3 independent (or less dependent) from above step 6 and 7
> by proper allocation of registers.
> So we can insert some instructions of step 3 into the above bubble positions.
> Output of step 1 (fetch dest) will be read in step 4 and output of step 2 (fetch
> mask) will be read in step 3.
> So I think you can fetch mask first and then dest at the beginning of tail_head
> block and remaining bubbles can be filled with instructions from step 3.
>
> Maybe this does not work, or there can be some other better ways to achieve
> optimal performance.

Thanks - these comments were helpful. There is a new patch below that
implements these suggestions. I can't find any more stalls in the inner
loop. This version does produce a measurable speedup with data in L1
cache compared to the non-pipelined version: typically from 85-95
Mpixels/s to 90-100 Mpixels/s.
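
As a rough illustration of the reordering discussed above (and not the
NEON code from the patch), here is a scalar C sketch of the same idea:
fetch the mask first, then dest, and issue the fetches for the next
pixel before finishing the current one. Everything in it is a
simplifying assumption: the names blend_simple, blend_pipelined and
mul_un8 are hypothetical, and it works on a single 8-bit channel
rather than on 8888 source and 0565 destination pixels.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply two 8-bit values and divide by 255 with rounding. */
static inline uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Reference loop: fetch, compute and store one pixel at a time. */
static void blend_simple(uint8_t *dst, const uint8_t *mask,
                         uint8_t src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t s = mul_un8(src, mask[i]);                /* src IN mask */
        dst[i] = (uint8_t)(s + mul_un8(dst[i], 255 - s)); /* OVER dest   */
    }
}

/* Pipelined loop: the loads for pixel i+1 are issued while pixel i is
 * still being computed. The mask is fetched before dest because the
 * mask is consumed first (by the IN step). */
static void blend_pipelined(uint8_t *dst, const uint8_t *mask,
                            uint8_t src, size_t n)
{
    if (n == 0)
        return;

    /* Head: fetch mask, then dest, for the first pixel. */
    uint8_t m = mask[0];
    uint8_t d = dst[0];

    for (size_t i = 0; i + 1 < n; i++) {
        uint8_t m_next = mask[i + 1];   /* early fetch for next pixel */
        uint8_t s = mul_un8(src, m);    /* depends only on the mask   */
        uint8_t d_next = dst[i + 1];    /* fills the gap before OVER  */

        dst[i] = (uint8_t)(s + mul_un8(d, 255 - s));

        m = m_next;
        d = d_next;
    }

    /* Tail: finish the last pixel, with nothing left to fetch. */
    uint8_t s = mul_un8(src, m);
    dst[n - 1] = (uint8_t)(s + mul_un8(d, 255 - s));
}
```

Both functions should produce identical output for the same input;
only the order of loads relative to arithmetic differs. Whether that
ordering helps in compiled C depends on the compiler and the CPU, so
this is only meant to show the structure of the head, loop body and
tail.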

The precision of these measurements still leaves something to be desired,
but it's pretty clear that there is some amount of improvement here.

Soren