[Pixman] [PATCH 2/2] ARM: Add 'neon_composite_over_n_8888_0565' fast path

Soeren Sandmann sandmann at cs.au.dk
Fri Apr 15 18:39:16 PDT 2011


Taekyun Kim <podain77 at gmail.com> writes:

> I marked bubbles that I could find.
> Here we can make step 3 independent of (or less dependent on) steps 6 and 7
> above by proper allocation of registers.
> So we can insert some instructions of step 3 into the above bubble positions.
> The output of step 1 (fetch dest) will be read in step 4 and the output of
> step 2 (fetch mask) will be read in step 3.
> So I think you can fetch mask first and then dest at the beginning of the
> tail_head block, and the remaining bubbles can be filled with instructions
> from step 3.
>
> Maybe this does not work, or there may be other, better ways to achieve
> optimal performance.

Thanks - these comments were helpful. There is a new patch below that
implements these suggestions. I can't find any more stalls in the inner
loop. This version does produce a measurable speedup with data in L1
cache compared to the non-pipelined version: typically from 85-95
Mpixels/s to 90-100 Mpixels/s.

The precision of these measurements still leaves something to be desired,
but it's pretty clear that there is some improvement here.
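
For readers not familiar with the head / tail_head / tail structure that
the pipelined version relies on, here is a minimal scalar sketch in C. It
is only an illustration, not the NEON code from the patch: blend(),
blend_row_simple() and blend_row_pipelined() are hypothetical names, and
the arithmetic is a simplified 8-bit stand-in for the real 0565 pixel
math. What it is meant to show is the ordering: the mask is fetched
before dest because it is needed first, and the work for pixel i-1 sits
between the loads for pixel i and their first use.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-pixel work: a simplified 8-bit "over" with a mask. */
static uint8_t blend(uint8_t src, uint8_t mask, uint8_t dest)
{
    /* Needs only the mask (the role of "step 3" in the discussion). */
    unsigned s = ((unsigned)src * mask + 127) / 255;
    /* Needs the fetched dest (the role of "step 4"). */
    unsigned d = ((unsigned)dest * (255 - s) + 127) / 255;
    return (uint8_t)(s + d);
}

/* Non-pipelined reference: fetch, compute and store one pixel at a time. */
static void blend_row_simple(uint8_t *dest, const uint8_t *mask,
                             uint8_t src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dest[i] = blend(src, mask[i], dest[i]);
}

/* Software-pipelined variant: head, tail_head loop, tail. */
static void blend_row_pipelined(uint8_t *dest, const uint8_t *mask,
                                uint8_t src, size_t n)
{
    if (n == 0)
        return;

    /* head: fetch the inputs of the first pixel, mask first */
    uint8_t m = mask[0];
    uint8_t d = dest[0];

    /* tail_head: start fetching pixel i, then finish pixel i-1 */
    for (size_t i = 1; i < n; i++) {
        uint8_t next_m = mask[i];
        uint8_t next_d = dest[i];
        dest[i - 1] = blend(src, m, d);
        m = next_m;
        d = next_d;
    }

    /* tail: finish the last pixel */
    dest[n - 1] = blend(src, m, d);
}

/* Returns 1 if both variants produce the same row, 0 otherwise. */
static int pipelined_matches_simple(void)
{
    const uint8_t mask[7] = { 0, 255, 128, 1, 64, 200, 17 };
    uint8_t a[7] = { 10, 20, 30, 40, 250, 0, 255 };
    uint8_t b[7] = { 10, 20, 30, 40, 250, 0, 255 };

    blend_row_simple(a, mask, 0x80, 7);
    blend_row_pipelined(b, mask, 0x80, 7);
    for (size_t i = 0; i < 7; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

A C compiler is free to reorder these loads on its own, so this only
shows the shape of the loop; in hand-written assembly the order of the
instructions inside the tail_head block is what determines where the
bubbles end up.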


Soren



