[Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

Tue May 15 14:37:53 PDT 2012

On Mon, May 14, 2012 at 9:17 PM, Nemanja Lukic <nemanja.lukic at rt-rk.com> wrote:
> Hi Siarhei,
>
> I implemented a new version of the BILINEAR_INTERPOLATE_SINGLE_PIXEL macro (patch below) where the ANDI/EXT instructions
> are replaced with load byte instructions (for better dual-issue instruction balancing), and got these results on my Malta board:
>
> Original:
> [  0]    image             firefox-fishtank 2289.180 2290.567   0.05%    5/6
> Opt (ANDI/EXT)
> [  0]    image             firefox-fishtank 1700.925 1708.314   0.22%    5/6
> Opt2 (load byte instructions)
> [  0]    image             firefox-fishtank 1671.700 1672.006   0.03%    4/4
>
> There is a performance improvement, but it is not as impressive as I expected.

This is interesting. Have you tried to check where the time is
actually spent and what is the performance bottleneck?

If the code from the BILINEAR_INTERPOLATE_SINGLE_PIXEL macro is taken
and put into an unrolled loop so that the total number of iterations is
equal to the CPU clock frequency (which makes the run time in seconds
equal to the number of cycles per iteration), then we get:

MIPS 74K and ANDI/EXT variant of BILINEAR_INTERPOLATE_SINGLE_PIXEL:

real	0m40.279s
user	0m40.260s
sys	0m0.006s

MIPS 74K and LBU variant of BILINEAR_INTERPOLATE_SINGLE_PIXEL:

real	0m26.479s
user	0m26.468s
sys	0m0.006s

That's ~40 cycles vs. ~26.5 cycles, a saving of roughly 13.5 cycles
from replacing 16 ALU instructions with 16 LS (load/store)
instructions. Ideally we would want perfect dual issue here and a
16-cycle saving, but at least dual issue works.
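
To spell out the arithmetic behind these numbers, here is a minimal C
sketch. The helper name cycles_per_iteration is made up for
illustration, and the 1 GHz clock used in the note below is an assumed
value, not the clock of the board the timings came from.

```c
#include <assert.h>

/*
 * Hypothetical helper: average cost of one loop iteration in CPU cycles,
 * given the measured run time, the CPU clock and the iteration count.
 * When iterations == clock_hz, the result is simply the run time in seconds.
 */
static double
cycles_per_iteration (double seconds, double clock_hz, double iterations)
{
    assert (iterations > 0.0);
    return seconds * clock_hz / iterations;
}
```

With the user times above and an assumed 1 GHz clock,
cycles_per_iteration (40.260, 1e9, 1e9) is about 40.3 and
cycles_per_iteration (26.468, 1e9, 1e9) is about 26.5.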

> And now the code also becomes vulnerable to the endianness of the target CPUs.
> Of course, this can be guarded with some #ifdef's where the byte offset in a word is changed according to the endianness of the target CPU (since MIPS CPUs can be both LE and BE).
> Is this small improvement worth making this code vulnerable to endian issues?

If you are already satisfied with this level of performance, then it's
probably fine for now.
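
For reference, the endianness guard mentioned in the quoted text could
look roughly like the C sketch below. It is only an illustration, not
code from the patch: the function names are made up, and it assumes the
GCC/Clang predefined __BYTE_ORDER__ macro (falling back to little
endian when the macro is missing). A channel extracted with shift and
mask (the ANDI/EXT style) does not depend on byte order, while a byte
load (the LBU style) needs its offset flipped on big endian targets.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-mask extraction: the result does not depend on byte order. */
static uint8_t
channel_by_shift (uint32_t pixel, unsigned channel)
{
    assert (channel < 4);
    return (uint8_t) ((pixel >> (8 * channel)) & 0xff);
}

/* Memory offset of the byte that holds bits [8*channel, 8*channel + 7]. */
static unsigned
channel_byte_offset (unsigned channel)
{
    assert (channel < 4);
#if defined (__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
    return 3 - channel;     /* big endian: most significant byte comes first */
#else
    return channel;         /* assumed little endian */
#endif
}

/* Byte-load extraction: only correct with the endian-aware offset. */
static uint8_t
channel_by_load (const uint32_t *pixel, unsigned channel)
{
    const uint8_t *bytes = (const uint8_t *) pixel;
    return bytes[channel_byte_offset (channel)];
}
```

Both accessors return the same value for every channel of a given
pixel, so the guard costs nothing at run time; the price is one more
#if block to maintain.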

> I still need to add the improvement for the packing/unpacking of the RGBA pixels after bilinear/before the OVER operation, but I don't expect a big improvement there (it is just a couple of instructions).

It's not just a couple of instructions. By combining the color
channels in a register, you are also forcing the processor to finish
the calculations for all the needed data. And this is an extra data
dependency, which may inhibit instruction reordering.
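
A rough C sketch of the two data flows, only to show where the extra
dependency sits. All names here are made up for illustration, the
src[] array stands for the four interpolated channels (index 0 is the
lowest byte, index 3 is alpha), and the arithmetic is the usual
premultiplied OVER with an 8-bit mask as I understand it, not a copy of
the assembly. A C compiler may well generate the same code for both
variants; the difference matters for hand-scheduled assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded (x * a) / 255 for 8-bit values. */
static uint8_t
mul_un8 (uint8_t x, uint8_t a)
{
    uint32_t t = (uint32_t) x * a + 0x80;
    return (uint8_t) ((t + (t >> 8)) >> 8);
}

/* Saturating 8-bit add. */
static uint8_t
add_un8_sat (uint8_t x, uint8_t y)
{
    uint32_t t = (uint32_t) x + y;
    return (uint8_t) (t > 255 ? 255 : t);
}

/* Variant A: pack all four channels into a word, then unpack for OVER. */
static uint32_t
over_packed (const uint8_t src[4], uint8_t mask, uint32_t dst)
{
    uint32_t s = 0, out = 0;
    unsigned c;
    uint8_t ia;

    for (c = 0; c < 4; c++)     /* packing waits for every channel */
        s |= (uint32_t) src[c] << (8 * c);

    ia = (uint8_t) (255 - mul_un8 ((uint8_t) (s >> 24), mask));
    for (c = 0; c < 4; c++)
    {
        uint8_t sc = mul_un8 ((uint8_t) (s >> (8 * c)), mask);
        uint8_t dc = mul_un8 ((uint8_t) (dst >> (8 * c)), ia);
        out |= (uint32_t) add_un8_sat (sc, dc) << (8 * c);
    }
    return out;
}

/* Variant B: feed each channel to OVER directly; only alpha is shared. */
static uint32_t
over_unpacked (const uint8_t src[4], uint8_t mask, uint32_t dst)
{
    uint8_t ia = (uint8_t) (255 - mul_un8 (src[3], mask));
    uint32_t out = 0;
    unsigned c;

    for (c = 0; c < 4; c++)
    {
        uint8_t sc = mul_un8 (src[c], mask);
        uint8_t dc = mul_un8 ((uint8_t) (dst >> (8 * c)), ia);
        out |= (uint32_t) add_un8_sat (sc, dc) << (8 * c);
    }
    return out;
}
```

Both variants compute the same pixel. In variant A nothing can start
until the packed word is complete, while in variant B each channel
depends only on its own interpolation result and on alpha.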

But big improvements are not likely to happen unless there is a clear
understanding of what is going on in the CPU pipeline and every cycle
spent is accounted for.

-- 
Best regards,
Siarhei Siamashka