[Pixman] [PATCH] MIPS: DSPr2: Added more fast-paths for OVER operation: - over_8888_n_8888 - over_8888_n_0565 - over_0565_n_0565 - over_8888_8_8888 - over_8888_8_0565 - over_0565_8_0565

Sun Aug 19 14:05:30 PDT 2012

On Mon,  6 Aug 2012 07:21:54 +0200
Nemanja Lukic <nlukic at mips.com> wrote:

>Performance numbers before/after on MIPS-74kc @ 1GHz:
>
>lowlevel-blt-bench results

Hi, thanks for the patch.

Just as with the previous patch, the summary line is way too long. That
is also an indirect sign that the patch would be better split into a
few independent parts (for easier bisecting, if nothing else).

>Referent (before):
>        over_8888_n_8888 =  L1:   9.92  L2:  11.27  M:  8.50 ( 45.23%)
>        over_8888_n_0565 =  L1:   8.95  L2:   8.33  M:  6.95 ( 27.74%)
>        over_0565_n_0565 =  L1:   7.56  L2:   7.24  M:  6.16 ( 16.38%)
>        over_8888_8_8888 =  L1:  12.54  L2:  10.86  M:  8.18 ( 54.36%)
>        over_8888_8_0565 =  L1:   8.86  L2:   8.11  M:  6.72 ( 35.71%)
>        over_0565_8_0565 =  L1:   7.43  L2:   7.05  M:  5.98 ( 23.85%)
>
>Optimized:
>        over_8888_n_8888 =  L1:  28.02  L2:  24.92  M: 14.72 ( 78.15%)
>        over_8888_n_0565 =  L1:  18.76  L2:  17.55  M: 13.11 ( 52.19%)
>        over_0565_n_0565 =  L1:  15.47  L2:  14.52  M: 12.30 ( 32.65%)
>        over_8888_8_8888 =  L1:  26.92  L2:  23.93  M: 13.65 ( 90.58%)
>        over_8888_8_0565 =  L1:  18.14  L2:  16.79  M: 12.10 ( 64.25%)
>        over_0565_8_0565 =  L1:  15.47  L2:  14.61  M: 11.78 ( 46.92%)

As for the performance numbers, I wonder how much faster these new
specialized MIPS fast paths would still be if we had a DSPr2 optimized
OVER combiner. You can check the "sse2_combine_over_u" and
"neon_combine_over_u" functions as examples of existing combiners.

Adding many fast path functions does not scale very well. It increases
code size, but only covers a small fraction of the possible compositing
operations. Adding just a single combiner function improves the
performance of nearly all uses of the OVER operator, albeit to a
smaller extent, and only when the general path is taken rather than a
C fast path.

Moreover, I still think that it makes a lot of sense to first
attempt to implement the code which provides the best performance
for the OVER operator and only then replicate it to multiple fast path
functions. This means that implementing an OVER combiner and optimizing
the hell out of it would be a really good start.
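
For illustration only, here is a minimal scalar C sketch of what a
unified OVER combiner computes on premultiplied a8r8g8b8 pixels:
dest = src + dest * (255 - alpha(src)) / 255, per channel. The names
sketch_mul_un8, sketch_over_pixel and sketch_combine_over_u are
hypothetical and are not pixman functions. The real combiners also take
an implementation pointer, an operator and an optional mask, all of
which are left out here. A DSPr2 version would presumably process
several channels per instruction instead of looping over them.

```c
#include <stdint.h>

/* Hypothetical helper: x * a / 255 with rounding, for 8-bit x and a. */
static inline uint32_t sketch_mul_un8(uint32_t x, uint32_t a)
{
    uint32_t t = x * a + 0x80;
    return (t + (t >> 8)) >> 8;
}

/* Hypothetical helper: OVER for one premultiplied a8r8g8b8 pixel. */
static inline uint32_t sketch_over_pixel(uint32_t s, uint32_t d)
{
    uint32_t inv_alpha = 255 - (s >> 24);   /* 255 - alpha(src) */
    uint32_t result = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8) {
        uint32_t c = ((s >> shift) & 0xff)
                   + sketch_mul_un8((d >> shift) & 0xff, inv_alpha);
        if (c > 255)
            c = 255;                        /* saturate the channel */
        result |= c << shift;
    }
    return result;
}

/* Hypothetical combiner: applies OVER to a whole scanline, no mask. */
void sketch_combine_over_u(uint32_t *dest, const uint32_t *src, int width)
{
    int i;

    for (i = 0; i < width; i++)
        dest[i] = sketch_over_pixel(src[i], dest[i]);
}
```

Since the general path runs such a function on every scanline that has
no dedicated fast path, any speedup in it is shared by all OVER
operations that end up there.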

The same applies to bilinear scaling: focusing on the best performing
bilinear src_8888_8888 should provide a good starting point for a decent
generic bilinear fetcher. But it's a bit more complicated (separable
processing may need some high level changes). In any case, I'm going to
try implementing fast bilinear scaling for ARM11 (Raspberry Pi), and
parts of it may also be useful for MIPS.

Moving forward, if you split your patch and it passes tests (BTW, is
your system little endian?), then it should be fine for now, though it
is likely not very future proof for the reasons explained above.

-- 
Best regards,
Siarhei Siamashka