[Pixman] [PATCH 2/2] MIPS: DSPr2: Added fast-paths for SRC operation.

Fri Feb 17 08:51:28 PST 2012

On Fri, Feb 17, 2012 at 5:51 PM, Lukic, Nemanja <nlukic at mips.com> wrote:
> Hi Siarhei,
>
> Board on which I collected those results is old PC-style evaluation board with MIPS CPU chip running at 1GHz (74Kc).
> More detailed information on cache for this board:
>  - Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
>  - Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
>  - MIPS secondary cache 512kB, 8-way, linesize 32 bytes.
>
> Your concerns for memory bandwidth make sense, but I don’t think this is related to timings and memory clock frequency configuration.
> This evaluation board uses old SDRAM, which lacks a lot in performance to modern DDR2/DDR3 memory chips, and thus influence overall peak memory bandwidth.

I did not mean to criticise your choice of hardware used for testing.
But having very slow RAM is just not very convenient for evaluating the
performance improvements from your optimizations.

For example, if we compare the performance of "src_x888_8888" and
"src_8888_8888" operations, then in your case it is 27.19 MPix/s vs.
30.32 MPix/s. Looks more or less the same, right?

But if we take different hardware, such as my Asus RT-N16 router,
then it becomes 39.00 MPix/s vs. 79.89 MPix/s. The root cause is
apparently an insufficient prefetch distance for the source buffer in
"src_x888_8888":

+LEAF(pixman_composite_src_x888_8888_asm_mips)
+/*
+ * a0 - dst (a8r8g8b8)
+ * a1 - src (x8r8g8b8)
+ * a2 - w
+ */
+
+    .set noreorder
+
+    beqz     a2, 4f
+     nop
+    li       t9, 0xff000000
+    srl      t8, a2, 3    /* t8 = how many multiples of 8 src pixels */
+    beqz     t8, 3f       /* branch if less than 8 src pixels */
+     nop
+1:
+    addiu    t8, t8, -1
+    beqz     t8, 2f
+     addiu   a2, a2, -8
+    pref     0, 32(a1)

Here it is  ^. And "pixman_mips_fast_memcpy" (used for
"src_8888_8888") implements source buffer prefetching better.
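
To illustrate what I mean by prefetch distance: the loop above handles
8 pixels (32 bytes, one cache line on your board) per iteration and
prefetches 32(a1), so it only ever asks for the very next line. Below
is a rough C sketch of the same loop with the distance pulled out into
a constant. The function name, the PREFETCH_DISTANCE_BYTES macro and
its value of 128 are all made up for illustration; they are not taken
from the patch or from "pixman_mips_fast_memcpy", and the right value
would have to be found by benchmarking on real hardware.
__builtin_prefetch is the GCC/Clang builtin, so this assumes one of
those compilers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning knob (not from the patch): how many bytes ahead
 * of the current source pointer to prefetch. 128 is a placeholder. */
#ifndef PREFETCH_DISTANCE_BYTES
#define PREFETCH_DISTANCE_BYTES 128
#endif

/* Hypothetical C stand-in for the assembly routine: copy w pixels
 * from x8r8g8b8 to a8r8g8b8 by forcing the alpha byte to 0xff. */
static void
src_x888_8888_sketch (uint32_t *dst, const uint32_t *src, size_t w)
{
    /* Main loop: 8 pixels = 32 bytes per iteration. */
    while (w >= 8)
    {
        size_t i;

        /* Prefetch for reading (second argument 0), like "pref 0".
         * The address is built with integer arithmetic because it may
         * point past the end of the buffer; a prefetch is only a hint
         * and does not fault on an invalid address. */
        __builtin_prefetch (
            (const void *) ((uintptr_t) src + PREFETCH_DISTANCE_BYTES), 0);

        for (i = 0; i < 8; i++)
            dst[i] = src[i] | 0xff000000u;

        src += 8;
        dst += 8;
        w -= 8;
    }

    /* Tail: fewer than 8 pixels left. */
    while (w--)
        *dst++ = *src++ | 0xff000000u;
}
```

Building this with -DPREFETCH_DISTANCE_BYTES=32 roughly corresponds to
what the patch does now, so comparing a few larger values on a board
with faster RAM should show whether the distance is really the
bottleneck.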

But I'm not going to complain about all the missing optimization
opportunities in your code at this point (unless you ask me to :) ).
Being faster than C is good enough for a start.

-- 
Best regards,
Siarhei Siamashka