[Pixman] Better memory bandwidth utilization in pixman ARM NEON optimizations

Tue Dec 14 08:32:04 PST 2010

Hello,

After recently looking at the pixman SSE2 code for the over_8888_8888
compositing operation, and in particular at the code which handles the fully
transparent and fully opaque special cases, I decided to re-evaluate the
use of the same trick for ARM NEON.

Actually it was proposed by Soeren a year ago, but I discarded this idea at
that time, mostly because of a slow 20 cycle NEON to ARM data transfer on
ARM Cortex-A8 [1] which seemed to be a really bad problem.

But now the clock speed of ARM processors keeps increasing, while memory
bandwidth does not seem to catch up at all. This makes the disparity between
the speed of the NEON unit and the speed of the memory interface even worse
than it used to be, so revisiting this idea makes a lot of sense now.

Operations such as OVER (and some others) read pixel data from both the
source and destination images, do some calculations and write the result
back to the destination. Two special cases are possible:
a) when a source pixel is fully opaque, the OVER operation can copy it
directly to the destination, and we don't need to read anything from the
destination in this case.
b) when a source pixel is fully transparent, no modification to the
destination is needed, and we may omit both the read and the write for the
corresponding pixel in the destination buffer.

Both of these optimizations may save precious memory bandwidth and improve
overall performance.
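
As a rough illustration of the two special cases, here is a scalar C sketch
of OVER for premultiplied a8r8g8b8 pixels. It is not the pixman code: the
names mul_un8x4 and over_8888_8888_scalar are made up for this example, and
"fully transparent" is assumed to mean a source pixel equal to zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scale each 8-bit channel of x by a/255, rounded. */
static uint32_t mul_un8x4(uint32_t x, uint8_t a)
{
    uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t t = ((x >> shift) & 0xff) * a + 0x80;
        t = (t + (t >> 8)) >> 8;
        result |= t << shift;
    }
    return result;
}

/* Hypothetical scalar OVER for premultiplied a8r8g8b8 scanlines. */
static void over_8888_8888_scalar(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t s = src[i];
        uint8_t alpha = (uint8_t)(s >> 24);

        if (alpha == 0xff) {
            /* case (a): opaque source, destination is written but never read */
            dst[i] = s;
        } else if (s == 0) {
            /* case (b): transparent source, destination is neither read nor written */
        } else {
            /* general case: read, blend and write back the destination */
            dst[i] = s + mul_un8x4(dst[i], (uint8_t)(255 - alpha));
        }
    }
}
```

Only the last branch touches the destination for both a read and a write,
which is where the memory bandwidth goes.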

Because of SIMD processing, we work with blocks of at least 8 pixels. So the
above rules have to be modified to require all of the pixels in the source
block to be fully opaque or fully transparent at once. This makes the trick
less useful for some use cases, such as the rendering of text, where opaque
and transparent pixels are interleaved frequently and there are no long runs
of pixels having the same state. But there are still two possible use cases
which may be important in practice:
a) use of an a8r8g8b8 image which is in fact fully opaque for some reason
(to work better with other backends, for code simplicity, or just because of
a lack of thought about fast performing code). There is an example of such a
proposal for mozilla [2], and other applications may have some uses of fully
opaque images too.
b) a huge, mostly transparent picture, with only some small icon or image
drawn on it. This may happen if an application decides to composite
multiple screen or window sized transparent layers together without trying
to confine the operation to the areas which actually contain
non-transparent pixels.
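
The block-level rule can be sketched in C as follows. This is only an
illustration under assumptions: the block size of 8, the enum and the
function name classify_block are invented for this example, and premultiplied
a8r8g8b8 pixels are assumed (so a fully transparent pixel is zero).

```c
#include <stdint.h>

#define BLOCK_PIXELS 8  /* assumed SIMD block size */

enum block_kind { BLOCK_TRANSPARENT, BLOCK_OPAQUE, BLOCK_MIXED };

/* Hypothetical classifier for one block of source pixels. */
static enum block_kind classify_block(const uint32_t src[BLOCK_PIXELS])
{
    uint32_t any_bits = 0;            /* OR of all pixels: zero only if all are zero */
    uint32_t all_alpha = 0xff000000u; /* AND of alpha bytes: 0xff only if all opaque */

    for (int i = 0; i < BLOCK_PIXELS; i++) {
        any_bits |= src[i];
        all_alpha &= src[i];
    }
    if (any_bits == 0)
        return BLOCK_TRANSPARENT;     /* skip both read and write of destination */
    if (all_alpha == 0xff000000u)
        return BLOCK_OPAQUE;          /* plain copy, no destination read */
    return BLOCK_MIXED;               /* a single odd pixel forces the full blend */
}
```

A single pixel with a different state makes the whole block BLOCK_MIXED,
which is why interleaved text pixels gain little from this.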

Now regarding the ARM NEON code, we have two challenges to overcome:
* The already mentioned NEON to ARM transfer needs to be handled so that it
does not cause a huge performance penalty. Fortunately, pipelining seems to
solve this problem nicely (more about this below).
* Because of the software prefetch, skipping the read of the destination image
pixels does not bring much improvement; moreover, there is no improvement at
all on the newest OMAP3 and OMAP4 devices. This happens because we are
prefetching data into the cache far ahead, so just skipping the read
instruction does not save any memory bandwidth (the data has already been
fetched into the cache by that time). So in order to actually improve
performance for case (a), we need to disable prefetch for the destination
image altogether. This helps a lot for this particular use case (the
performance becomes the same as for the SRC operation), but causes a
noticeable performance regression when dealing with translucent source
images. So we end up having two implementations, each tuned for its own use
case. Selecting between these implementations needs to be done somehow. One
of the ways *might* be to use the value of the source pixel in the top left
corner to make a heuristic guess. Or maybe implement something more
sophisticated (letting the imagination go wild, we may even try to collect
pixel type statistics at runtime and dynamically switch between
implementations).
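
The top left pixel guess mentioned above could look roughly like this in C.
Everything here is hypothetical (the enum, the function name and the exact
rule); it only shows the shape of such a heuristic, not anything that exists
in pixman.

```c
#include <stdint.h>

/* Hypothetical labels for the two tuned OVER implementations. */
enum over_variant {
    OVER_WITH_DST_PREFETCH,    /* tuned for translucent sources */
    OVER_WITHOUT_DST_PREFETCH  /* tuned for fully opaque sources */
};

/*
 * Hypothetical heuristic: look only at the first (top left) source pixel.
 * If it is opaque, guess that the whole image is opaque and pick the
 * variant which does not prefetch the destination.
 */
static enum over_variant guess_over_variant(const uint32_t *src_bits)
{
    return (src_bits[0] >> 24) == 0xff ? OVER_WITHOUT_DST_PREFETCH
                                       : OVER_WITH_DST_PREFETCH;
}
```

A wrong guess only costs performance, not correctness, as long as both
variants implement the same OVER semantics.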

Anyway, the initial step may be some code which at least skips writing to
the destination image when it is not needed, providing up to 20% improvement
when dealing with fully transparent source images.

A proof of concept test patch (for add_8888_8888) may look like this:

index 91ec27d..f2fc5fb 100644

--- a/pixman/pixman-arm-neon-asm.S
+++ b/pixman/pixman-arm-neon-asm.S
@@ -536,14 +536,38 @@ generate_composite_function \
 
 /******************************************************************************/
 
+.macro pixman_composite_add_8888_8888_init
+    add         DUMMY, sp, #ARGS_STACK_OFFSET
+    vpush       {d8-d15}
+    vld1.32     {d11[0]}, [DUMMY]
+    vdup.8      d8, d11[0]
+    vdup.8      d9, d11[1]
+    vdup.8      d10, d11[2]
+    vdup.8      d11, d11[3]
+.endm
+
+.macro pixman_composite_add_8888_8888_cleanup
+    vpop        {d8-d15}
+.endm
+
+
 .macro pixman_composite_add_8888_8888_process_pixblock_tail_head
     fetch_src_pixblock
+    PF vorr.u8  q12, q0, q1
+    PF vorr.u8  d24, d24, d25
+    PF vcnt.u8  d24, d24
+    PF ldr      DUMMY, [sp]!
+    PF vpadd.u8 d24, d24, d24
+    PF vst1.32  d24[0], [sp]
                                     PF add PF_X, PF_X, #8
                                     PF tst PF_CTL, #0xF
     vld1.32     {d4, d5, d6, d7}, [DST_R, :128]!
                                     PF addne PF_X, PF_X, #8
                                     PF subne PF_CTL, PF_CTL, #1
-        vst1.32     {d28, d29, d30, d31}, [DST_W, :128]!
+        PF cmp DUMMY, #0
+        PF beq 5f
+        vst1.32     {d28, d29, d30, d31}, [DST_W, :128]
+5:      add DST_W, DST_W, #32
                                     PF cmp PF_X, ORIG_W
                                     PF pld, [PF_SRC, PF_X, lsl #src_bpp_shift]
                                     PF pld, [PF_DST, PF_X, lsl #dst_bpp_shift]

Here we check the source pixels in order to verify whether all of them are
zero, squeeze the result into a single 32-bit value and put it on the stack.
This value gets picked up from the stack into an ARM register only on the next
loop iteration (hiding the 20 cycle latency) and is used to branch over the
VST1.32 instruction which performs the store to the destination buffer.
Note: this patch is not quite correct because it puts data below the stack
pointer, which is not normally permitted. A clean version of this patch needs
the NEON supplementary code to be modified first, in order to provide some
scratch area on the stack for temporary storage.
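
The control flow of the patch can be modelled in plain C as below. This is a
sketch, not a translation of the assembly: the function names are invented,
the variable pending_flag stands in for the scratch slot on the stack, and
the returned store count exists only to make the effect observable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_PIXELS 8  /* assumed block size, 32 bytes of a8r8g8b8 */

/* Per-channel saturating add of two a8r8g8b8 pixels. */
static uint32_t add_sat_8888(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t t = ((a >> shift) & 0xff) + ((b >> shift) & 0xff);
        if (t > 0xff)
            t = 0xff;
        r |= t << shift;
    }
    return r;
}

/* Hypothetical model of the pipelined loop; returns the number of stores. */
static size_t add_8888_8888_skip_zero(uint32_t *dst, const uint32_t *src,
                                      size_t nblocks)
{
    uint32_t result[BLOCK_PIXELS] = {0};
    uint32_t pending_flag = 0;  /* written in one iteration, read in the next */
    size_t stores = 0;

    for (size_t b = 0; b <= nblocks; b++) {
        /* tail: store block b-1 unless its source was all zero */
        if (b > 0 && pending_flag != 0) {
            memcpy(dst + (b - 1) * BLOCK_PIXELS, result, sizeof(result));
            stores++;
        }
        if (b == nblocks)
            break;

        /* head: OR the source pixels of block b together and save the flag */
        uint32_t or_bits = 0;
        for (int i = 0; i < BLOCK_PIXELS; i++)
            or_bits |= src[b * BLOCK_PIXELS + i];
        pending_flag = or_bits;

        /* head: compute the result of block b, to be stored next iteration */
        for (int i = 0; i < BLOCK_PIXELS; i++)
            result[i] = add_sat_8888(src[b * BLOCK_PIXELS + i],
                                     dst[b * BLOCK_PIXELS + i]);
    }
    return stores;
}
```

The flag is consumed one iteration after it is produced, which is what gives
the slow NEON to ARM transfer time to complete in the real code.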

At least case (b) can be optimized in a more or less straightforward way,
trading off some of the spare NEON unit cycles for better memory bandwidth
utilization (and up to 20% overall speedup). Optimizing case (a) is going to
be a lot more tricky, but should be possible too.


1. http://lists.cairographics.org/archives/cairo/2009-October/018434.html
2. https://bugzilla.mozilla.org/show_bug.cgi?id=587499

-- 
Best regards,
Siarhei Siamashka