[Pixman] Better memory bandwidth utilization in pixman ARM NEON optimizations

Siarhei Siamashka siarhei.siamashka at gmail.com
Tue Dec 14 09:41:55 PST 2010


On Tuesday 14 December 2010 18:32:04 Siarhei Siamashka wrote:

Well, it appears that some updates may be useful or necessary.

> * Because of the software prefetch, skipping read of the destination image
> pixels does not bring much improvement, moreover there is just no
> improvement at all on the newest OMAP3 and OMAP4 devices. This happens
> because we are prefetching data into cache far ahead and just skipping
> read instruction is not providing memory bandwidth saving (the data was
> already fetched into cache by this time).

I forgot to mention one interesting case here. Because the PLD instruction
only prefetches on a TLB hit, there may actually be a performance
improvement when both reading and writing of destination pixels are skipped
(fully transparent source image). But this can only happen if whole pages
of the destination buffer are left untouched (so that the prefetches into
those pages are dropped), which is another constraint.

> index 91ec27d..f2fc5fb 100644
> --- a/pixman/pixman-arm-neon-asm.S
> +++ b/pixman/pixman-arm-neon-asm.S
> @@ -536,14 +536,38 @@ generate_composite_function \
> 
>  /******************************************************************************/
> 
> +.macro pixman_composite_add_8888_8888_init
> +    add         DUMMY, sp, #ARGS_STACK_OFFSET
> +    vpush       {d8-d15}
> +    vld1.32     {d11[0]}, [DUMMY]
> +    vdup.8      d8, d11[0]
> +    vdup.8      d9, d11[1]
> +    vdup.8      d10, d11[2]
> +    vdup.8      d11, d11[3]
> +.endm
> +
> +.macro pixman_composite_add_8888_8888_cleanup
> +    vpop        {d8-d15}
> +.endm
> +
> +
>  .macro pixman_composite_add_8888_8888_process_pixblock_tail_head
>      fetch_src_pixblock
> +    PF vorr.u8  q12, q0, q1
> +    PF vorr.u8  d24, d24, d25
> +    PF vcnt.u8  d24, d24
> +    PF ldr      DUMMY, [sp]!
> +    PF vpadd.u8 d24, d24, d24
> +    PF vst1.32  d24[0], [sp]
>                                      PF add PF_X, PF_X, #8
>                                      PF tst PF_CTL, #0xF
>      vld1.32     {d4, d5, d6, d7}, [DST_R, :128]!
>                                      PF addne PF_X, PF_X, #8
>                                      PF subne PF_CTL, PF_CTL, #1
> -        vst1.32     {d28, d29, d30, d31}, [DST_W, :128]!
> +        PF cmp DUMMY, #0
> +        PF beq 5f
> +        vst1.32     {d28, d29, d30, d31}, [DST_W, :128]
> +5:      add DST_W, DST_W, #32
>                                      PF cmp PF_X, ORIG_W
>                                      PF pld, [PF_SRC, PF_X, lsl #src_bpp_shift]
>                                      PF pld, [PF_DST, PF_X, lsl #dst_bpp_shift]

As it happens, I messed up and attached the wrong work-in-progress patch
here (not the one that was intended) :) Naturally, the dangling 'init' and
'cleanup' macros don't affect anything, and the 'head' macro needs to be
updated too in order to put the correct initial value on the stack; otherwise
the outcome of the first branch is undefined. Anyway, I think I just need to
provide a final patch a bit later and be done with it.
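
For anyone following along, the idea the quoted experiment is poking at can
be roughly sketched in scalar C. This is only an illustration under
assumptions, not the NEON code: the function names are made up for this
example, the block size of 8 pixels simply mirrors the 8-pixel blocks in the
quoted macro, and the sketch skips both the destination read and the write,
whereas the quoted work-in-progress diff only skips the store.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if any of the n source pixels is nonzero.
 * OR-ing everything together plays the role of the vorr reduction. */
static int block_has_nonzero(const uint32_t *src, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc |= src[i];
    return acc != 0;
}

/* Per-channel saturating add of two 8888 pixels. */
static uint32_t add_sat_8888(uint32_t s, uint32_t d)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t c = ((s >> shift) & 0xFF) + ((d >> shift) & 0xFF);
        if (c > 0xFF)
            c = 0xFF;
        out |= c << shift;
    }
    return out;
}

/* ADD compositing in blocks of 8 pixels. A block whose source is all zero
 * cannot change the destination, so it is left untouched.
 * Returns the number of skipped blocks (for illustration only). */
static size_t add_8888_8888_skip_zero(uint32_t *dst, const uint32_t *src,
                                      size_t n_pixels)
{
    size_t skipped = 0;
    size_t i = 0;

    for (; i + 8 <= n_pixels; i += 8) {
        if (!block_has_nonzero(src + i, 8)) {
            skipped++;
            continue;
        }
        for (size_t j = 0; j < 8; j++)
            dst[i + j] = add_sat_8888(src[i + j], dst[i + j]);
    }

    /* Leftover pixels are processed without the skip check. */
    for (; i < n_pixels; i++)
        dst[i] = add_sat_8888(src[i], dst[i]);

    return skipped;
}
```

Whether skipping like this saves any memory bandwidth in practice depends on
the prefetch behaviour described above; the sketch only shows the control
flow, not the performance effect.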

-- 
Best regards,
Siarhei Siamashka