[Pixman] [PATCH 1/2] ARM: Tiny improvement in over_n_8888_8888_ca_process_pixblock_head
Siarhei Siamashka
siarhei.siamashka at gmail.com
Wed Apr 6 10:40:08 PDT 2011
On Mon, Apr 4, 2011 at 6:33 PM, Søren Sandmann Pedersen
<sandmann at cs.au.dk> wrote:
> From: Søren Sandmann Pedersen <ssp at redhat.com>
>
> Instead of two
>
>     mvn d24, d24
>     mvn d25, d25
>
> use just one
>
>     mvn q12, q12
Thanks, that's a good catch. At the very least this patch reduces code size.
As far as performance is concerned, it removes one instruction but
introduces a one-cycle pipeline stall instead, so there is no real speedup:
vmvn.8 q12, q12
/* 1 cycle stall */
vmull.u8 q8, d24, d4
vmull.u8 q9, d25, d5
vmvn.8 d26, d26
vmvn.8 d27, d3
vmull.u8 q10, d26, d6
vmull.u8 q11, d27, d7
/* the rest of the code is from the *_tail_head macro */
vst4.8 {d28, d29, d30, d31}, [DST_W, :128]!
vrshr.u16 q14, q8, #8
vrshr.u16 q15, q9, #8
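
As a side note on why the patch itself is safe: q12 is the same storage as the d24:d25 pair, so one Q-register "mvn" inverts the same 16 bytes as the two D-register ones. A minimal C sketch of that aliasing follows. The type and helper names (qreg_t, mvn_d, mvn_q, mvn_forms_agree) are made up for illustration, and this models only the bitwise result, not anything about timing.

```c
#include <stdint.h>
#include <string.h>

/* Toy model (assumption): one 128-bit Q register viewed either as
 * 16 byte lanes or as two 64-bit D halves, i.e. q12 = d24:d25. */
typedef union {
    uint8_t q[16];   /* the whole Q register */
    uint8_t d[2][8]; /* d[0] stands for d24, d[1] for d25 */
} qreg_t;

/* Bitwise NOT of one D register (8 byte lanes). */
static void mvn_d(uint8_t d[8])
{
    for (int i = 0; i < 8; i++)
        d[i] = (uint8_t)~d[i];
}

/* Bitwise NOT of one Q register (16 byte lanes). */
static void mvn_q(uint8_t q[16])
{
    for (int i = 0; i < 16; i++)
        q[i] = (uint8_t)~q[i];
}

/* Returns 1 if "mvn d24; mvn d25" and "mvn q12" give the same bytes. */
static int mvn_forms_agree(void)
{
    qreg_t a, b;

    for (int i = 0; i < 16; i++)
        a.q[i] = (uint8_t)(i * 37 + 11); /* arbitrary test pattern */
    b = a;

    mvn_d(a.d[0]); /* mvn d24, d24 */
    mvn_d(a.d[1]); /* mvn d25, d25 */
    mvn_q(b.q);    /* mvn q12, q12 */

    return memcmp(a.q, b.q, sizeof a.q) == 0;
}
```

Calling mvn_forms_agree() from any small test program should return 1.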
The following change actually makes the code run 1 cycle faster (~2%
overall speedup when working with perfectly cached data):
vmvn.8 q12, q12
vmvn.8 d26, d26 /* <<<<< moved here */
vmull.u8 q8, d24, d4
vmull.u8 q9, d25, d5
/* from here */
vmvn.8 d27, d3
vmull.u8 q10, d26, d6
vmull.u8 q11, d27, d7
/* the rest of the code is from the *_tail_head macro */
vst4.8 {d28, d29, d30, d31}, [DST_W, :128]!
vrshr.u16 q14, q8, #8
vrshr.u16 q15, q9, #8
Moving "vmvn.8 d26, d26" up does not cause other performance issues
because the *_tail_head macro has a "vst4.8" instruction there, which
helps to avoid a potential stall (one typically needs to do something
unrelated for 5 cycles before using the multiplication result).
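
The move is also safe from a data-flow point of view: "vmvn.8 d26, d26" touches only d26, while the two "vmull.u8" instructions it moves past read d24, d25, d4, d5 and write q8, q9 (d16-d19). Below is a rough C sketch that checks this by running both orderings on a toy register file and comparing the final state. Everything here (neon_regs, the vmvn_d/vmvn_q/vmull_u8 helpers, orderings_agree) is a hypothetical scalar model written for illustration. It says nothing about Cortex-A8 cycle counts, only that the results are identical.

```c
#include <stdint.h>
#include <string.h>

/* Toy register file (assumption): 32 D registers of 8 bytes each,
 * with Q register n aliased to D registers 2n and 2n+1. */
typedef struct { uint8_t d[32][8]; } neon_regs;

/* vmvn.8 Dd, Dm: bitwise NOT of each byte lane. */
static void vmvn_d(neon_regs *r, int dd, int dm)
{
    for (int i = 0; i < 8; i++)
        r->d[dd][i] = (uint8_t)~r->d[dm][i];
}

/* vmvn.8 Qd, Qm: the same operation on both D halves. */
static void vmvn_q(neon_regs *r, int qd, int qm)
{
    vmvn_d(r, 2 * qd, 2 * qm);
    vmvn_d(r, 2 * qd + 1, 2 * qm + 1);
}

/* vmull.u8 Qd, Dn, Dm: eight widening 8x8 -> 16 bit multiplies. */
static void vmull_u8(neon_regs *r, int qd, int dn, int dm)
{
    uint16_t prod[8];

    for (int i = 0; i < 8; i++)
        prod[i] = (uint16_t)(r->d[dn][i] * r->d[dm][i]);
    memcpy(r->d[2 * qd], prod, 8);         /* lanes 0-3 -> low D half  */
    memcpy(r->d[2 * qd + 1], prod + 4, 8); /* lanes 4-7 -> high D half */
}

/* Instruction order with the one-cycle stall. */
static void run_original(neon_regs *r)
{
    vmvn_q(r, 12, 12);
    vmull_u8(r, 8, 24, 4);
    vmull_u8(r, 9, 25, 5);
    vmvn_d(r, 26, 26);
    vmvn_d(r, 27, 3);
    vmull_u8(r, 10, 26, 6);
    vmull_u8(r, 11, 27, 7);
}

/* Same instructions with "vmvn.8 d26, d26" moved up. */
static void run_reordered(neon_regs *r)
{
    vmvn_q(r, 12, 12);
    vmvn_d(r, 26, 26); /* moved here */
    vmull_u8(r, 8, 24, 4);
    vmull_u8(r, 9, 25, 5);
    vmvn_d(r, 27, 3);
    vmull_u8(r, 10, 26, 6);
    vmull_u8(r, 11, 27, 7);
}

/* Returns 1 if both orderings leave the register file identical. */
static int orderings_agree(uint32_t seed)
{
    neon_regs a, b;

    /* Fill every register with pseudo-random bytes (simple LCG). */
    for (int reg = 0; reg < 32; reg++) {
        for (int i = 0; i < 8; i++) {
            seed = seed * 1103515245u + 12345u;
            a.d[reg][i] = (uint8_t)(seed >> 16);
        }
    }
    b = a;

    run_original(&a);
    run_reordered(&b);
    return memcmp(&a, &b, sizeof a) == 0;
}
```

orderings_agree() should return 1 for any seed, since the moved instruction has no register in common with the two multiplies it passes.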
On a second look, other than that, the whole function is actually fine
on Cortex-A8 and does not need any other performance tweaks because it
did not have many "bubbles" to fill in the first place. You can push
this patch with the additional "vmvn.8 d26, d26" instruction move (I
don't think a separate patch is needed for that).
--
Best regards,
Siarhei Siamashka