[Pixman] [PATCH 1/2] ARM: NEON: Better instruction scheduling of over_n_8_8888
Siarhei Siamashka
siarhei.siamashka at gmail.com
Thu Aug 18 01:41:53 PDT 2011
On Tue, Aug 16, 2011 at 4:34 PM, Taekyun Kim <podain77 at gmail.com> wrote:
Assuming that the pixman tests pass (I'll try some long-running tests a
bit later myself), this looks good to me.
Just a few cosmetic nitpicks below. And they apply to your other patch too.
> From: Taekyun Kim <tkq.kim at samsung.com>
>
> tail/head block is expanded and reordered to eliminate stalls.
It would be a good idea to add the benchmark numbers from your cover
letter here in the commit message.
> ---
> pixman/pixman-arm-neon-asm.S | 77 ++++++++++++++++++++++++++++++-----------
> 1 files changed, 56 insertions(+), 21 deletions(-)
>
> diff --git a/pixman/pixman-arm-neon-asm.S b/pixman/pixman-arm-neon-asm.S
> index 3dc14d7..45fce27 100644
> --- a/pixman/pixman-arm-neon-asm.S
> +++ b/pixman/pixman-arm-neon-asm.S
> @@ -1183,26 +1183,26 @@ generate_composite_function \
> /* mask is in d24 (d25, d26, d27 are unused) */
>
> /* in */
> - vmull.u8 q0, d24, d8
> - vmull.u8 q1, d24, d9
> - vmull.u8 q6, d24, d10
> - vmull.u8 q7, d24, d11
> - vrshr.u16 q10, q0, #8
> - vrshr.u16 q11, q1, #8
> - vrshr.u16 q12, q6, #8
> - vrshr.u16 q13, q7, #8
> - vraddhn.u16 d0, q0, q10
> - vraddhn.u16 d1, q1, q11
> - vraddhn.u16 d2, q6, q12
> - vraddhn.u16 d3, q7, q13
> - vmvn.8 d24, d3 /* get inverted alpha */
> + vmull.u8 q6, d24, d8
> + vmull.u8 q7, d24, d9
> + vmull.u8 q8, d24, d10
> + vmull.u8 q9, d24, d11
> + vrshr.u16 q10, q6, #8
> + vrshr.u16 q11, q7, #8
> + vrshr.u16 q12, q8, #8
> + vrshr.u16 q13, q9, #8
> + vraddhn.u16 d0, q6, q10
> + vraddhn.u16 d1, q7, q11
> + vraddhn.u16 d2, q8, q12
> + vraddhn.u16 d3, q9, q13
> + vmvn.8 d25, d3 /* get inverted alpha */
> /* source: d0 - blue, d1 - green, d2 - red, d3 - alpha */
> /* destination: d4 - blue, d5 - green, d6 - red, d7 - alpha */
> /* now do alpha blending */
> - vmull.u8 q8, d24, d4
> - vmull.u8 q9, d24, d5
> - vmull.u8 q10, d24, d6
> - vmull.u8 q11, d24, d7
> + vmull.u8 q8, d25, d4
> + vmull.u8 q9, d25, d5
> + vmull.u8 q10, d25, d6
> + vmull.u8 q11, d25, d7
> .endm
>
> .macro pixman_composite_over_n_8_8888_process_pixblock_tail
> @@ -1220,12 +1220,47 @@ generate_composite_function \
>
> /* TODO: expand macros and do better instructions scheduling */
You have done it. So I guess there is no need to keep this "TODO"
comment anymore :)
> .macro pixman_composite_over_n_8_8888_process_pixblock_tail_head
> - pixman_composite_over_n_8_8888_process_pixblock_tail
> - vst4.8 {d28, d29, d30, d31}, [DST_W, :128]!
> + vrshr.u16 q14, q8, #8
> vld4.8 {d4, d5, d6, d7}, [DST_R, :128]!
> + vrshr.u16 q15, q9, #8
> fetch_mask_pixblock
> - cache_preload 8, 8
> - pixman_composite_over_n_8_8888_process_pixblock_head
> + vrshr.u16 q12, q10, #8
> + PF add PF_X, PF_X, #8
> + vrshr.u16 q13, q11, #8
> + PF tst PF_CTL, #0x0F
> + vraddhn.u16 d28, q14, q8
> + PF addne PF_X, PF_X, #8
> + vraddhn.u16 d29, q15, q9
> + PF subne PF_CTL, PF_CTL, #1
> + vraddhn.u16 d30, q12, q10
> + PF cmp PF_X, ORIG_W
> + vraddhn.u16 d31, q13, q11
> + PF pld, [PF_DST, PF_X, lsl #dst_bpp_shift]
> + vmull.u8 q6, d24, d8
> + PF pld, [PF_MASK, PF_X, lsl #mask_bpp_shift]
> + vmull.u8 q7, d24, d9
> + PF subge PF_X, PF_X, ORIG_W
> + vmull.u8 q8, d24, d10
> + PF subges PF_CTL, PF_CTL, #0x10
> + vmull.u8 q9, d24, d11
> + PF ldrgeb DUMMY, [PF_DST, DST_STRIDE, lsl #dst_bpp_shift]!
> + vqadd.u8 q14, q0, q14
> + PF ldrgeb DUMMY, [PF_MASK, MASK_STRIDE, lsl #mask_bpp_shift]!
> + vqadd.u8 q15, q1, q15
> + vrshr.u16 q10, q6, #8
> + vrshr.u16 q11, q7, #8
> + vrshr.u16 q12, q8, #8
> + vrshr.u16 q13, q9, #8
> + vraddhn.u16 d0, q6, q10
> + vraddhn.u16 d1, q7, q11
> + vraddhn.u16 d2, q8, q12
> + vraddhn.u16 d3, q9, q13
> + vst4.8 {d28, d29, d30, d31}, [DST_W, :128]!
> + vmvn.8 d25, d3
> + vmull.u8 q8, d25, d4
> + vmull.u8 q9, d25, d5
> + vmull.u8 q10, d25, d6
> + vmull.u8 q11, d25, d7
> .endm
>
> .macro pixman_composite_over_n_8_8888_init
> --
> 1.7.1
--
Best regards,
Siarhei Siamashka