[Pixman] [PATCH 2/2] ARMv6: Add fast path for in_reverse_8888_8888
Pekka Paalanen
ppaalanen at gmail.com
Mon Mar 31 05:54:24 PDT 2014
From: Ben Avison <bavison at riscosopen.org>
Benchmark results: "before" is the tree with the patch
- ARMv6: Add fast path for over_n_8888_8888_ca
applied, and "after" additionally contains the patches:
- ARMv6: Add fast path flag to force no preload of destination buffer
- ARMv6: Add fast path for in_reverse_8888_8888 (this patch)
lowlevel-blt-bench, in_reverse_8888_8888, 100 iterations:
       Before          After
     Mean StdDev    Mean StdDev   Confidence   Change
L1   21.1    0.1    32.0    0.1      100.00%   +51.9%
L2   11.7    0.3    18.4    0.5      100.00%   +56.9%
M    10.5    0.0    16.3    0.0      100.00%   +54.8%
HT    8.2    0.0    12.0    0.0      100.00%   +46.7%
VT    8.1    0.0    11.8    0.0      100.00%   +45.4%
R     8.0    0.0    11.2    0.0      100.00%   +40.0%
RT    4.7    0.0     6.0    0.1      100.00%   +28.1%
At most 14 outliers rejected per case per set.
cairo-perf-trace with trimmed traces, 30 iterations:
                                    Before          After
                                  Mean StdDev    Mean StdDev   Confidence   Change
t-firefox-paintball.trace         17.9    0.0    14.0    0.0      100.00%   +27.8%
t-firefox-chalkboard.trace        36.6    0.0    35.8    0.0      100.00%    +2.1%
t-firefox-canvas-alpha.trace      20.7    0.3    20.3    0.3      100.00%    +1.7%
t-firefox-particles.trace         27.5    0.1    27.1    0.1      100.00%    +1.3%
t-chromium-tabs.trace              4.9    0.0     4.8    0.0      100.00%    +1.1%
t-evolution.trace                 13.0    0.1    12.9    0.1      100.00%    +1.0%
t-swfdec-youtube.trace             7.8    0.0     7.7    0.0      100.00%    +0.8%
t-gvim.trace                      33.0    0.2    32.8    0.2      100.00%    +0.7%
t-gnome-terminal-vim.trace        19.8    0.2    19.7    0.2       99.46%    +0.6%
t-grads-heat-map.trace             4.4    0.0     4.4    0.0       99.32%    +0.6%
t-firefox-fishbowl.trace          21.1    0.0    21.0    0.0      100.00%    +0.5%
t-firefox-planet-gnome.trace      10.9    0.0    10.8    0.0      100.00%    +0.4%
t-firefox-canvas-swscroll.trace   32.1    0.1    32.0    0.1      100.00%    +0.4%
t-firefox-fishtank.trace          13.2    0.0    13.1    0.0      100.00%    +0.4%
t-firefox-asteroids.trace         11.1    0.0    11.0    0.0      100.00%    +0.4%
t-firefox-canvas.trace            17.9    0.0    17.9    0.0       99.99%    +0.3%
t-poppler.trace                    9.7    0.1     9.7    0.1       79.51%    +0.2%  (insignificant)
t-firefox-talos-svg.trace         20.4    0.0    20.4    0.0       97.25%    +0.1%  (insignificant)
t-swfdec-giant-steps.trace        14.8    0.0    14.8    0.0       96.75%    +0.1%  (insignificant)
t-firefox-scrolling.trace         24.6    0.1    24.6    0.1       31.24%    +0.1%  (insignificant)
t-midori-zoomed.trace              8.0    0.0     8.0    0.0       50.76%    +0.0%  (insignificant)
t-gnome-system-monitor.trace      17.1    0.0    17.1    0.0        4.49%    -0.0%  (insignificant)
t-xfce4-terminal-a1.trace          4.8    0.0     4.8    0.0       98.08%    -0.2%  (insignificant)
t-poppler-reseau.trace            22.1    0.1    22.2    0.1       93.89%    -0.3%  (insignificant)
t-firefox-talos-gfx.trace         25.4    0.4    25.5    0.5       75.53%    -0.5%  (insignificant)
At most 4 outliers rejected per case per set.
Cairo perf reports the running time, but the change is computed for
operations per second instead (inverse of running time).
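To make the sign convention concrete, here is a minimal sketch of how a "Change" figure can be derived from two running times. The function name ops_change_percent is hypothetical and is not part of cairo-perf or of the scripts used to produce the tables above.

```c
#include <assert.h>

/*
 * Relative change in operations per second, in percent, given two
 * running times.  Throughput is proportional to 1/time, so
 *   change = (1/after) / (1/before) - 1 = before/after - 1.
 * A shorter "after" time therefore yields a positive change.
 * Hypothetical helper for illustration only.
 */
static double ops_change_percent(double time_before, double time_after)
{
    assert(time_before > 0.0 && time_after > 0.0);
    return (time_before / time_after - 1.0) * 100.0;
}
```

Fed the rounded means of the t-firefox-paintball.trace row (17.9 and 14.0), this gives roughly +27.9%. The table says +27.8%, presumably because it was computed from the unrounded means.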
Confidence is based on Welch's t-test. Absolute changes of less than 1%
can be attributed to measurement error, even if they are statistically
significant.
There was a question of why FLAG_NO_PRELOAD_DST exists. If a patch
removing that flag from pixman-arm-simd-asm.S is added on top, the
change will be the following.
Before: flag in use
After: flag removed
       Before          After
     Mean StdDev    Mean StdDev   Confidence   Change
L1   32.0    0.1    31.8    0.1      100.00%    -0.6%
L2   18.4    0.5    25.0    0.5      100.00%   +36.0%
M    16.3    0.0    25.7    0.0      100.00%   +57.9%
HT   12.0    0.0    13.9    0.0      100.00%   +16.4%
VT   11.8    0.0    13.2    0.0      100.00%   +12.4%
R    11.2    0.0    14.0    0.0      100.00%   +24.3%
RT    6.0    0.1     7.0    0.1      100.00%   +15.1%
                                    Before          After
                                  Mean StdDev    Mean StdDev   Confidence   Change
t-chromium-tabs.trace              4.8    0.0     4.8    0.0      100.00%    +0.7%
t-poppler-reseau.trace            22.2    0.1    22.1    0.1       99.98%    +0.6%
t-poppler.trace                    9.7    0.1     9.6    0.1       99.70%    +0.5%
t-firefox-talos-gfx.trace         25.5    0.5    25.4    0.3       72.06%    +0.5%  (insignificant)
t-firefox-canvas-alpha.trace      20.3    0.3    20.2    0.2       80.88%    +0.4%  (insignificant)
t-firefox-canvas.trace            17.9    0.0    17.8    0.0       99.36%    +0.2%
t-firefox-canvas-swscroll.trace   32.0    0.1    31.9    0.1       84.83%    +0.1%  (insignificant)
t-firefox-asteroids.trace         11.0    0.0    11.0    0.0      100.00%    +0.1%
t-midori-zoomed.trace              8.0    0.0     8.0    0.0       99.90%    +0.1%
t-firefox-planet-gnome.trace      10.8    0.0    10.8    0.0       91.34%    +0.1%  (insignificant)
t-firefox-scrolling.trace         24.6    0.1    24.6    0.1        0.53%    +0.0%  (insignificant)
t-gnome-terminal-vim.trace        19.7    0.2    19.7    0.1       11.42%    -0.0%  (insignificant)
t-firefox-talos-svg.trace         20.4    0.0    20.4    0.0       54.68%    -0.0%  (insignificant)
t-swfdec-giant-steps.trace        14.8    0.0    14.8    0.0       78.92%    -0.0%  (insignificant)
t-firefox-fishtank.trace          13.1    0.0    13.1    0.0       97.09%    -0.0%  (insignificant)
t-gnome-system-monitor.trace      17.1    0.0    17.1    0.0       65.13%    -0.0%  (insignificant)
t-evolution.trace                 12.9    0.1    12.9    0.1       34.70%    -0.1%  (insignificant)
t-grads-heat-map.trace             4.4    0.0     4.4    0.0       28.95%    -0.1%  (insignificant)
t-firefox-fishbowl.trace          21.0    0.0    21.0    0.0       99.92%    -0.2%
t-xfce4-terminal-a1.trace          4.8    0.0     4.8    0.0       98.78%    -0.2%  (insignificant)
t-firefox-particles.trace         27.1    0.1    27.3    0.1       99.89%    -0.5%
t-swfdec-youtube.trace             7.7    0.0     7.8    0.0      100.00%    -0.7%
t-gvim.trace                      32.8    0.2    33.1    0.2      100.00%    -0.9%
t-firefox-chalkboard.trace        35.8    0.0    37.1    0.0      100.00%    -3.3%
t-firefox-paintball.trace         14.0    0.0    15.0    0.0      100.00%    -6.2%
IOW, the flag has adverse effects on lowlevel-blt-bench performance,
but improves one or two Cairo traces slightly.
v4, Pekka Paalanen <pekka.paalanen at collabora.co.uk> :
Rebased, re-benchmarked on Raspberry Pi, updated the commit message.
---
Should I re-spin this without the flag? Ben?
It should not need a new benchmarking night, since I already have
the numbers.
Thanks,
pq
---
pixman/pixman-arm-simd-asm.S | 103 +++++++++++++++++++++++++++++++++++++++++++
pixman/pixman-arm-simd.c | 7 +++
2 files changed, 110 insertions(+)
diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
index 7bb18cb..d926226 100644
--- a/pixman/pixman-arm-simd-asm.S
+++ b/pixman/pixman-arm-simd-asm.S
@@ -954,3 +954,106 @@ generate_composite_function \
/******************************************************************************/
+.macro in_reverse_8888_8888_init
+ /* Hold loop invariant in MASK */
+ ldr MASK, =0x00800080
+ /* Set GE[3:0] to 0101 so SEL instructions do what we want */
+ uadd8 SCRATCH, MASK, MASK
+ /* Offset the source pointer: we only need the alpha bytes */
+ add SRC, SRC, #3
+ line_saved_regs ORIG_W
+.endm
+
+.macro in_reverse_8888_8888_head numbytes, reg1, reg2, reg3
+ ldrb ORIG_W, [SRC], #4
+ .if numbytes >= 8
+ ldrb WK&reg1, [SRC], #4
+ .if numbytes == 16
+ ldrb WK&reg2, [SRC], #4
+ ldrb WK&reg3, [SRC], #4
+ .endif
+ .endif
+ add DST, DST, #numbytes
+.endm
+
+.macro in_reverse_8888_8888_process_head cond, numbytes, firstreg, unaligned_src, unaligned_mask, preload
+ in_reverse_8888_8888_head numbytes, firstreg, %(firstreg+1), %(firstreg+2)
+.endm
+
+.macro in_reverse_8888_8888_1pixel s, d, offset, is_only
+ .if is_only != 1
+ movs s, ORIG_W
+ .if offset != 0
+ ldrb ORIG_W, [SRC, #offset]
+ .endif
+ beq 01f
+ teq STRIDE_M, #0xFF
+ beq 02f
+ .endif
+ uxtb16 SCRATCH, d /* rb_dest */
+ uxtb16 d, d, ror #8 /* ag_dest */
+ mla SCRATCH, SCRATCH, s, MASK
+ mla d, d, s, MASK
+ uxtab16 SCRATCH, SCRATCH, SCRATCH, ror #8
+ uxtab16 d, d, d, ror #8
+ mov SCRATCH, SCRATCH, ror #8
+ sel d, SCRATCH, d
+ b 02f
+ .if offset == 0
+48: /* Last mov d,#0 of the set - used as part of shortcut for
+ * source values all 0 */
+ .endif
+01: mov d, #0
+02:
+.endm
+
+.macro in_reverse_8888_8888_tail numbytes, reg1, reg2, reg3, reg4
+ .if numbytes == 4
+ teq ORIG_W, ORIG_W, asr #32
+ ldrne WK&reg1, [DST, #-4]
+ .elseif numbytes == 8
+ teq ORIG_W, WK&reg1
+ teqeq ORIG_W, ORIG_W, asr #32 /* all 0 or all -1? */
+ ldmnedb DST, {WK&reg1-WK&reg2}
+ .else
+ teq ORIG_W, WK&reg1
+ teqeq ORIG_W, WK&reg2
+ teqeq ORIG_W, WK&reg3
+ teqeq ORIG_W, ORIG_W, asr #32 /* all 0 or all -1? */
+ ldmnedb DST, {WK&reg1-WK&reg4}
+ .endif
+ cmnne DST, #0 /* clear C if NE */
+ bcs 49f /* no writes to dest if source all -1 */
+ beq 48f /* set dest to all 0 if source all 0 */
+ .if numbytes == 4
+ in_reverse_8888_8888_1pixel ORIG_W, WK&reg1, 0, 1
+ str WK&reg1, [DST, #-4]
+ .elseif numbytes == 8
+ in_reverse_8888_8888_1pixel STRIDE_M, WK&reg1, -4, 0
+ in_reverse_8888_8888_1pixel STRIDE_M, WK&reg2, 0, 0
+ stmdb DST, {WK&reg1-WK&reg2}
+ .else
+ in_reverse_8888_8888_1pixel STRIDE_M, WK&reg1, -12, 0
+ in_reverse_8888_8888_1pixel STRIDE_M, WK&reg2, -8, 0
+ in_reverse_8888_8888_1pixel STRIDE_M, WK&reg3, -4, 0
+ in_reverse_8888_8888_1pixel STRIDE_M, WK&reg4, 0, 0
+ stmdb DST, {WK&reg1-WK&reg4}
+ .endif
+49:
+.endm
+
+.macro in_reverse_8888_8888_process_tail cond, numbytes, firstreg
+ in_reverse_8888_8888_tail numbytes, firstreg, %(firstreg+1), %(firstreg+2), %(firstreg+3)
+.endm
+
+generate_composite_function \
+ pixman_composite_in_reverse_8888_8888_asm_armv6, 32, 0, 32 \
+ FLAG_DST_READWRITE | FLAG_BRANCH_OVER | FLAG_PROCESS_CORRUPTS_PSR | FLAG_PROCESS_DOES_STORE | FLAG_SPILL_LINE_VARS | FLAG_PROCESS_CORRUPTS_SCRATCH | FLAG_NO_PRELOAD_DST \
+ 2, /* prefetch distance */ \
+ in_reverse_8888_8888_init, \
+ nop_macro, /* newline */ \
+ nop_macro, /* cleanup */ \
+ in_reverse_8888_8888_process_head, \
+ in_reverse_8888_8888_process_tail
+
+/******************************************************************************/
diff --git a/pixman/pixman-arm-simd.c b/pixman/pixman-arm-simd.c
index dd6b907..c17ce5a 100644
--- a/pixman/pixman-arm-simd.c
+++ b/pixman/pixman-arm-simd.c
@@ -46,6 +46,8 @@ PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, add_8_8,
uint8_t, 1, uint8_t, 1)
PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, over_8888_8888,
uint32_t, 1, uint32_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, in_reverse_8888_8888,
+ uint32_t, 1, uint32_t, 1)
PIXMAN_ARM_BIND_FAST_PATH_N_DST (0, armv6, over_reverse_n_8888,
uint32_t, 1)
@@ -241,6 +243,11 @@ static const pixman_fast_path_t arm_simd_fast_paths[] =
PIXMAN_STD_FAST_PATH (OVER, solid, a8, a8b8g8r8, armv6_composite_over_n_8_8888),
PIXMAN_STD_FAST_PATH (OVER, solid, a8, x8b8g8r8, armv6_composite_over_n_8_8888),
+ PIXMAN_STD_FAST_PATH (IN_REVERSE, a8r8g8b8, null, a8r8g8b8, armv6_composite_in_reverse_8888_8888),
+ PIXMAN_STD_FAST_PATH (IN_REVERSE, a8r8g8b8, null, x8r8g8b8, armv6_composite_in_reverse_8888_8888),
+ PIXMAN_STD_FAST_PATH (IN_REVERSE, a8b8g8r8, null, a8b8g8r8, armv6_composite_in_reverse_8888_8888),
+ PIXMAN_STD_FAST_PATH (IN_REVERSE, a8b8g8r8, null, x8b8g8r8, armv6_composite_in_reverse_8888_8888),
+
PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8r8g8b8, a8r8g8b8, armv6_composite_over_n_8888_8888_ca),
PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8r8g8b8, x8r8g8b8, armv6_composite_over_n_8888_8888_ca),
PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8b8g8r8, a8b8g8r8, armv6_composite_over_n_8888_8888_ca),
--
1.8.3.2
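
For readers following the assembly in the patch, here is a rough C sketch of what the per-pixel code appears to compute: each destination channel is multiplied by the source alpha and divided by 255 with rounding, two channels at a time, using the 0x00800080 constant that the init macro keeps in MASK. The function names are hypothetical. This is an illustration of the arithmetic under those assumptions, not code from pixman.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Multiply two 8-bit channels packed as 0x00XX00YY by an alpha value
 * (0..255) and divide by 255 with rounding.  Hypothetical helper.
 */
static uint32_t mul_div255_x2(uint32_t two_channels, uint32_t alpha)
{
    assert(alpha <= 0xFF);
    assert((two_channels & ~0x00FF00FFu) == 0);
    /* mla: product plus rounding bias 0x80 in each halfword */
    uint32_t t = two_channels * alpha + 0x00800080u;
    /* uxtab16 ..., ror #8: add the high byte of each halfword to itself;
     * each halfword stays below 0x10000, so nothing carries across */
    t += (t >> 8) & 0x00FF00FFu;
    /* keep the high byte of each halfword as the result */
    return (t >> 8) & 0x00FF00FFu;
}

/* IN_REVERSE for one a8r8g8b8 pixel: dest = dest * src_alpha / 255. */
static uint32_t in_reverse_pixel(uint32_t src, uint32_t dst)
{
    uint32_t alpha = src >> 24;
    if (alpha == 0)
        return 0;        /* source fully transparent: dest becomes 0 */
    if (alpha == 0xFF)
        return dst;      /* source fully opaque: dest is unchanged */
    uint32_t rb = mul_div255_x2(dst & 0x00FF00FFu, alpha);
    uint32_t ag = mul_div255_x2((dst >> 8) & 0x00FF00FFu, alpha);
    return rb | (ag << 8);
}
```

The two early returns correspond to the shortcuts in the tail macro, which skips the destination write when every source alpha in the block is 0xFF and stores zeros when they are all 0.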