[Pixman] [PATCH 2/2] ARMv6: Speed improvement to L1 cache constrained blits.
Ben Avison
bavison at riscosopen.org
Mon Jan 14 11:16:41 PST 2013
This was achieved by:
* if only one of source and mask is being prefetched and the pixel
  processing doesn't use the SCRATCH register, using SCRATCH to precalculate
  the preload offset such that it falls in the L2 cache sweet spot
* enabling the line variables to be spilled to the stack only in the wide
  case, rather than in all cases
* enabling the central loop to be entirely replaced on a per-operation basis;
  this benefits blits in particular, since they can use alternate banks of
  4 registers to avoid interlocks (though it should be noted that these
  stalls are totally subsumed by the access latency of the L2 cache, let
  alone main memory)
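As an illustration of the first point, here is a minimal C model of the two
preload address calculations found in the diff (bic/pld versus and/rsb
followed by pld [base, SCRATCH]). The function names and the two constants
are hypothetical and exist only for this sketch; it also assumes the base
pointer advances by whole 32-byte cache lines between preloads, as it does
in the replacement blit loop.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 32-byte cache lines and the prefetch distance
 * of 3 that the blit functions pass to generate_composite_function. */
#define CACHE_LINE 32u
#define PREFETCH_DISTANCE 3u

/* Old scheme, recomputed at every preload:
 *   bic SCRATCH, base, #31
 *   pld [SCRATCH, #32*prefetch_distance] */
uintptr_t preload_addr_recomputed(uintptr_t base)
{
    return (base & ~(uintptr_t)(CACHE_LINE - 1)) + CACHE_LINE * PREFETCH_DISTANCE;
}

/* New scheme, computed once before the inner loop:
 *   and SCRATCH, base, #31
 *   rsb SCRATCH, SCRATCH, #32*prefetch_distance
 * The loop then only needs: pld [base, SCRATCH] */
uintptr_t preload_offset(uintptr_t base)
{
    return CACHE_LINE * PREFETCH_DISTANCE - (base & (CACHE_LINE - 1));
}

/* Returns 1 if base + offset matches the recomputed address for `blocks`
 * successive cache-line-sized advances of base, 0 otherwise. */
int preload_schemes_agree(uintptr_t base, unsigned blocks)
{
    uintptr_t offset = preload_offset(base);
    for (unsigned i = 0; i < blocks; i++) {
        if (base + offset != preload_addr_recomputed(base))
            return 0;
        base += CACHE_LINE; /* (base & 31) is unchanged, so offset stays valid */
    }
    return 1;
}
```

Under those assumptions both schemes preload the same cache line; the saving
is one instruction per preload, at the cost of keeping SCRATCH live across
the pixel processing.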
Now the lowlevel-blt-bench L1 test results are comparable to (even slightly
better than) the C fast path implementation which uses the system memcpy.
Here is some analysis of the combined effect of the two parts of this patch
series using lowlevel-blt-bench, listing only those tests that show a
statistically significant change, for brevity:
src_n_8888, src_n_0565, src_n_8, src_x888_8888, src_0565_8888: no change
src_8888_8888:
         Before          After
      Mean  StdDev    Mean  StdDev  Change  Confidence
L1   349.5    26.1   435.8    34.3   24.7%      100.0%
L2   110.1    10.1   115.0    10.4    4.5%       99.9%
VT    34.3     0.4    34.5     0.5    0.5%       99.6%
src_0565_0565:
         Before          After
      Mean  StdDev    Mean  StdDev  Change  Confidence
L1   309.8    19.7   394.0    23.7   27.2%      100.0%
L2   117.4     5.7   128.4     5.0    9.4%      100.0%
HT    51.8     0.8    52.2     0.9    0.8%      100.0%
VT    45.9     0.7    46.3     0.7    0.7%       99.9%
R     40.5     0.5    41.1     0.6    1.5%      100.0%
RT    12.1     0.5    13.0     0.4    6.8%      100.0%
src_8_8:
         Before          After
      Mean  StdDev    Mean  StdDev  Change  Confidence
L1   627.3    29.0   759.3    70.5   21.1%      100.0%
L2   235.8     5.5   260.5     8.8   10.5%      100.0%
HT    59.6     1.3    62.1     1.0    4.2%      100.0%
R     45.1     0.7    48.2     0.8    6.8%      100.0%
RT    12.0     0.3    12.9     0.4    7.4%      100.0%
add_8_8:
         Before          After
      Mean  StdDev    Mean  StdDev  Change  Confidence
L1   562.1    60.6   541.3    38.9   -3.7%       99.6%
HT    35.9     0.4    43.3     0.6   20.8%      100.0%
VT    34.4     0.5    39.3     0.5   14.1%      100.0%
R     28.5     0.3    35.4     0.4   24.2%      100.0%
RT     9.0     0.2    10.2     0.3   13.0%      100.0%
over_8888_8888:
         Before          After
      Mean  StdDev    Mean  StdDev  Change  Confidence
L1    37.7     0.5    37.9     0.5    0.7%      100.0%
L2    31.0     0.3    30.6     0.5   -1.3%      100.0%
HT    14.4     0.1    15.4     0.1    7.5%      100.0%
VT    13.7     0.1    14.6     0.1    6.1%      100.0%
R     14.3     0.1    15.8     0.1   10.3%      100.0%
RT     6.6     0.1     7.6     0.1   14.8%      100.0%
over_8888_n_8888:
         Before          After
      Mean  StdDev    Mean  StdDev  Change  Confidence
HT    11.4     0.1    11.9     0.2    4.5%      100.0%
VT    10.9     0.1    11.3     0.2    3.6%      100.0%
R     11.1     0.1    11.8     0.1    6.6%      100.0%
RT     5.6     0.1     6.2     0.1   10.7%      100.0%
over_n_8_8888:
         Before          After
      Mean  StdDev    Mean  StdDev  Change  Confidence
HT    12.4     0.3    14.1     0.1   14.0%      100.0%
VT    11.8     0.1    13.5     0.1   15.0%      100.0%
R     10.9     0.1    12.9     0.2   18.0%      100.0%
RT     5.9     0.1     6.5     0.2   10.6%      100.0%
One point to note is that in combination with my patch from 2012-01-08,
there is now only one result that still shows a statistically significant
regression: the src_0565_0565 L2 test. This is quite an oddity,
considering the L1 and M tests both show an improvement, as do the L2
tests for src_8888_8888 and src_8_8 which have identical inner loops
(albeit with different chances of hitting the aligned and unaligned code
paths).
The cairo-perf-trace results are also generally positive.
t-firefox-chalkboard (which was a 3.9% regression) is now improved by
5.6%, to a net 1.5% improvement. The worst regression here is
t-grads-heat-map, 1.6%. This is the sort of percentage by which I
routinely find results vary randomly on repeat runs, so it may be
possible to explain this away as a fluke result.
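For readers who want to check the percentages, the figures quoted above
appear consistent with comparing the min(s) column of the two
cairo-perf-trace listings in this message and expressing the change as a
speed ratio. The helper below is a hypothetical sketch of that arithmetic;
the convention is inferred from the numbers, not stated anywhere in the
patch.

```c
#include <assert.h>

/* Percentage change in speed when a trace's run time goes from before_s to
 * after_s seconds. Positive means faster, negative means slower.
 * Assumed convention: speed ratio (before/after) minus one. */
double speed_change_percent(double before_s, double after_s)
{
    return (before_s / after_s - 1.0) * 100.0;
}
```

With the min(s) values listed, speed_change_percent(37.193, 35.220) gives
roughly 5.6 for t-firefox-chalkboard, and speed_change_percent(3.727, 3.787)
gives roughly -1.6 for t-grads-heat-map.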
This is a reminder of the "after" result in my previous patch:
[ # ] backend test                     min(s) median(s) stddev. count
[ # ] image: pixman 0.29.1
[  0] image   t-swfdec-giant-steps     13.501    13.521   0.10%   6/6
[  1] image   t-firefox-asteroids      10.400    10.424   0.12%   5/6
[  2] image   t-firefox-fishbowl       22.488    22.490   0.01%   5/6
[  3] image   t-firefox-chalkboard     37.193    37.196   0.01%   5/6
[  4] image   t-midori-zoomed           6.316     6.362   0.31%   6/6
[  5] image   t-firefox-scrolling      24.375    24.379   0.01%   4/6
[  6] image   t-poppler                11.517    11.544   0.13%   5/6
[  7] image   t-chromium-tabs           4.226     4.245   0.26%   6/6
[  8] image   t-grads-heat-map          3.727     3.780   0.69%   6/6
[  9] image   t-firefox-canvas-alpha   18.897    19.082   0.64%   6/6
[ 10] image   t-firefox-talos-gfx      27.889    27.950   0.34%   6/6
[ 11] image   t-gnome-terminal-vim     19.411    19.545   0.35%   6/6
[ 12] image   t-firefox-fishtank       19.103    19.112   0.12%   6/6
[ 13] image   t-evolution              11.303    11.340   0.20%   6/6
[ 14] image   t-poppler-reseau         21.678    21.824   0.33%   5/6
[ 15] image   t-firefox-talos-svg      18.909    18.933   0.08%   6/6
[ 16] image   t-firefox-planet-gnome   10.936    10.966   0.15%   6/6
[ 17] image   t-firefox-particles      24.224    24.249   0.07%   6/6
[ 18] image   t-gnome-system-monitor   13.538    13.584   0.46%   6/6
[ 19] image   t-firefox-canvas         16.394    16.410   0.10%   6/6
[ 20] image   t-swfdec-youtube          9.694     9.737   0.31%   6/6
[ 21] image   t-gvim                   18.313    18.334   0.12%   6/6
[ 22] image   t-firefox-paintball      19.364    19.392   0.07%   6/6
[ 23] image   t-xfce4-terminal-a1      22.253    22.409   0.43%   6/6
And these are the results I'm getting after this patch series is applied:
[ # ] backend test                     min(s) median(s) stddev. count
[ # ] image: pixman 0.29.1
[  0] image   t-swfdec-giant-steps     13.501    13.526   0.14%   5/6
[  1] image   t-firefox-asteroids      10.395    10.421   0.14%   6/6
[  2] image   t-firefox-fishbowl       22.564    22.570   0.02%   6/6
[  3] image   t-firefox-chalkboard     35.220    35.234   0.02%   6/6
[  4] image   t-midori-zoomed           6.296     6.322   0.18%   6/6
[  5] image   t-firefox-scrolling      24.367    24.412   0.11%   6/6
[  6] image   t-poppler                11.389    11.432   0.24%   6/6
[  7] image   t-chromium-tabs           4.190     4.226   0.77%   6/6
[  8] image   t-grads-heat-map          3.787     3.794   0.08%   4/6
[  9] image   t-firefox-canvas-alpha   18.791    18.926   0.71%   6/6
[ 10] image   t-firefox-talos-gfx      27.934    28.079   0.30%   6/6
[ 11] image   t-gnome-terminal-vim     19.492    19.636   0.43%   6/6
[ 12] image   t-firefox-fishtank       19.140    19.155   0.04%   5/6
[ 13] image   t-evolution              11.375    11.423   0.27%   6/6
[ 14] image   t-poppler-reseau         21.667    21.794   0.25%   5/6
[ 15] image   t-firefox-talos-svg      18.921    18.933   0.05%   5/6
[ 16] image   t-firefox-planet-gnome   11.023    11.033   0.05%   5/6
[ 17] image   t-firefox-particles      23.925    23.973   0.38%   6/6
[ 18] image   t-gnome-system-monitor   13.376    13.431   0.24%   6/6
[ 19] image   t-firefox-canvas         16.400    16.500   0.38%   6/6
[ 20] image   t-swfdec-youtube          9.621     9.622   0.04%   4/6
[ 21] image   t-gvim                   18.336    18.400   0.18%   6/6
[ 22] image   t-firefox-paintball      19.364    19.376   0.04%   6/6
[ 23] image   t-xfce4-terminal-a1      22.288    22.366   0.35%   6/6
---
pixman/pixman-arm-simd-asm.S | 53 +++++++++++++-----
pixman/pixman-arm-simd-asm.h | 122 +++++++++++++++++++++++++++---------------
2 files changed, 118 insertions(+), 57 deletions(-)
diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
index f043826..a380d8b 100644
--- a/pixman/pixman-arm-simd-asm.S
+++ b/pixman/pixman-arm-simd-asm.S
@@ -51,39 +51,64 @@
* preload If outputting 16 bytes causes 64 bytes to be read, whether an extra preload should be output
*/
+.macro blit_init
+ line_saved_regs STRIDE_D, STRIDE_S
+.endm
+
.macro blit_process_head cond, numbytes, firstreg, unaligned_src, unaligned_mask, preload
pixld cond, numbytes, firstreg, SRC, unaligned_src
.endm
+.macro blit_inner_loop process_head, process_tail, unaligned_src, unaligned_mask, dst_alignment
+ WK4 .req STRIDE_D
+ WK5 .req STRIDE_S
+ WK6 .req MASK
+ WK7 .req STRIDE_M
+110: pixld , 16, 0, SRC, unaligned_src
+ pixld , 16, 4, SRC, unaligned_src
+ pld [SRC, SCRATCH]
+ pixst , 16, 0, DST
+ pixst , 16, 4, DST
+ subs X, X, #32*8/src_bpp
+ bhs 110b
+ .unreq WK4
+ .unreq WK5
+ .unreq WK6
+ .unreq WK7
+.endm
+
generate_composite_function \
pixman_composite_src_8888_8888_asm_armv6, 32, 0, 32, \
- FLAG_DST_WRITEONLY | FLAG_COND_EXEC, \
+ FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_SPILL_LINE_VARS_WIDE | FLAG_PROCESS_PRESERVES_SCRATCH, \
3, /* prefetch distance */ \
- nop_macro, /* init */ \
+ blit_init, \
nop_macro, /* newline */ \
nop_macro, /* cleanup */ \
blit_process_head, \
- nop_macro /* process tail */
+ nop_macro, /* process tail */ \
+ blit_inner_loop
generate_composite_function \
pixman_composite_src_0565_0565_asm_armv6, 16, 0, 16, \
- FLAG_DST_WRITEONLY | FLAG_COND_EXEC, \
+ FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_SPILL_LINE_VARS_WIDE | FLAG_PROCESS_PRESERVES_SCRATCH, \
3, /* prefetch distance */ \
- nop_macro, /* init */ \
+ blit_init, \
nop_macro, /* newline */ \
nop_macro, /* cleanup */ \
blit_process_head, \
- nop_macro /* process tail */
+ nop_macro, /* process tail */ \
+ blit_inner_loop
generate_composite_function \
pixman_composite_src_8_8_asm_armv6, 8, 0, 8, \
- FLAG_DST_WRITEONLY | FLAG_COND_EXEC, \
+ FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_SPILL_LINE_VARS_WIDE | FLAG_PROCESS_PRESERVES_SCRATCH, \
3, /* prefetch distance */ \
- nop_macro, /* init */ \
+ blit_init, \
nop_macro, /* newline */ \
nop_macro, /* cleanup */ \
blit_process_head, \
- nop_macro /* process tail */
+ nop_macro, /* process tail */ \
+ blit_inner_loop
/******************************************************************************/
@@ -125,7 +150,7 @@ generate_composite_function \
generate_composite_function \
pixman_composite_src_n_8888_asm_armv6, 0, 0, 32, \
- FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_PROCESS_PRESERVES_PSR | FLAG_PROCESS_DOES_STORE \
+ FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_PROCESS_PRESERVES_PSR | FLAG_PROCESS_DOES_STORE | FLAG_PROCESS_PRESERVES_SCRATCH \
0, /* prefetch distance doesn't apply */ \
src_n_8888_init \
nop_macro, /* newline */ \
@@ -135,7 +160,7 @@ generate_composite_function \
generate_composite_function \
pixman_composite_src_n_0565_asm_armv6, 0, 0, 16, \
- FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_PROCESS_PRESERVES_PSR | FLAG_PROCESS_DOES_STORE \
+ FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_PROCESS_PRESERVES_PSR | FLAG_PROCESS_DOES_STORE | FLAG_PROCESS_PRESERVES_SCRATCH \
0, /* prefetch distance doesn't apply */ \
src_n_0565_init \
nop_macro, /* newline */ \
@@ -145,7 +170,7 @@ generate_composite_function \
generate_composite_function \
pixman_composite_src_n_8_asm_armv6, 0, 0, 8, \
- FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_PROCESS_PRESERVES_PSR | FLAG_PROCESS_DOES_STORE \
+ FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_PROCESS_PRESERVES_PSR | FLAG_PROCESS_DOES_STORE | FLAG_PROCESS_PRESERVES_SCRATCH \
0, /* prefetch distance doesn't apply */ \
src_n_8_init \
nop_macro, /* newline */ \
@@ -176,7 +201,7 @@ generate_composite_function \
generate_composite_function \
pixman_composite_src_x888_8888_asm_armv6, 32, 0, 32, \
- FLAG_DST_WRITEONLY | FLAG_COND_EXEC, \
+ FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_PROCESS_PRESERVES_SCRATCH, \
3, /* prefetch distance */ \
nop_macro, /* init */ \
nop_macro, /* newline */ \
@@ -315,7 +340,7 @@ generate_composite_function \
generate_composite_function \
pixman_composite_add_8_8_asm_armv6, 8, 0, 8, \
- FLAG_DST_READWRITE | FLAG_BRANCH_OVER, \
+ FLAG_DST_READWRITE | FLAG_BRANCH_OVER | FLAG_PROCESS_PRESERVES_SCRATCH, \
2, /* prefetch distance */ \
nop_macro, /* init */ \
nop_macro, /* newline */ \
diff --git a/pixman/pixman-arm-simd-asm.h b/pixman/pixman-arm-simd-asm.h
index ee70131..0ecdc7a 100644
--- a/pixman/pixman-arm-simd-asm.h
+++ b/pixman/pixman-arm-simd-asm.h
@@ -42,10 +42,12 @@
.set FLAG_PROCESS_CORRUPTS_PSR, 4
.set FLAG_PROCESS_DOESNT_STORE, 0
.set FLAG_PROCESS_DOES_STORE, 8 /* usually because it needs to conditionally skip it */
-.set FLAG_NO_SPILL_LINE_VARS, 0
-.set FLAG_SPILL_LINE_VARS, 16
-.set FLAG_PRELOAD_ALL_WIDTHS, 0
-.set FLAG_ONLY_PRELOAD_WIDE, 32
+.set FLAG_NO_SPILL_LINE_VARS, 0
+.set FLAG_SPILL_LINE_VARS_WIDE, 16
+.set FLAG_SPILL_LINE_VARS_NON_WIDE, 32
+.set FLAG_SPILL_LINE_VARS, 48
+.set FLAG_PROCESS_CORRUPTS_SCRATCH, 0
+.set FLAG_PROCESS_PRESERVES_SCRATCH, 64
/*
* Offset into stack where mask and source pointer/stride can be accessed.
@@ -184,12 +186,16 @@
.endm
#define IS_END_OF_GROUP(INDEX,SIZE) ((SIZE) < 2 || ((INDEX) & ~((INDEX)+1)) & ((SIZE)/2))
-.macro preload_middle bpp, base
+.macro preload_middle bpp, base, scratch_holds_offset
.if bpp > 0
/* prefetch distance = 256/bpp, stm distance = 128/dst_w_bpp */
.if IS_END_OF_GROUP(SUBBLOCK,256/128*dst_w_bpp/bpp)
+ .if scratch_holds_offset
+ PF pld, [base, SCRATCH]
+ .else
PF bic, SCRATCH, base, #31
PF pld, [SCRATCH, #32*prefetch_distance]
+ .endif
.endif
.endif
.endm
@@ -360,8 +366,14 @@
.set SUBBLOCK, 0 /* this is a count of STMs; there can be up to 8 STMs per block */
.rept pix_per_block*dst_w_bpp/128
process_head , 16, 0, unaligned_src, unaligned_mask, 1
- preload_middle src_bpp, SRC
- preload_middle mask_bpp, MASK
+ .if (src_bpp > 0) && (mask_bpp == 0) && ((flags) & FLAG_PROCESS_PRESERVES_SCRATCH)
+ preload_middle src_bpp, SRC, 1
+ .elseif (src_bpp == 0) && (mask_bpp > 0) && ((flags) & FLAG_PROCESS_PRESERVES_SCRATCH)
+ preload_middle mask_bpp, MASK, 1
+ .else
+ preload_middle src_bpp, SRC, 0
+ preload_middle mask_bpp, MASK, 0
+ .endif
.if (dst_r_bpp > 0) && ((SUBBLOCK % 2) == 0)
/* Because we know that writes are 16-byte aligned, it's relatively easy to ensure that
* destination prefetches are 32-byte aligned. It's also the easiest channel to offset
@@ -380,16 +392,16 @@
bhs 110b
.endm
-.macro wide_case_inner_loop_and_trailing_pixels process_head, process_tail, exit_label, unaligned_src, unaligned_mask
+.macro wide_case_inner_loop_and_trailing_pixels process_head, process_tail, process_inner_loop, exit_label, unaligned_src, unaligned_mask
/* Destination now 16-byte aligned; we have at least one block before we have to stop preloading */
.if dst_r_bpp > 0
tst DST, #16
bne 111f
- wide_case_inner_loop process_head, process_tail, unaligned_src, unaligned_mask, 16
+ process_inner_loop process_head, process_tail, unaligned_src, unaligned_mask, 16
b 112f
111:
.endif
- wide_case_inner_loop process_head, process_tail, unaligned_src, unaligned_mask, 0
+ process_inner_loop process_head, process_tail, unaligned_src, unaligned_mask, 0
112:
/* Just before the final (prefetch_distance+1) 32-byte blocks, deal with final preloads */
.if (src_bpp*pix_per_block > 256) || (mask_bpp*pix_per_block > 256) || (dst_r_bpp*pix_per_block > 256)
@@ -400,10 +412,10 @@
preload_trailing dst_r_bpp, dst_bpp_shift, DST
add X, X, #(prefetch_distance+2)*pix_per_block - 128/dst_w_bpp
/* The remainder of the line is handled identically to the medium case */
- medium_case_inner_loop_and_trailing_pixels process_head, process_tail, exit_label, unaligned_src, unaligned_mask
+ medium_case_inner_loop_and_trailing_pixels process_head, process_tail,, exit_label, unaligned_src, unaligned_mask
.endm
-.macro medium_case_inner_loop_and_trailing_pixels process_head, process_tail, exit_label, unaligned_src, unaligned_mask
+.macro medium_case_inner_loop_and_trailing_pixels process_head, process_tail, unused, exit_label, unaligned_src, unaligned_mask
120:
process_head , 16, 0, unaligned_src, unaligned_mask, 0
process_tail , 16, 0
@@ -418,7 +430,7 @@
trailing_15bytes process_head, process_tail, unaligned_src, unaligned_mask
.endm
-.macro narrow_case_inner_loop_and_trailing_pixels process_head, process_tail, exit_label, unaligned_src, unaligned_mask
+.macro narrow_case_inner_loop_and_trailing_pixels process_head, process_tail, unused, exit_label, unaligned_src, unaligned_mask
tst X, #16*8/dst_w_bpp
conditional_process1 ne, process_head, process_tail, 16, 0, unaligned_src, unaligned_mask, 0
/* Trailing pixels */
@@ -426,7 +438,7 @@
trailing_15bytes process_head, process_tail, unaligned_src, unaligned_mask
.endm
-.macro switch_on_alignment action, process_head, process_tail, exit_label
+.macro switch_on_alignment action, process_head, process_tail, process_inner_loop, exit_label
/* Note that if we're reading the destination, it's already guaranteed to be aligned at this point */
.if mask_bpp == 8 || mask_bpp == 16
tst MASK, #3
@@ -436,11 +448,11 @@
tst SRC, #3
bne 140f
.endif
- action process_head, process_tail, exit_label, 0, 0
+ action process_head, process_tail, process_inner_loop, exit_label, 0, 0
.if src_bpp == 8 || src_bpp == 16
b exit_label
140:
- action process_head, process_tail, exit_label, 1, 0
+ action process_head, process_tail, process_inner_loop, exit_label, 1, 0
.endif
.if mask_bpp == 8 || mask_bpp == 16
b exit_label
@@ -449,24 +461,24 @@
tst SRC, #3
bne 142f
.endif
- action process_head, process_tail, exit_label, 0, 1
+ action process_head, process_tail, process_inner_loop, exit_label, 0, 1
.if src_bpp == 8 || src_bpp == 16
b exit_label
142:
- action process_head, process_tail, exit_label, 1, 1
+ action process_head, process_tail, process_inner_loop, exit_label, 1, 1
.endif
.endif
.endm
-.macro end_of_line restore_x, loop_label, last_one
- .if (flags) & FLAG_SPILL_LINE_VARS
+.macro end_of_line restore_x, vars_spilled, loop_label, last_one
+ .if vars_spilled
/* Sadly, GAS doesn't seem have an equivalent of the DCI directive? */
/* This is ldmia sp,{} */
.word 0xE89D0000 | LINE_SAVED_REGS
.endif
subs Y, Y, #1
- .if (flags) & FLAG_SPILL_LINE_VARS
+ .if vars_spilled
.if (LINE_SAVED_REGS) & (1<<1)
str Y, [sp]
.endif
@@ -483,7 +495,15 @@
.endif
bhs loop_label
.ifc "last_one",""
- b 199f
+ .if vars_spilled
+ b 197f
+ .else
+ b 198f
+ .endif
+ .else
+ .if (!vars_spilled) && ((flags) & FLAG_SPILL_LINE_VARS)
+ b 198f
+ .endif
.endif
.endm
@@ -498,7 +518,8 @@
newline, \
cleanup, \
process_head, \
- process_tail
+ process_tail, \
+ process_inner_loop
.func fname
.global fname
@@ -621,13 +642,13 @@
fname:
push {r4-r11, lr} /* save all registers */
-#ifdef DEBUG_PARAMS
- push {r0-r7,pc}
-#endif
-
subs Y, Y, #1
blo 199f
+#ifdef DEBUG_PARAMS
+ sub sp, sp, #9*4
+#endif
+
.if src_bpp > 0
ldr SRC, [sp, #ARGS_STACK_OFFSET]
ldr STRIDE_S, [sp, #ARGS_STACK_OFFSET+4]
@@ -637,6 +658,12 @@ fname:
ldr STRIDE_M, [sp, #ARGS_STACK_OFFSET+12]
.endif
+#ifdef DEBUG_PARAMS
+ add Y, Y, #1
+ stmia sp, {r0-r7,pc}
+ sub Y, Y, #1
+#endif
+
init
lsl STRIDE_D, #dst_bpp_shift /* stride in bytes */
@@ -664,7 +691,7 @@ fname:
* (prefetch_distance+1) complete blocks to go. */
sub X, X, #(prefetch_distance+2)*pix_per_block
mov ORIG_W, X
- .if (flags) & FLAG_SPILL_LINE_VARS
+ .if (flags) & FLAG_SPILL_LINE_VARS_WIDE
/* This is stmdb sp!,{} */
.word 0xE92D0000 | LINE_SAVED_REGS
.endif
@@ -688,27 +715,36 @@ fname:
leading_15bytes process_head, process_tail
154: /* Destination now 16-byte aligned; we have at least one prefetch on each channel as well as at least one 16-byte output block */
- switch_on_alignment wide_case_inner_loop_and_trailing_pixels, process_head, process_tail, 157f
+ .if (src_bpp > 0) && (mask_bpp == 0) && ((flags) & FLAG_PROCESS_PRESERVES_SCRATCH)
+ and SCRATCH, SRC, #31
+ rsb SCRATCH, SCRATCH, #32*prefetch_distance
+ .elseif (src_bpp == 0) && (mask_bpp > 0) && ((flags) & FLAG_PROCESS_PRESERVES_SCRATCH)
+ and SCRATCH, MASK, #31
+ rsb SCRATCH, SCRATCH, #32*prefetch_distance
+ .endif
+ .ifc "process_inner_loop",""
+ switch_on_alignment wide_case_inner_loop_and_trailing_pixels, process_head, process_tail, wide_case_inner_loop, 157f
+ .else
+ switch_on_alignment wide_case_inner_loop_and_trailing_pixels, process_head, process_tail, process_inner_loop, 157f
+ .endif
157: /* Check for another line */
- end_of_line 1, 151b
+ end_of_line 1, %((flags) & FLAG_SPILL_LINE_VARS_WIDE), 151b
.endif
.ltorg
160: /* Medium case */
mov ORIG_W, X
- .if (flags) & FLAG_SPILL_LINE_VARS
+ .if (flags) & FLAG_SPILL_LINE_VARS_NON_WIDE
/* This is stmdb sp!,{} */
.word 0xE92D0000 | LINE_SAVED_REGS
.endif
161: /* New line */
newline
- .if ((flags) & FLAG_ONLY_PRELOAD_WIDE) == 0
preload_line 0, src_bpp, src_bpp_shift, SRC /* in: X, corrupts: WK0-WK1 */
preload_line 0, mask_bpp, mask_bpp_shift, MASK
preload_line 0, dst_r_bpp, dst_bpp_shift, DST
- .endif
sub X, X, #128/dst_w_bpp /* simplifies inner loop termination */
tst DST, #15
@@ -718,10 +754,10 @@ fname:
leading_15bytes process_head, process_tail
164: /* Destination now 16-byte aligned; we have at least one 16-byte output block */
- switch_on_alignment medium_case_inner_loop_and_trailing_pixels, process_head, process_tail, 167f
+ switch_on_alignment medium_case_inner_loop_and_trailing_pixels, process_head, process_tail,, 167f
167: /* Check for another line */
- end_of_line 1, 161b
+ end_of_line 1, %((flags) & FLAG_SPILL_LINE_VARS_NON_WIDE), 161b
.ltorg
@@ -729,17 +765,15 @@ fname:
.if dst_w_bpp < 32
mov ORIG_W, X
.endif
- .if (flags) & FLAG_SPILL_LINE_VARS
+ .if (flags) & FLAG_SPILL_LINE_VARS_NON_WIDE
/* This is stmdb sp!,{} */
.word 0xE92D0000 | LINE_SAVED_REGS
.endif
171: /* New line */
newline
- .if ((flags) & FLAG_ONLY_PRELOAD_WIDE) == 0
preload_line 1, src_bpp, src_bpp_shift, SRC /* in: X, corrupts: WK0-WK1 */
preload_line 1, mask_bpp, mask_bpp_shift, MASK
preload_line 1, dst_r_bpp, dst_bpp_shift, DST
- .endif
.if dst_w_bpp == 8
tst DST, #3
@@ -766,20 +800,22 @@ fname:
.endif
174: /* Destination now 4-byte aligned; we have 0 or more output bytes to go */
- switch_on_alignment narrow_case_inner_loop_and_trailing_pixels, process_head, process_tail, 177f
+ switch_on_alignment narrow_case_inner_loop_and_trailing_pixels, process_head, process_tail,, 177f
177: /* Check for another line */
- end_of_line %(dst_w_bpp < 32), 171b, last_one
-
-199:
- cleanup
+ end_of_line %(dst_w_bpp < 32), %((flags) & FLAG_SPILL_LINE_VARS_NON_WIDE), 171b, last_one
+197:
.if (flags) & FLAG_SPILL_LINE_VARS
add sp, sp, #LINE_SAVED_REG_COUNT*4
.endif
+198:
+ cleanup
+
#ifdef DEBUG_PARAMS
add sp, sp, #9*4 /* junk the debug copy of arguments */
#endif
+199:
pop {r4-r11, pc} /* exit */
.ltorg
--
1.7.5.4