[Pixman] [PATCH 2/2] ARMv6: Speed improvement to L1 cache constrained blits.

Mon Jan 14 11:16:41 PST 2013

This was achieved by:
* if only one of source and mask was being prefetched and the pixel
  processing doesn't use the SCRATCH register, using it to precalculate the
  preload offset such that it falls in the L2 cache sweet spot
* enabling the line variables to be spilled to the stack but only for
  the wide case, rather than in all cases
* enabling the central loop to be entirely replaced on per-operation basis;
  this benefits blits in particular due to being able to use alternate
  banks of 4 registers to avoid interlocks (but it should be noted that
  these stalls are totally subsumed by the access latency of the L2 cache,
  let alone main memory)
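
The arithmetic behind the first point can be sketched as follows. This is
only an illustration in Python, not code from the patch: the helper names
are hypothetical, and it assumes the pointer advances by a whole number of
32-byte cache lines between preloads, so that its low five bits stay the
same as when the offset was calculated.

```python
CACHE_LINE = 32  # bytes; the assembly masks addresses with #31


def preload_addr_per_iteration(base, prefetch_distance):
    """Old scheme, two instructions per preload:
    bic SCRATCH, base, #31 ; pld [SCRATCH, #32*prefetch_distance]
    """
    return (base & ~(CACHE_LINE - 1)) + CACHE_LINE * prefetch_distance


def precalculated_offset(base, prefetch_distance):
    """New scheme, done once before the inner loop:
    and SCRATCH, SRC, #31 ; rsb SCRATCH, SCRATCH, #32*prefetch_distance
    """
    return CACHE_LINE * prefetch_distance - (base & (CACHE_LINE - 1))


def preload_addr_with_offset(base, offset):
    """New scheme, one instruction per preload: pld [base, SCRATCH]"""
    return base + offset


if __name__ == "__main__":
    src = 0x1005                          # pointer 5 bytes into a cache line
    offset = precalculated_offset(src, 3)
    for line in range(4):                 # pointer advances one line at a time
        p = src + CACHE_LINE * line
        same = preload_addr_with_offset(p, offset) == preload_addr_per_iteration(p, 3)
        print(hex(p), hex(preload_addr_with_offset(p, offset)), same)
```

Both schemes yield the same cache-line-aligned preload address, but the
second needs SCRATCH to survive the pixel processing, which is what
FLAG_PROCESS_PRESERVES_SCRATCH declares.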

Now the lowlevel-blt-bench L1 test results are comparable to (even slightly
better than) those of the C fast path implementation, which uses the system
memcpy.

Here is some analysis of the combined effect of the two parts of this patch
series using lowlevel-blt-bench. For brevity, only those tests that show a
statistically significant change are broken down in full:

src_n_8888, src_n_0565, src_n_8, src_x888_8888, src_0565_8888: no change

src_8888_8888:

        Before         After
      Mean  StdDev   Mean StdDev  Change  Confidence
L1    349.5   26.1  435.8   34.3   24.7%  100.0%
L2    110.1   10.1  115.0   10.4    4.5%  99.9%
VT     34.3    0.4   34.5    0.5    0.5%  99.6%

src_0565_0565:

        Before         After
      Mean  StdDev   Mean StdDev  Change  Confidence
L1    309.8   19.7  394.0   23.7   27.2%  100.0%
L2    117.4    5.7  128.4    5.0    9.4%  100.0%
HT     51.8    0.8   52.2    0.9    0.8%  100.0%
VT     45.9    0.7   46.3    0.7    0.7%  99.9%
R      40.5    0.5   41.1    0.6    1.5%  100.0%
RT     12.1    0.5   13.0    0.4    6.8%  100.0%

src_8_8:

        Before         After
      Mean  StdDev   Mean StdDev  Change  Confidence
L1    627.3   29.0  759.3   70.5   21.1%  100.0%
L2    235.8    5.5  260.5    8.8   10.5%  100.0%
HT     59.6    1.3   62.1    1.0    4.2%  100.0%
R      45.1    0.7   48.2    0.8    6.8%  100.0%
RT     12.0    0.3   12.9    0.4    7.4%  100.0%

add_8_8:

        Before         After
      Mean  StdDev   Mean StdDev  Change  Confidence
L1    562.1   60.6  541.3   38.9   -3.7%  99.6%
HT     35.9    0.4   43.3    0.6   20.8%  100.0%
VT     34.4    0.5   39.3    0.5   14.1%  100.0%
R      28.5    0.3   35.4    0.4   24.2%  100.0%
RT      9.0    0.2   10.2    0.3   13.0%  100.0%

over_8888_8888:

        Before         After
      Mean  StdDev   Mean StdDev  Change  Confidence
L1     37.7    0.5   37.9    0.5    0.7%  100.0%
L2     31.0    0.3   30.6    0.5   -1.3%  100.0%
HT     14.4    0.1   15.4    0.1    7.5%  100.0%
VT     13.7    0.1   14.6    0.1    6.1%  100.0%
R      14.3    0.1   15.8    0.1   10.3%  100.0%
RT      6.6    0.1    7.6    0.1   14.8%  100.0%

over_8888_n_8888:

        Before         After
      Mean  StdDev   Mean StdDev  Change  Confidence
HT     11.4    0.1   11.9    0.2    4.5%  100.0%
VT     10.9    0.1   11.3    0.2    3.6%  100.0%
R      11.1    0.1   11.8    0.1    6.6%  100.0%
RT      5.6    0.1    6.2    0.1   10.7%  100.0%

over_n_8_8888:

        Before         After
      Mean  StdDev   Mean StdDev  Change  Confidence
HT     12.4    0.3   14.1    0.1   14.0%  100.0%
VT     11.8    0.1   13.5    0.1   15.0%  100.0%
R      10.9    0.1   12.9    0.2   18.0%  100.0%
RT      5.9    0.1    6.5    0.2   10.6%  100.0%

One point to note is that in combination with my patch from 2012-01-08,
there is now only one result that still shows a statistically significant
regression: the src_0565_0565 L2 test. This is quite an oddity,
considering the L1 and M tests both show an improvement, as do the L2
tests for src_8888_8888 and src_8_8 which have identical inner loops
(albeit with different chances of hitting the aligned and unaligned code
paths).

The cairo-perf-trace results are also generally positive.
t-firefox-chalkboard (which was a 3.9% regression) is now improved by
5.6%, to a net 1.5% improvement. The worst regression here is
t-grads-heat-map, 1.6%. This is the sort of percentage by which I
routinely find results vary randomly on repeat runs, so it may be
possible to explain this away as a fluke result.
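
For anyone wanting to reproduce these percentages from the tables below,
here is a minimal sketch in Python. It assumes the figures are speed
ratios taken from the min(s) column; that is an inference from the
numbers, and the function name is hypothetical.

```python
def speed_change_percent(before_s, after_s):
    """Change in speed between two timings, in percent.

    Positive means the "after" run is faster; negative means a regression.
    Assumes speed is the reciprocal of the elapsed time.
    """
    return (before_s / after_s - 1.0) * 100.0


if __name__ == "__main__":
    # min(s) values copied from the two cairo-perf-trace tables
    print("t-firefox-chalkboard: %+.1f%%" % speed_change_percent(37.193, 35.220))
    print("t-grads-heat-map:     %+.1f%%" % speed_change_percent(3.727, 3.787))
```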

This is a reminder of the "after" result in my previous patch:

[ # ]  backend                  test   min(s) median(s) stddev. count
[ # ]  image: pixman 0.29.1
[  0]  image    t-swfdec-giant-steps   13.501   13.521   0.10%    6/6
[  1]  image     t-firefox-asteroids   10.400   10.424   0.12%    5/6
[  2]  image      t-firefox-fishbowl   22.488   22.490   0.01%    5/6
[  3]  image    t-firefox-chalkboard   37.193   37.196   0.01%    5/6
[  4]  image         t-midori-zoomed    6.316    6.362   0.31%    6/6
[  5]  image     t-firefox-scrolling   24.375   24.379   0.01%    4/6
[  6]  image               t-poppler   11.517   11.544   0.13%    5/6
[  7]  image         t-chromium-tabs    4.226    4.245   0.26%    6/6
[  8]  image        t-grads-heat-map    3.727    3.780   0.69%    6/6
[  9]  image  t-firefox-canvas-alpha   18.897   19.082   0.64%    6/6
[ 10]  image     t-firefox-talos-gfx   27.889   27.950   0.34%    6/6
[ 11]  image    t-gnome-terminal-vim   19.411   19.545   0.35%    6/6
[ 12]  image      t-firefox-fishtank   19.103   19.112   0.12%    6/6
[ 13]  image             t-evolution   11.303   11.340   0.20%    6/6
[ 14]  image        t-poppler-reseau   21.678   21.824   0.33%    5/6
[ 15]  image     t-firefox-talos-svg   18.909   18.933   0.08%    6/6
[ 16]  image  t-firefox-planet-gnome   10.936   10.966   0.15%    6/6
[ 17]  image     t-firefox-particles   24.224   24.249   0.07%    6/6
[ 18]  image  t-gnome-system-monitor   13.538   13.584   0.46%    6/6
[ 19]  image        t-firefox-canvas   16.394   16.410   0.10%    6/6
[ 20]  image        t-swfdec-youtube    9.694    9.737   0.31%    6/6
[ 21]  image                  t-gvim   18.313   18.334   0.12%    6/6
[ 22]  image     t-firefox-paintball   19.364   19.392   0.07%    6/6
[ 23]  image     t-xfce4-terminal-a1   22.253   22.409   0.43%    6/6

And these are the results I'm getting after this patch series is applied:

[ # ]  backend                  test   min(s) median(s) stddev. count
[ # ]  image: pixman 0.29.1
[  0]  image    t-swfdec-giant-steps   13.501   13.526   0.14%    5/6
[  1]  image     t-firefox-asteroids   10.395   10.421   0.14%    6/6
[  2]  image      t-firefox-fishbowl   22.564   22.570   0.02%    6/6
[  3]  image    t-firefox-chalkboard   35.220   35.234   0.02%    6/6
[  4]  image         t-midori-zoomed    6.296    6.322   0.18%    6/6
[  5]  image     t-firefox-scrolling   24.367   24.412   0.11%    6/6
[  6]  image               t-poppler   11.389   11.432   0.24%    6/6
[  7]  image         t-chromium-tabs    4.190    4.226   0.77%    6/6
[  8]  image        t-grads-heat-map    3.787    3.794   0.08%    4/6
[  9]  image  t-firefox-canvas-alpha   18.791   18.926   0.71%    6/6
[ 10]  image     t-firefox-talos-gfx   27.934   28.079   0.30%    6/6
[ 11]  image    t-gnome-terminal-vim   19.492   19.636   0.43%    6/6
[ 12]  image      t-firefox-fishtank   19.140   19.155   0.04%    5/6
[ 13]  image             t-evolution   11.375   11.423   0.27%    6/6
[ 14]  image        t-poppler-reseau   21.667   21.794   0.25%    5/6
[ 15]  image     t-firefox-talos-svg   18.921   18.933   0.05%    5/6
[ 16]  image  t-firefox-planet-gnome   11.023   11.033   0.05%    5/6
[ 17]  image     t-firefox-particles   23.925   23.973   0.38%    6/6
[ 18]  image  t-gnome-system-monitor   13.376   13.431   0.24%    6/6
[ 19]  image        t-firefox-canvas   16.400   16.500   0.38%    6/6
[ 20]  image        t-swfdec-youtube    9.621    9.622   0.04%    4/6
[ 21]  image                  t-gvim   18.336   18.400   0.18%    6/6
[ 22]  image     t-firefox-paintball   19.364   19.376   0.04%    6/6
[ 23]  image     t-xfce4-terminal-a1   22.288   22.366   0.35%    6/6
---
 pixman/pixman-arm-simd-asm.S |   53 +++++++++++++-----
 pixman/pixman-arm-simd-asm.h |  122 +++++++++++++++++++++++++++---------------
 2 files changed, 118 insertions(+), 57 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
index f043826..a380d8b 100644
--- a/pixman/pixman-arm-simd-asm.S
+++ b/pixman/pixman-arm-simd-asm.S
@@ -51,39 +51,64 @@
  *   preload        If outputting 16 bytes causes 64 bytes to be read, whether an extra preload should be output
  */
 
+.macro blit_init
+        line_saved_regs STRIDE_D, STRIDE_S
+.endm
+
 .macro blit_process_head   cond, numbytes, firstreg, unaligned_src, unaligned_mask, preload
         pixld   cond, numbytes, firstreg, SRC, unaligned_src
 .endm
 
+.macro blit_inner_loop  process_head, process_tail, unaligned_src, unaligned_mask, dst_alignment
+    WK4     .req    STRIDE_D
+    WK5     .req    STRIDE_S
+    WK6     .req    MASK
+    WK7     .req    STRIDE_M
+110:    pixld   , 16, 0, SRC, unaligned_src
+        pixld   , 16, 4, SRC, unaligned_src
+        pld     [SRC, SCRATCH]
+        pixst   , 16, 0, DST
+        pixst   , 16, 4, DST
+        subs    X, X, #32*8/src_bpp
+        bhs     110b
+    .unreq  WK4
+    .unreq  WK5
+    .unreq  WK6
+    .unreq  WK7
+.endm
+
 generate_composite_function \
     pixman_composite_src_8888_8888_asm_armv6, 32, 0, 32, \
-    FLAG_DST_WRITEONLY | FLAG_COND_EXEC, \
+    FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_SPILL_LINE_VARS_WIDE | FLAG_PROCESS_PRESERVES_SCRATCH, \
     3, /* prefetch distance */ \
-    nop_macro, /* init */ \
+    blit_init, \
     nop_macro, /* newline */ \
     nop_macro, /* cleanup */ \
     blit_process_head, \
-    nop_macro /* process tail */
+    nop_macro, /* process tail */ \
+    blit_inner_loop
 
 generate_composite_function \
     pixman_composite_src_0565_0565_asm_armv6, 16, 0, 16, \
-    FLAG_DST_WRITEONLY | FLAG_COND_EXEC, \
+    FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_SPILL_LINE_VARS_WIDE | FLAG_PROCESS_PRESERVES_SCRATCH, \
     3, /* prefetch distance */ \
-    nop_macro, /* init */ \
+    blit_init, \
     nop_macro, /* newline */ \
     nop_macro, /* cleanup */ \
     blit_process_head, \
-    nop_macro /* process tail */
+    nop_macro, /* process tail */ \
+    blit_inner_loop
 
 generate_composite_function \
     pixman_composite_src_8_8_asm_armv6, 8, 0, 8, \
-    FLAG_DST_WRITEONLY | FLAG_COND_EXEC, \
+    FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_SPILL_LINE_VARS_WIDE | FLAG_PROCESS_PRESERVES_SCRATCH, \
     3, /* prefetch distance */ \
-    nop_macro, /* init */ \
+    blit_init, \
     nop_macro, /* newline */ \
     nop_macro, /* cleanup */ \
     blit_process_head, \
-    nop_macro /* process tail */
+    nop_macro, /* process tail */ \
+    blit_inner_loop
 
 /******************************************************************************/
 
@@ -125,7 +150,7 @@ generate_composite_function \
 
 generate_composite_function \
     pixman_composite_src_n_8888_asm_armv6, 0, 0, 32, \
-    FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_PROCESS_PRESERVES_PSR | FLAG_PROCESS_DOES_STORE \
+    FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_PROCESS_PRESERVES_PSR | FLAG_PROCESS_DOES_STORE | FLAG_PROCESS_PRESERVES_SCRATCH \
     0, /* prefetch distance doesn't apply */ \
     src_n_8888_init \
     nop_macro, /* newline */ \
@@ -135,7 +160,7 @@ generate_composite_function \
 
 generate_composite_function \
     pixman_composite_src_n_0565_asm_armv6, 0, 0, 16, \
-    FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_PROCESS_PRESERVES_PSR | FLAG_PROCESS_DOES_STORE \
+    FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_PROCESS_PRESERVES_PSR | FLAG_PROCESS_DOES_STORE | FLAG_PROCESS_PRESERVES_SCRATCH \
     0, /* prefetch distance doesn't apply */ \
     src_n_0565_init \
     nop_macro, /* newline */ \
@@ -145,7 +170,7 @@ generate_composite_function \
 
 generate_composite_function \
     pixman_composite_src_n_8_asm_armv6, 0, 0, 8, \
-    FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_PROCESS_PRESERVES_PSR | FLAG_PROCESS_DOES_STORE \
+    FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_PROCESS_PRESERVES_PSR | FLAG_PROCESS_DOES_STORE | FLAG_PROCESS_PRESERVES_SCRATCH \
     0, /* prefetch distance doesn't apply */ \
     src_n_8_init \
     nop_macro, /* newline */ \
@@ -176,7 +201,7 @@ generate_composite_function \
 
 generate_composite_function \
     pixman_composite_src_x888_8888_asm_armv6, 32, 0, 32, \
-    FLAG_DST_WRITEONLY | FLAG_COND_EXEC, \
+    FLAG_DST_WRITEONLY | FLAG_COND_EXEC | FLAG_PROCESS_PRESERVES_SCRATCH, \
     3, /* prefetch distance */ \
     nop_macro, /* init */ \
     nop_macro, /* newline */ \
@@ -315,7 +340,7 @@ generate_composite_function \
 
 generate_composite_function \
     pixman_composite_add_8_8_asm_armv6, 8, 0, 8, \
-    FLAG_DST_READWRITE | FLAG_BRANCH_OVER, \
+    FLAG_DST_READWRITE | FLAG_BRANCH_OVER | FLAG_PROCESS_PRESERVES_SCRATCH, \
     2, /* prefetch distance */ \
     nop_macro, /* init */ \
     nop_macro, /* newline */ \
diff --git a/pixman/pixman-arm-simd-asm.h b/pixman/pixman-arm-simd-asm.h
index ee70131..0ecdc7a 100644
--- a/pixman/pixman-arm-simd-asm.h
+++ b/pixman/pixman-arm-simd-asm.h
@@ -42,10 +42,12 @@
 .set FLAG_PROCESS_CORRUPTS_PSR,  4
 .set FLAG_PROCESS_DOESNT_STORE,  0
 .set FLAG_PROCESS_DOES_STORE,    8 /* usually because it needs to conditionally skip it */
-.set FLAG_NO_SPILL_LINE_VARS,    0
-.set FLAG_SPILL_LINE_VARS,       16
-.set FLAG_PRELOAD_ALL_WIDTHS,    0
-.set FLAG_ONLY_PRELOAD_WIDE,     32
+.set FLAG_NO_SPILL_LINE_VARS,        0
+.set FLAG_SPILL_LINE_VARS_WIDE,      16
+.set FLAG_SPILL_LINE_VARS_NON_WIDE,  32
+.set FLAG_SPILL_LINE_VARS,           48
+.set FLAG_PROCESS_CORRUPTS_SCRATCH,  0
+.set FLAG_PROCESS_PRESERVES_SCRATCH, 64
 
 /*
  * Offset into stack where mask and source pointer/stride can be accessed.
@@ -184,12 +186,16 @@
 .endm
 
 #define IS_END_OF_GROUP(INDEX,SIZE) ((SIZE) < 2 || ((INDEX) & ~((INDEX)+1)) & ((SIZE)/2))
-.macro preload_middle   bpp, base
+.macro preload_middle   bpp, base, scratch_holds_offset
  .if bpp > 0
         /* prefetch distance = 256/bpp, stm distance = 128/dst_w_bpp */
   .if IS_END_OF_GROUP(SUBBLOCK,256/128*dst_w_bpp/bpp)
+   .if scratch_holds_offset
+        PF  pld,    [base, SCRATCH]
+   .else
         PF  bic,    SCRATCH, base, #31
         PF  pld,    [SCRATCH, #32*prefetch_distance]
+   .endif
   .endif
  .endif
 .endm
@@ -360,8 +366,14 @@
  .set SUBBLOCK, 0 /* this is a count of STMs; there can be up to 8 STMs per block */
  .rept pix_per_block*dst_w_bpp/128
         process_head  , 16, 0, unaligned_src, unaligned_mask, 1
-        preload_middle  src_bpp, SRC
-        preload_middle  mask_bpp, MASK
+  .if (src_bpp > 0) && (mask_bpp == 0) && ((flags) & FLAG_PROCESS_PRESERVES_SCRATCH)
+        preload_middle  src_bpp, SRC, 1
+  .elseif (src_bpp == 0) && (mask_bpp > 0) && ((flags) & FLAG_PROCESS_PRESERVES_SCRATCH)
+        preload_middle  mask_bpp, MASK, 1
+  .else
+        preload_middle  src_bpp, SRC, 0
+        preload_middle  mask_bpp, MASK, 0
+  .endif
   .if (dst_r_bpp > 0) && ((SUBBLOCK % 2) == 0)
         /* Because we know that writes are 16-byte aligned, it's relatively easy to ensure that
          * destination prefetches are 32-byte aligned. It's also the easiest channel to offset
@@ -380,16 +392,16 @@
         bhs     110b
 .endm
 
-.macro wide_case_inner_loop_and_trailing_pixels  process_head, process_tail, exit_label, unaligned_src, unaligned_mask
+.macro wide_case_inner_loop_and_trailing_pixels  process_head, process_tail, process_inner_loop, exit_label, unaligned_src, unaligned_mask
         /* Destination now 16-byte aligned; we have at least one block before we have to stop preloading */
  .if dst_r_bpp > 0
         tst     DST, #16
         bne     111f
-        wide_case_inner_loop  process_head, process_tail, unaligned_src, unaligned_mask, 16
+        process_inner_loop  process_head, process_tail, unaligned_src, unaligned_mask, 16
         b       112f
 111:
  .endif
-        wide_case_inner_loop  process_head, process_tail, unaligned_src, unaligned_mask, 0
+        process_inner_loop  process_head, process_tail, unaligned_src, unaligned_mask, 0
 112:
         /* Just before the final (prefetch_distance+1) 32-byte blocks, deal with final preloads */
  .if (src_bpp*pix_per_block > 256) || (mask_bpp*pix_per_block > 256) || (dst_r_bpp*pix_per_block > 256)
@@ -400,10 +412,10 @@
         preload_trailing  dst_r_bpp, dst_bpp_shift, DST
         add     X, X, #(prefetch_distance+2)*pix_per_block - 128/dst_w_bpp
         /* The remainder of the line is handled identically to the medium case */
-        medium_case_inner_loop_and_trailing_pixels  process_head, process_tail, exit_label, unaligned_src, unaligned_mask
+        medium_case_inner_loop_and_trailing_pixels  process_head, process_tail,, exit_label, unaligned_src, unaligned_mask
 .endm
 
-.macro medium_case_inner_loop_and_trailing_pixels  process_head, process_tail, exit_label, unaligned_src, unaligned_mask
+.macro medium_case_inner_loop_and_trailing_pixels  process_head, process_tail, unused, exit_label, unaligned_src, unaligned_mask
 120:
         process_head  , 16, 0, unaligned_src, unaligned_mask, 0
         process_tail  , 16, 0
@@ -418,7 +430,7 @@
         trailing_15bytes  process_head, process_tail, unaligned_src, unaligned_mask
 .endm
 
-.macro narrow_case_inner_loop_and_trailing_pixels  process_head, process_tail, exit_label, unaligned_src, unaligned_mask
+.macro narrow_case_inner_loop_and_trailing_pixels  process_head, process_tail, unused, exit_label, unaligned_src, unaligned_mask
         tst     X, #16*8/dst_w_bpp
         conditional_process1  ne, process_head, process_tail, 16, 0, unaligned_src, unaligned_mask, 0
         /* Trailing pixels */
@@ -426,7 +438,7 @@
         trailing_15bytes  process_head, process_tail, unaligned_src, unaligned_mask
 .endm
 
-.macro switch_on_alignment  action, process_head, process_tail, exit_label
+.macro switch_on_alignment  action, process_head, process_tail, process_inner_loop, exit_label
  /* Note that if we're reading the destination, it's already guaranteed to be aligned at this point */
  .if mask_bpp == 8 || mask_bpp == 16
         tst     MASK, #3
@@ -436,11 +448,11 @@
         tst     SRC, #3
         bne     140f
   .endif
-        action  process_head, process_tail, exit_label, 0, 0
+        action  process_head, process_tail, process_inner_loop, exit_label, 0, 0
   .if src_bpp == 8 || src_bpp == 16
         b       exit_label
 140:
-        action  process_head, process_tail, exit_label, 1, 0
+        action  process_head, process_tail, process_inner_loop, exit_label, 1, 0
   .endif
  .if mask_bpp == 8 || mask_bpp == 16
         b       exit_label
@@ -449,24 +461,24 @@
         tst     SRC, #3
         bne     142f
   .endif
-        action  process_head, process_tail, exit_label, 0, 1
+        action  process_head, process_tail, process_inner_loop, exit_label, 0, 1
   .if src_bpp == 8 || src_bpp == 16
         b       exit_label
 142:
-        action  process_head, process_tail, exit_label, 1, 1
+        action  process_head, process_tail, process_inner_loop, exit_label, 1, 1
   .endif
  .endif
 .endm
 
 
-.macro end_of_line      restore_x, loop_label, last_one
- .if (flags) & FLAG_SPILL_LINE_VARS
+.macro end_of_line      restore_x, vars_spilled, loop_label, last_one
+ .if vars_spilled
         /* Sadly, GAS doesn't seem have an equivalent of the DCI directive? */
         /* This is ldmia sp,{} */
         .word   0xE89D0000 | LINE_SAVED_REGS
  .endif
         subs    Y, Y, #1
- .if (flags) & FLAG_SPILL_LINE_VARS
+ .if vars_spilled
   .if (LINE_SAVED_REGS) & (1<<1)
         str     Y, [sp]
   .endif
@@ -483,7 +495,15 @@
  .endif
         bhs     loop_label
  .ifc "last_one",""
-        b       199f
+  .if vars_spilled
+        b       197f
+  .else
+        b       198f
+  .endif
+ .else
+  .if (!vars_spilled) && ((flags) & FLAG_SPILL_LINE_VARS)
+        b       198f
+  .endif
  .endif
 .endm
 
@@ -498,7 +518,8 @@
                                    newline, \
                                    cleanup, \
                                    process_head, \
-                                   process_tail
+                                   process_tail, \
+                                   process_inner_loop
 
  .func fname
  .global fname
@@ -621,13 +642,13 @@
 fname:
         push    {r4-r11, lr}        /* save all registers */
 
-#ifdef DEBUG_PARAMS
-        push    {r0-r7,pc}
-#endif
-
         subs    Y, Y, #1
         blo     199f
 
+#ifdef DEBUG_PARAMS
+        sub     sp, sp, #9*4
+#endif
+
  .if src_bpp > 0
         ldr     SRC, [sp, #ARGS_STACK_OFFSET]
         ldr     STRIDE_S, [sp, #ARGS_STACK_OFFSET+4]
@@ -637,6 +658,12 @@ fname:
         ldr     STRIDE_M, [sp, #ARGS_STACK_OFFSET+12]
  .endif
         
+#ifdef DEBUG_PARAMS
+        add     Y, Y, #1
+        stmia   sp, {r0-r7,pc}
+        sub     Y, Y, #1
+#endif
+
         init
         
         lsl     STRIDE_D, #dst_bpp_shift /* stride in bytes */
@@ -664,7 +691,7 @@ fname:
          * (prefetch_distance+1) complete blocks to go. */
         sub     X, X, #(prefetch_distance+2)*pix_per_block
         mov     ORIG_W, X
-  .if (flags) & FLAG_SPILL_LINE_VARS
+  .if (flags) & FLAG_SPILL_LINE_VARS_WIDE
         /* This is stmdb sp!,{} */
         .word   0xE92D0000 | LINE_SAVED_REGS
   .endif
@@ -688,27 +715,36 @@ fname:
         leading_15bytes  process_head, process_tail
         
 154:    /* Destination now 16-byte aligned; we have at least one prefetch on each channel as well as at least one 16-byte output block */
-        switch_on_alignment  wide_case_inner_loop_and_trailing_pixels, process_head, process_tail, 157f
+ .if (src_bpp > 0) && (mask_bpp == 0) && ((flags) & FLAG_PROCESS_PRESERVES_SCRATCH)
+        and     SCRATCH, SRC, #31
+        rsb     SCRATCH, SCRATCH, #32*prefetch_distance
+ .elseif (src_bpp == 0) && (mask_bpp > 0) && ((flags) & FLAG_PROCESS_PRESERVES_SCRATCH)
+        and     SCRATCH, MASK, #31
+        rsb     SCRATCH, SCRATCH, #32*prefetch_distance
+ .endif
+ .ifc "process_inner_loop",""
+        switch_on_alignment  wide_case_inner_loop_and_trailing_pixels, process_head, process_tail, wide_case_inner_loop, 157f
+ .else
+        switch_on_alignment  wide_case_inner_loop_and_trailing_pixels, process_head, process_tail, process_inner_loop, 157f
+ .endif
 
 157:    /* Check for another line */
-        end_of_line 1, 151b
+        end_of_line 1, %((flags) & FLAG_SPILL_LINE_VARS_WIDE), 151b
  .endif
 
  .ltorg
 
 160:    /* Medium case */
         mov     ORIG_W, X
- .if (flags) & FLAG_SPILL_LINE_VARS
+ .if (flags) & FLAG_SPILL_LINE_VARS_NON_WIDE
         /* This is stmdb sp!,{} */
         .word   0xE92D0000 | LINE_SAVED_REGS
  .endif
 161:    /* New line */
         newline
- .if ((flags) & FLAG_ONLY_PRELOAD_WIDE) == 0
         preload_line 0, src_bpp, src_bpp_shift, SRC  /* in: X, corrupts: WK0-WK1 */
         preload_line 0, mask_bpp, mask_bpp_shift, MASK
         preload_line 0, dst_r_bpp, dst_bpp_shift, DST
- .endif
         
         sub     X, X, #128/dst_w_bpp     /* simplifies inner loop termination */
         tst     DST, #15
@@ -718,10 +754,10 @@ fname:
         leading_15bytes  process_head, process_tail
         
 164:    /* Destination now 16-byte aligned; we have at least one 16-byte output block */
-        switch_on_alignment  medium_case_inner_loop_and_trailing_pixels, process_head, process_tail, 167f
+        switch_on_alignment  medium_case_inner_loop_and_trailing_pixels, process_head, process_tail,, 167f
         
 167:    /* Check for another line */
-        end_of_line 1, 161b
+        end_of_line 1, %((flags) & FLAG_SPILL_LINE_VARS_NON_WIDE), 161b
 
  .ltorg
 
@@ -729,17 +765,15 @@ fname:
  .if dst_w_bpp < 32
         mov     ORIG_W, X
  .endif
- .if (flags) & FLAG_SPILL_LINE_VARS
+ .if (flags) & FLAG_SPILL_LINE_VARS_NON_WIDE
         /* This is stmdb sp!,{} */
         .word   0xE92D0000 | LINE_SAVED_REGS
  .endif
 171:    /* New line */
         newline
- .if ((flags) & FLAG_ONLY_PRELOAD_WIDE) == 0
         preload_line 1, src_bpp, src_bpp_shift, SRC  /* in: X, corrupts: WK0-WK1 */
         preload_line 1, mask_bpp, mask_bpp_shift, MASK
         preload_line 1, dst_r_bpp, dst_bpp_shift, DST
- .endif
         
  .if dst_w_bpp == 8
         tst     DST, #3
@@ -766,20 +800,22 @@ fname:
  .endif
 
 174:    /* Destination now 4-byte aligned; we have 0 or more output bytes to go */
-        switch_on_alignment  narrow_case_inner_loop_and_trailing_pixels, process_head, process_tail, 177f
+        switch_on_alignment  narrow_case_inner_loop_and_trailing_pixels, process_head, process_tail,, 177f
 
 177:    /* Check for another line */
-        end_of_line %(dst_w_bpp < 32), 171b, last_one
-
-199:
-        cleanup
+        end_of_line %(dst_w_bpp < 32), %((flags) & FLAG_SPILL_LINE_VARS_NON_WIDE), 171b, last_one
 
+197:
  .if (flags) & FLAG_SPILL_LINE_VARS
         add     sp, sp, #LINE_SAVED_REG_COUNT*4
  .endif
+198:
+        cleanup
+
 #ifdef DEBUG_PARAMS
         add     sp, sp, #9*4 /* junk the debug copy of arguments */
 #endif
+199:
         pop     {r4-r11, pc}  /* exit */
 
  .ltorg
-- 
1.7.5.4