[Pixman] [PATCH 1/5] REPEAT_NORMAL support for nearest bilinear fast path

Siarhei Siamashka siarhei.siamashka at gmail.com
Mon May 23 05:49:02 PDT 2011


On Mon, May 16, 2011 at 3:27 PM, Taekyun Kim <podain77 at gmail.com> wrote:
> Patch 1

I had a look at these patches and tested them a bit. So now I can
provide some (hopefully constructive) feedback.

Would it make sense to handle REPEAT_NORMAL optimizations separately
for nearest and bilinear scaling? I mean, you posted the benchmark
numbers for bilinear scaling, but nearest scaling does not seem to be
covered and it would be interesting to know whether it has actually
improved and how much. Currently nearest scaling is particularly
interesting on ARM devices without NEON instruction support, because
they likely can't afford the overhead of bilinear scaling and have to
fall back to nearest. Also there are some coding style issues in the
patches which would be better to address.

And below are some comments about the code:

-    if (PIXMAN_REPEAT_ ## repeat_mode == PIXMAN_REPEAT_NORMAL)		\
-    {									\
-	/* Clamp repeating positions inside the actual samples */	\
-	max_vx = src_image->bits.width << 16;				\
-	max_vy = src_image->bits.height << 16;				\
-									\
-	repeat (PIXMAN_REPEAT_NORMAL, &vx, max_vx);			\
-	repeat (PIXMAN_REPEAT_NORMAL, &vy, max_vy);			\
-    }									\

-	if (PIXMAN_REPEAT_ ## repeat_mode == PIXMAN_REPEAT_NORMAL) \
-	    repeat (PIXMAN_REPEAT_NORMAL, &vy, max_vy);			\

By removing this code from the nearest scaling template, you partially
remove the functionality which was used for getting normal repeat
support working in the current C implementation of the scanline
functions:
    http://cgit.freedesktop.org/pixman/tree/pixman/pixman-fast-path.h?id=pixman-0.22.0#n118
So if the intention is to get rid of the old normal repeat support for
nearest scaling and replace it with a new one, then it also makes
sense to purge the normal-repeat-related parts from that scanline macro
too. Or alternatively, keep both methods and provide some way to
select which one to use in each particular case (it probably makes
sense to convert the "have_mask" and "mask_is_solid" boolean macro
arguments into a single "flags" argument and use bit flags for such
configuration tweaks; this would also make the code a bit more
readable, with flag identifiers instead of TRUE/FALSE). A benchmark of
one method against the other for different types of operations would
be quite interesting to see.
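
To illustrate the "flags" idea, here is a rough sketch (the flag names
below are made up for illustration only, they are not existing pixman
identifiers):

    /* Hypothetical flag names, just to illustrate the idea: */
    #define SCANLINE_HAVE_MASK          (1 << 0)
    #define SCANLINE_MASK_IS_SOLID      (1 << 1)
    #define SCANLINE_OLD_NORMAL_REPEAT  (1 << 2)

    /* Inside the scanline template the configuration checks would then
     * be spelled out explicitly, e.g.:
     *
     *     if (flags & SCANLINE_OLD_NORMAL_REPEAT)
     *         repeat (PIXMAN_REPEAT_NORMAL, &vy, max_vy);
     *
     * and an instantiation site would pass something like
     * (SCANLINE_HAVE_MASK | SCANLINE_MASK_IS_SOLID) instead of a pair
     * of bare TRUE/FALSE arguments. Since 'flags' is a compile-time
     * constant, the unused branches are still eliminated. */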

+ /* num_pixels = max(n), where  vx + (n - 1)*unit_x < width_fixed */		\
+ num_pixels = (new_src_width_fixed - vx + unit_x - pixman_fixed_e) / unit_x;	\
+ if (num_pixels > width_remain)					\
+     num_pixels = width_remain;					\

I'm also a bit worried about this part. One thing that makes me feel a
bit uneasy is the following expression: "(new_src_width_fixed - vx +
unit_x - pixman_fixed_e)". Can we be sure that it can never overflow
the pixman_fixed_t type? If yes, then some descriptive comment about
the valid range of values and assumptions would be useful here. There
are some checks in pixman code which reject certain ranges of parameters
for the compositing operation and impose some constraints:
    http://cgit.freedesktop.org/pixman/tree/pixman/pixman.c?id=pixman-0.22.0#n556
    http://cgit.freedesktop.org/pixman/tree/pixman/pixman.c?id=pixman-0.22.0#n572
    http://cgit.freedesktop.org/pixman/tree/pixman/pixman.c?id=pixman-0.22.0#n641
We have the 'scaling-crash-test.c' file in the test directory, which
tries to do some nasty scaling-related things and trigger some problems:
    http://cgit.freedesktop.org/pixman/tree/test/scaling-crash-test.c?id=pixman-0.22.0
But we might need to extend it with some more test cases specifically
targeting this expression of yours, or maybe even do some random fuzzing.
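
If the expression is indeed safe, one way to make the assumptions
explicit would be something like this (just a sketch; the bounds stated
below are assumptions to be verified against the checks linked above,
not facts):

    /* Assumption (to be verified): the NORMAL repeat handling keeps
     * vx in [0, new_src_width_fixed), so the worst case of the
     * expression is at vx == 0, and it stays within pixman_fixed_t
     * only if new_src_width_fixed + unit_x <= INT32_MAX + 1. */
    assert (vx >= 0 && vx < new_src_width_fixed);
    assert (unit_x > 0 &&
            (int64_t) new_src_width_fixed + unit_x <= (int64_t) INT32_MAX + 1);

Alternatively the intermediate arithmetic could simply be done in a
64-bit type, at the cost of a more expensive division.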

One more possible problem here is that we get a division operation in
the inner loop, which may be performed as frequently as once per
PIXMAN_REPEAT_NORMAL_MIN_WIDTH pixels (8 in your patch) and impact
performance. Division is particularly bad on ARM processors, which
don't have a special instruction for it, and it is also relatively
slow on x86. If this division could be avoided by using some
Bresenham-like algorithm, it would probably be a good improvement.
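
Roughly what I have in mind (just an untested sketch; it assumes
unit_x <= new_src_width_fixed, and the first period of each scanline
would still need the general formula because the initial vx can be
anywhere in [0, new_src_width_fixed)). With

    q = new_src_width_fixed / unit_x;
    r = new_src_width_fixed % unit_x;

computed once per scanline, vx lands in [0, unit_x) after every wrap,
and each following period is either q or q + 1 pixels long, so no
further division is needed:

    if (vx < r)
    {
        num_pixels = q + 1;              /* this period covers one extra sample */
        vx_after_wrap = vx + unit_x - r;
    }
    else
    {
        num_pixels = q;
        vx_after_wrap = vx - r;
    }
    if (num_pixels > width_remain)       /* same clamping as in the patch */
        num_pixels = width_remain;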

+	    if (src_image->bits.width < PIXMAN_REPEAT_NORMAL_MIN_WIDTH)	\
+	    {								\
+		for (i=0; i<PIXMAN_REPEAT_NORMAL_MIN_WIDTH; )		\
+		{							\
+		    for (j=0; j<src_image->bits.width; j++, i++ )	\
+		    {							\
+			expended_top[i] = src_line_top[j];		\
+			expended_bottom[i] = src_line_bottom[j];	\
+		    }							\
+		}							\
+		new_src_width = i;					\
+		src_line_top = &expended_top[0];			\
+		src_line_bottom = &expended_bottom[0];			\
+	    }								\

Even though it's likely not a big issue here, at least in the case
of bilinear scaling these prepared extended source scanlines could be
reused across different iterations instead of being regenerated each
time. If the scale factor is close to 1x, each source scanline is
going to be accessed approximately twice, for 2x upscaling about four
times, and so on.
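
For example, something along these lines (sketch only: 'expand_line',
'expanded_y_top', 'expanded_y_bottom', 'y_top' and 'y_bottom' are
hypothetical names; the buffer names are taken from the quoted patch):

    /* Refill the expanded buffers only when the bilinear loop has
     * moved on to different source lines. */
    if (expanded_y_top != y_top)
    {
        expand_line (expended_top, src_line_top, src_image->bits.width);
        expanded_y_top = y_top;
    }
    if (expanded_y_bottom != y_bottom)
    {
        expand_line (expended_bottom, src_line_bottom, src_image->bits.width);
        expanded_y_bottom = y_bottom;
    }
    src_line_top = &expended_top[0];
    src_line_bottom = &expended_bottom[0];

And when stepping to the next destination row, the old bottom line
often becomes the new top line, so swapping the two buffers could
avoid even that copy.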

-- 
Best regards,
Siarhei Siamashka

