[Pixman] [PATCH 1/5] REPEAT_NORMAL support for nearest bilinear fast path

Taekyun Kim podain77 at gmail.com
Tue May 24 08:02:26 PDT 2011


Thanks for the reply!!

2011/5/23 Siarhei Siamashka <siarhei.siamashka at gmail.com>

> On Mon, May 16, 2011 at 3:27 PM, Taekyun Kim <podain77 at gmail.com> wrote:
> > Patch 1
>
> Would it make sense to handle REPEAT_NORMAL optimizations separately
> for nearest and bilinear scaling? I mean, you posted the benchmark
> numbers for bilinear scaling, but nearest scaling does not seem to be
> covered and it would be interesting to know whether it has actually
> improved and how much. Currently nearest scaling is particularly
> interesting on ARM devices without NEON instructions support because
> it is likely that they can't afford the overhead of bilinear scaling
> and have to fallback to nearest. Also there are some coding style
> issues in the patches which are better to be addressed.
>


For the sse2 and NEON fast paths, nearest was also quite a bit faster than the
general path. I will post performance data for various nearest/bilinear cases
later, with your comments applied.

As you mentioned, the C fast path is slower than before due to the function
call overhead and the division inside the loop.

To support REPEAT_NORMAL, we can either make the scanline functions handle it
themselves (Plan A) or take a break-down-into-non-repeating-spans approach
(Plan B). Plan B obviously incurs function call overhead, so basically I think
Plan A would be the optimal solution. But it requires a lot of changes to the
current fast path implementations.

As an alternative, we can take Plan B first (because we can benefit from the
already implemented sse2 and NEON fast paths) and later move fast paths to
Plan A one by one where it is required. The following is my suggestion for
doing that.

Basically, the nearest/bilinear fast path templates take scanline_func as an
argument, so vertical repeat is handled in the macro templates. A scanline_func
may or may not handle repeat internally, so we pass the following flags to the
macro templates to indicate which repeat modes the given scanline_func
supports.

    FAST_PATH_SCANLINE_SUPPORT_REPEAT_NONE
    FAST_PATH_SCANLINE_SUPPORT_REPEAT_PAD
    FAST_PATH_SCANLINE_SUPPORT_REPEAT_NORMAL
    FAST_PATH_SCANLINE_SUPPORT_REPEAT_REFLECT

If none of these flags is set, scanline_func can only process
SAMPLES_COVER_CLIP cases. And, as you suggested, the mask-related boolean
values can also be replaced with flags like

    FAST_PATH_HAVE_MASK
    FAST_PATH_MASK_IS_SOLID

Additionally, the following flag can be useful for REPEAT_REFLECT and
horizontal flipping.

     FAST_PATH_SCANLINE_SUPPORT_NEGATIVE_X_UNIT
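
Collected in one place, the proposed flags would just be bit defines along
these lines (the values are arbitrary and only meant as an illustration, not
actual pixman definitions):

    #define FAST_PATH_SCANLINE_SUPPORT_REPEAT_NONE     (1 << 0)
    #define FAST_PATH_SCANLINE_SUPPORT_REPEAT_PAD      (1 << 1)
    #define FAST_PATH_SCANLINE_SUPPORT_REPEAT_NORMAL   (1 << 2)
    #define FAST_PATH_SCANLINE_SUPPORT_REPEAT_REFLECT  (1 << 3)
    #define FAST_PATH_HAVE_MASK                        (1 << 4)
    #define FAST_PATH_MASK_IS_SOLID                    (1 << 5)
    #define FAST_PATH_SCANLINE_SUPPORT_NEGATIVE_X_UNIT (1 << 6)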

So FAST_NEAREST_MAINLOOP_INT would look something like this:

    FAST_NEAREST_MAINLOOP_INT(
                     scale_func_name, scanline_func,
                     src_type_t, mask_type_t, dst_type_t, repeat_mode, flag)

If scanline_func supports repeat_mode, we pass the unhandled vx and max_vx to
the scanline_func (Plan A). If not, we take the Plan B path as a fallback.
REPEAT_REFLECT is not easy to implement right now, because scanline_func does
not currently support horizontal flipping (negative unit_x).
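
To make the dispatch idea concrete, here is a rough sketch in plain C (not the
actual pixman template; the scanline function signature and the flag value are
simplified and made up for this example):

    #include <stdint.h>

    typedef int32_t pixman_fixed_t;
    #define pixman_fixed_e ((pixman_fixed_t) 1)

    /* the proposed flag from above; the value is arbitrary for this sketch */
    #define FAST_PATH_SCANLINE_SUPPORT_REPEAT_NORMAL (1 << 2)

    typedef void (*scanline_func_t) (uint32_t       *dst,
                                     const uint32_t *src,
                                     int             w,
                                     pixman_fixed_t  vx,
                                     pixman_fixed_t  unit_x,
                                     pixman_fixed_t  max_vx);

    static void
    nearest_row_normal_repeat (uint32_t *dst, const uint32_t *src, int w,
                               pixman_fixed_t vx, pixman_fixed_t unit_x,
                               pixman_fixed_t max_vx,
                               scanline_func_t scanline_func, uint32_t flags)
    {
        if (flags & FAST_PATH_SCANLINE_SUPPORT_REPEAT_NORMAL)
        {
            /* Plan A: the scanline function wraps vx around max_vx itself */
            scanline_func (dst, src, w, vx, unit_x, max_vx);
            return;
        }

        /* Plan B: break the row into spans that stay inside one copy of the
         * source and call the non-repeating scanline function per span */
        while (w > 0)
        {
            int num_pixels = ((max_vx - vx - pixman_fixed_e) / unit_x) + 1;

            if (num_pixels > w)
                num_pixels = w;

            scanline_func (dst, src, num_pixels, vx, unit_x, max_vx);

            dst += num_pixels;
            w   -= num_pixels;
            vx  += num_pixels * unit_x;
            while (vx >= max_vx)
                vx -= max_vx;
        }
    }

With the flag set, the whole per-span loop collapses into a single call.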

One more important thing is that the current standard (non-scaled) fast paths
do not have common templates; they usually handle the vertical and horizontal
loops in one function. REPEAT_NORMAL is relatively easy to implement by
invoking that function with height = 1. But PAD repeat is somewhat difficult
to implement, because the function does not take unit_x; it assumes that
unit_x is always pixman_fixed_1. As you know, the scaling functions can
implement PAD repeat easily by passing unit_x = 0 to the scanline function.
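
As a small illustration (the function names and signatures here are invented
for the example, this is not actual pixman code), a scanline function that
takes unit_x gets PAD repeat almost for free, simply by being called with
unit_x = 0 for the padding areas:

    #include <stdint.h>

    typedef int32_t pixman_fixed_t;
    #define pixman_int_to_fixed(i) ((pixman_fixed_t) ((i) << 16))

    static void
    nearest_scanline (uint32_t *dst, const uint32_t *src, int w,
                      pixman_fixed_t vx, pixman_fixed_t unit_x)
    {
        while (w--)
        {
            *dst++ = src[vx >> 16]; /* nearest sample */
            vx += unit_x;           /* unit_x == 0 keeps reading the same pixel */
        }
    }

    /* One destination row that is wider than the source: the left and right
     * padding areas replicate the edge pixels (PAD repeat), the middle part
     * is an ordinary 1:1 copy. */
    static void
    pad_repeat_row (uint32_t *dst, const uint32_t *src, int src_width,
                    int left_pad, int right_pad)
    {
        nearest_scanline (dst, src, left_pad, 0, 0);
        nearest_scanline (dst + left_pad, src, src_width,
                          0, pixman_int_to_fixed (1));
        nearest_scanline (dst + left_pad + src_width, src, right_pad,
                          pixman_int_to_fixed (src_width - 1), 0);
    }

A non-scaled fast path that hardwires unit_x == pixman_fixed_1 cannot use this
trick, which is exactly why PAD repeat is awkward there.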

I've already done some REPEAT_NORMAL work for the ARM standard fast paths and
it shows a great performance gain. But non-scaled PAD repeat (not
SAMPLES_COVER_CLIP) still accounts for a huge amount of time when browsing
some popular web pages.


> + /* num_pixels = max(n), where  vx + (n - 1)*unit_x < width_fixed */          \
> + num_pixels = (new_src_width_fixed - vx + unit_x - pixman_fixed_e) / unit_x;  \
> + if (num_pixels > width_remain)                                               \
> +     num_pixels = width_remain;                                               \
>
> I'm also a bit worried about this part. One thing that makes me feel a
> bit uneasy is the following expression: "(new_src_width_fixed - vx +
> unit_x - pixman_fixed_e)". Can we be sure that it can never overflow
> pixman_fixed_t type? If yes, then some descriptive comment about the
> valid range of values and assumptions would be useful here. There are
> some checks in pixman code which reject certain range of parameters
> for the compositing operation and impose some constraints:
>
> http://cgit.freedesktop.org/pixman/tree/pixman/pixman.c?id=pixman-0.22.0#n556
>
> http://cgit.freedesktop.org/pixman/tree/pixman/pixman.c?id=pixman-0.22.0#n572
>
> http://cgit.freedesktop.org/pixman/tree/pixman/pixman.c?id=pixman-0.22.0#n641
> We have 'scaling-crash-test.c' file in test directory which tries to
> do some nasty scaling related things and trigger some problems:
>
> http://cgit.freedesktop.org/pixman/tree/test/scaling-crash-test.c?id=pixman-0.22.0
> But we might need to extend it with some more test cases specifically
> targeting this your expression, or maybe even do some random fuzzing.
>
>


We have to find max(n), where vx + n*unit_x < max_vx.
This is the same problem as finding max(n) where n*b < c, which gives
n = (c - 1)/b in integer arithmetic (with b and c positive). Since pixels
0 through n are all still inside the source, num_pixels = n + 1.
The posted expression passed scaling-test, but I figured out that

    num_pixels = ((max_vx - vx - pixman_fixed_e) / unit_x) + 1

is the correct expression (new_src_width_fixed is replaced with max_vx).
max_vx - vx - pixman_fixed_e cannot overflow, because vx is positive and
less than max_vx. I've added a comment about this.
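
For reference, here is a tiny standalone brute-force check of that expression
(just a quick sanity test, not part of the patch; pixman_fixed_e is 1 in
fixed-point units and the loop ranges are arbitrary):

    #include <assert.h>
    #include <stdint.h>

    typedef int32_t pixman_fixed_t;
    #define pixman_fixed_e ((pixman_fixed_t) 1)

    int
    main (void)
    {
        pixman_fixed_t max_vx, unit_x, vx;

        for (max_vx = 1; max_vx < 2048; max_vx += 7)
        {
            for (unit_x = 1; unit_x < 300; unit_x += 5)
            {
                for (vx = 0; vx < max_vx; vx += 11)
                {
                    int n = 0;
                    int num_pixels;

                    /* brute force: count pixels while vx + n*unit_x < max_vx */
                    while (vx + n * unit_x < max_vx)
                        n++;

                    num_pixels = ((max_vx - vx - pixman_fixed_e) / unit_x) + 1;
                    assert (num_pixels == n);
                }
            }
        }
        return 0;
    }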




> One more possible problem here is that we get a division operation in
> the inner loop, which may be performed as frequently as once per each
> PIXMAN_REPEAT_NORMAL_MIN_WIDTH pixels (8 in your patch) and impact
> performance. Division is particularly bad on ARM processors which
> don't have a special instruction for it, and it is also relatively
> slow for x86. If this division could be avoided by using some
> Bresenham's alike algorithm, it probably would be a good improvement.
>
> +           if (src_image->bits.width < PIXMAN_REPEAT_NORMAL_MIN_WIDTH) \
> +           {                                                           \
> +               for (i=0; i<PIXMAN_REPEAT_NORMAL_MIN_WIDTH; )           \
> +               {                                                       \
> +                   for (j=0; j<src_image->bits.width; j++, i++ )       \
> +                   {                                                   \
> +                       expended_top[i] = src_line_top[j];              \
> +                       expended_bottom[i] = src_line_bottom[j];        \
> +                   }                                                   \
> +               }                                                       \
> +               new_src_width = i;                                      \
> +               src_line_top = &expended_top[0];                        \
> +               src_line_bottom = &expended_bottom[0];                  \
> +           }                                                           \
>
> Even though it's likely not a big issue here, but at least in the case
> of bilinear scaling these prepared extended source scanlines can be
> reused across different iterations and not regenerated each times. In
> the case if we have scale factor close to 1x, each source scanline is
> going to be accessed approximately twice, for 2x upscaling - accessed
> 4x times, etc.
>



I've modified it to reuse the previous source scanline if it does not need an
update.
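
The caching change is along these lines (a sketch only: the buffer names, the
helper and the way the line indices are passed in are invented for this
example and differ from the actual patch):

    #include <stdint.h>

    #define PIXMAN_REPEAT_NORMAL_MIN_WIDTH 8

    /* 2x MIN_WIDTH is enough: the expansion always finishes the current copy
     * of the source, so with src_width < MIN_WIDTH it can overshoot MIN_WIDTH
     * by at most src_width - 1 pixels */
    static uint32_t  buf0[PIXMAN_REPEAT_NORMAL_MIN_WIDTH * 2];
    static uint32_t  buf1[PIXMAN_REPEAT_NORMAL_MIN_WIDTH * 2];
    static uint32_t *expanded_top    = buf0;
    static uint32_t *expanded_bottom = buf1;
    static int       cached_top_y    = -1;
    static int       cached_bottom_y = -1;

    /* tile a narrow source line (src_width < MIN_WIDTH) into dst */
    static void
    expand_line (uint32_t *dst, const uint32_t *src, int src_width)
    {
        int i = 0, j;

        while (i < PIXMAN_REPEAT_NORMAL_MIN_WIDTH)
            for (j = 0; j < src_width; j++, i++)
                dst[i] = src[j];
    }

    static void
    update_expanded_lines (const uint32_t *src_bits, int src_stride,
                           int src_width, int y1, int y2)
    {
        if (y1 == cached_bottom_y && y1 != cached_top_y)
        {
            /* the old bottom line becomes the new top line: swap the buffers
             * so that it does not get expanded a second time */
            uint32_t *tmp   = expanded_top;
            expanded_top    = expanded_bottom;
            expanded_bottom = tmp;
            cached_top_y    = cached_bottom_y;
            cached_bottom_y = -1;
        }

        if (y1 != cached_top_y)
        {
            expand_line (expanded_top, src_bits + y1 * src_stride, src_width);
            cached_top_y = y1;
        }
        if (y2 != cached_bottom_y)
        {
            expand_line (expanded_bottom, src_bits + y2 * src_stride, src_width);
            cached_bottom_y = y2;
        }
    }

For scale factors close to 1x this avoids roughly half of the expansions, and
for 2x upscaling roughly three quarters of them.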

-- 
Best Regards,
Taekyun Kim