[Pixman] [PATCH 0/2] REPEAT_NORMAL support for scaled bilinear functions

Sun Jun 5 15:58:23 PDT 2011

On Tue, May 31, 2011 at 4:52 PM, Taekyun Kim <podain77 at gmail.com> wrote:
> Hi,
> I have modified previous REPEAT_NORMAL patches based on siarhei's
> suggestions.

OK, thanks for the patches.

> You can find previous patches from here
> http://lists.freedesktop.org/archives/pixman/2011-May/001249.html
> This patch set is focused on "Bilinear Scaling". Handling REPEAT_NORMAL
> inside scanline functions can be more effective than stitching approach. And
> it has already done for nearest scaling by siarhei siamashka.
> http://cgit.freedesktop.org/~siamashka/pixman/log/?h=nearest-normal-repeat
>
> Major changes compared to previous patches are
> 1. Flags to configure the macro template rather than boolean values
> 2. Possible overflows are eliminated
> 3. Extended source scanlines are reused when they do not need to be updated

One of the immediate problems is that your patches break solid mask
support for bilinear scaling template. It went unnoticed for you just
because there are no functions to use it yet. I had a C fast path, but
it was not applied to to git master (you can cherry pick it, update
and use for testing):
    http://cgit.freedesktop.org/~siamashka/pixman/commit/?h=sent/bilinear-scaling-simd-20110222&id=55c58ab9d61f1575
The whole point is that any functionality is better to be stressed by
some tests. When I implemented initial bilinear main loop template, it
was tested together with the C functions from that branch. And that's
actually one of the reasons why you did not have to deal with any bugs
in this code when adding new NEON bilinear fast paths using a8 mask ;)

One way to solve the problem is to add some useful bilinear fast path
for solid mask, maybe SSE2 would be a good one because it would be
quite easy to test for everyone. That C fast path was not pushed to
git master mainly because of
    http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00888.html

> I have added following flags.
> FLAG_NONE
> FLAG_SCANLINE_SUPPORT_REPEAT_NORMAL
> FLAG_HAVE_SOLID_MASK
> FLAG_HAVE_NON_SOLID_MASK
> FLAG_SCANLINE_SUPPORT_REPEAT_NORMAL means that template parameter
> 'scanline_func' can handle normal repeat inside of it. So if this flag is
> on, the template bypasses repeat handling. Later we can configure macro
> template not to use stitching by turning this flag on.

Sorry for causing a bit of confusion. I mainly suggested to have such
SCANLINE_SUPPORT_REPEAT_NORMAL flag for selecting between "plan a" and
"plan b" for nearest scaling (because handling normal repeat in
scanline functions existed for C fast paths long ago, and NEON code
can be tweaked to use it). Now this flag got somewhat less obviously
useful, even though we can try handling normal repeat in bilinear
scanline functions too. And I'm a bit worried whether we can verify
correctness of this code right now and make sure that it does not bit
rot. Do you have any plans to try implementing normal repeat inside of
bilinear scanline functions yourself?

> But I'm a bit confused that scanline function may expect positive vx and
> unit_x > 0. Handling repeat normal using while( vx >= max_vx ) vx -= max_vx
> requires that assumption. This also means that such scanline functions
> cannot handle negative unit_x. And vx should be handled at least once
> outside the scanline functions (to get positive vx).
> Current FAST_NEAREST_SCANLINE(Used by C fast paths) can handle only positive
> unit_x and currently we achieve this assumption by setting
> FAST_PATH_X_UNIT_POSITIVE flag. FAST_NEAREST/BILINEAR_MAINLOOP_INT does not
> know whether the scanline can handle negative unit_x, so the correctness is
> achieved by setting proper fast path flags. I'm a bit worried about this.

This unit_x > 0 specialization was considered to be a good idea
earlier (sacrifice some flexibility, but gain more performance):
    http://cgit.freedesktop.org/pixman/commit/?id=265ea1fb4d05a920

Though it may be a good idea to change this because supporting
negative 'unit_x' is needed for getting good performance at:
    http://ie.microsoft.com/testdrive/Performance/FishIETank/

> Previously I proposed FLAG_SCANLINE_SUPPORT_NEGATIVE_UNIT_X to explicitly
> control this and I think it is more readable. Can give me some comments
> about this?

I just would prefer not to have different variants of scanline
functions and handle all the complexity in the main loop template.

Ordinary NONE/PAD scanline functions even do not care about the sign
of 'unit_x' variable. They just start at 'vx' and add 'unit_x' to it
after each pixel, so we walk backwards but that's still fine.

Scanline functions for NORMAL repeat are a bit more tricky but also
possible. If you are interested, I will explain it with a bit more
details (this can be done similar to the code from my
'nearest-normal-repeat' branch).

> Mask related flags are slightly changed to (NONE, SOLID_MASK,
> NON_SOLID_MASK). I prefer this kind of convention because it strictly
> identifies the configuration by a single flag.
> Below is some performance benchmark result of sse2 and NEON fast paths.
> It seems that effect of scanline reusing is not that good than I expected
> (maybe there're some mistakes in my patch).

I think that scanline reusing code is a bit more complex than it can
be. I will provide a bit more detailed comment later.

> I think this is because
> previously used scanline is supposed to be in L1 cache or
> REPEAT_NORMAL_MIN_WIDTH is too small. When applying this approach to
> non-scaled functions, performance was even better than reference compositing
> because memory access was the major bottleneck.
> ----- sse2 benchmark -----
> //////////////////////////////////////////////////////////////////////////////
> // op=SRC, src=a8r8g8b8, mask=None, dst=a8r8g8b8
> /////////////////////////////////////////////////////////////////////////////
> <<<<< Reference Compositing Performance 2000x2000 to 2000x2000 >>>>>
> Non-scaled : 372.30Mpix/s Nearest-scaled : 348.53Mpix/s Bilinear-scaled :
> 122.59Mpix/s <<<<< src = 1 x 512 dst = 512 x 512 >>>>> - Bilinear-Scaled
> (close to 1.x)- Before : REPEAT_NORMAL : 28.16Mpix/s
> After : REPEAT_NORMAL : 38.46Mpix/s (without source line extension)
> After : REPEAT_NORMAL : 74.70Mpix/s (use source line extension but not
> reusing)
> After : REPEAT_NORMAL : 75.88Mpix/s
> <<<<< src = 15 x 15 dst = 512 x 512 >>>>> - Bilinear-Scaled (close to 1.x)-
> Before : REPEAT_NORMAL : 24.56Mpix/s
> After : REPEAT_NORMAL : 92.69Mpix/s (without source line extension)
> After : REPEAT_NORMAL : 89.70Mpix/s (use source line extension but not
> reusing)
> After : REPEAT_NORMAL : 89.33Mpix/s
> <<<<< src = 63 x 63 dst = 512 x 512 >>>>> - Bilinear-Scaled (close to 1.x)-
> Before : REPEAT_NORMAL : 24.78Mpix/s
> After : REPEAT_NORMAL : 114.72Mpix/s (without source line extension)
> After : REPEAT_NORMAL : 114.53Mpix/s (use source line extension but not
> reusing)
> After : REPEAT_NORMAL : 114.25Mpix/s
> ----- ARM NEON benchmark -----
> //////////////////////////////////////////////////////////////////////////////
> // op=SRC, src=a8r8g8b8, mask=None, dst=a8r8g8b8
> /////////////////////////////////////////////////////////////////////////////
> <<<<< Reference Compositing Performance 2000x2000 to 2000x2000 >>>>>
> Non-scaled : 158.33Mpix/s Nearest-scaled : 144.23Mpix/s Bilinear-scaled :
> 99.89Mpix/s <<<<< src = 1 x 512 dst = 512 x 512 >>>>> - Bilinear-Scaled
> (close to 1.x)- Before : REPEAT_NORMAL : 5.64Mpix/s
> After : REPEAT_NORMAL : 11.25Mpix/s (without source line extension)
> After : REPEAT_NORMAL : 37.08Mpix/s (use source line extension but not
> reusing)
> After : REPEAT_NORMAL : 36.73Mpix/s

Getting just a little more than ~1/3 of performance of ordinary
bilinear scaling does not look very good. Have you tried increasing
REPEAT_NORMAL_MIN_WIDTH? The reuse of previously generated temporary
scanlines should be surely more effective if the sizes of these
buffers get larger. Also have you tried getting rid of division? I
don't mean that it is strictly necessary, but at least we should try.
Because I suspect that it is the main performance killer for this code
on ARM.

<<<<< src = 15 x 15 dst = 512 x 512
>>>>>> - Bilinear-Scaled (close to 1.x)- Before : REPEAT_NORMAL : 3.78Mpix/s
> After : REPEAT_NORMAL : 50.38Mpix/s (without source line extension)
> After : REPEAT_NORMAL : 51.36Mpix/s (use source line extension but not
> reusing) After : REPEAT_NORMAL : 50.96Mpix/s
> <<<<< src = 63 x 63 dst = 512 x 512 >>>>> - Bilinear-Scaled (close to 1.x)-
> Before : REPEAT_NORMAL : 4.13Mpix/s
> After : REPEAT_NORMAL : 82.80Mpix/s (without source line extension)
> After : REPEAT_NORMAL : 83.65Mpix/s (use source line extension but not
> reusing)
> After : REPEAT_NORMAL : 83.61Mpix/s

Also about the patches themselves. Currently you split them into two
parts, which is already better for bisecting than having just one
patch. The first one introduces some changes to the template and
testing it we can see whether any existing functionality is affected.
The second is enabling new normal repeated bilinear fast paths. And
this is already useful - the breakage for solid mask is introduced by
the first patch, which helps a bit when making an initial guess about
what could be wrong there.

But in my opinion even better split would be:
1. Replace boolean arguments with flags (that's a code cleanup which
is supposed to have no functional changes)
2. Introduce normal repeat support in bilinear template
3. Introduce the REPEAT_NORMAL_MIN_WIDTH optimization using temporary buffers
4. Enable normal repeat bilinear fast paths

-- 
Best regards,
Siarhei Siamashka