[Pixman] [PATCH] vmx: implement fast path vmx_composite_over_n_8888

Sun Sep 27 22:52:05 PDT 2015

On Tue, 22 Sep 2015 16:03:22 +0300
Oded Gabbay <oded.gabbay at gmail.com> wrote:

> On Tue, Sep 22, 2015 at 3:59 PM, Pekka Paalanen <ppaalanen at gmail.com> wrote:
> > On Mon, 21 Sep 2015 14:22:53 +0300
> > Oded Gabbay <oded.gabbay at gmail.com> wrote:
> >
> >> On Thu, Sep 10, 2015 at 7:16 PM, Siarhei Siamashka
> >> <siarhei.siamashka at gmail.com> wrote:
> >> >
> >> > On Thu, 10 Sep 2015 12:27:18 +0300
> >> > Oded Gabbay <oded.gabbay at gmail.com> wrote:
> >> >
> >> > > On Sat, Sep 5, 2015 at 10:03 PM, Oded Gabbay <oded.gabbay at gmail.com> wrote:
> >> > > >
> >> > > > On Fri, Sep 4, 2015 at 3:39 PM, Siarhei Siamashka
> >> > > > <siarhei.siamashka at gmail.com> wrote:
> >> > > > > Running "lowlevel-blt-bench over_n_8888" on Playstation3 3.2GHz,
> >> > > > > Gentoo ppc (32-bit userland) gave the following results:
> >> > > > >
> >> > > > > before:  over_n_8888 =  L1: 147.47  L2: 205.86  M:121.07
> >> > > > > after:   over_n_8888 =  L1: 287.27  L2: 261.09  M:133.48
> >> > > > >
> >> > > > > Signed-off-by: Siarhei Siamashka <siarhei.siamashka at gmail.com>
> >> > > > > ---
> >> > > > >  pixman/pixman-vmx.c |   54 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >> > > > >  1 files changed, 54 insertions(+), 0 deletions(-)
> >> > > > >
> >> > > > > diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
> >> > > > > index a9bd024..9e551b3 100644
> >> > > > > --- a/pixman/pixman-vmx.c
> >> > > > > +++ b/pixman/pixman-vmx.c
> >> > > > > @@ -2745,6 +2745,58 @@ vmx_composite_src_x888_8888 (pixman_implementation_t *imp,
> >> > > > >  }
> >> > > > >
> >> > > > >  static void
> >> > > > > +vmx_composite_over_n_8888 (pixman_implementation_t *imp,
> >> > > > > +                           pixman_composite_info_t *info)
> >> > > > > +{
> >> > > > > +    PIXMAN_COMPOSITE_ARGS (info);
> >> > > > > +    uint32_t *dst_line, *dst;
> >> > > > > +    uint32_t src, ia;
> >> > > > > +    int      i, w, dst_stride;
> >> > > > > +    vector unsigned int vdst, vsrc, via;
> >> > > > > +
> >> > > > > +    src = _pixman_image_get_solid (imp, src_image, dest_image->bits.format);
> >> > > > > +
> >> > > > > +    if (src == 0)
> >> > > > > +       return;
> >> > > > > +
> >> > > > > +    PIXMAN_IMAGE_GET_LINE (
> >> > > > > +       dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
> >> > > > > +
> >> > > > > +    vsrc = (vector unsigned int){src, src, src, src};
> >> > > > > +    via = negate (splat_alpha (vsrc));
> >> > > > If we will use the over function (see my next comment), we need to
> >> > > > remove the negate() from the above statement, as it is done in the
> >> > > > over function.
> >> > > >
> >> > > > > +    ia = ALPHA_8 (~src);
> >> > > > > +
> >> > > > > +    while (height--)
> >> > > > > +    {
> >> > > > > +       dst = dst_line;
> >> > > > > +       dst_line += dst_stride;
> >> > > > > +       w = width;
> >> > > > > +
> >> > > > > +       while (w && ((uintptr_t)dst & 15))
> >> > > > > +       {
> >> > > > > +           uint32_t d = *dst;
> >> > > > > +           UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> >> > > > > +           *dst++ = d;
> >> > > > > +           w--;
> >> > > > > +       }
> >> > > > > +
> >> > > > > +       for (i = w / 4; i > 0; i--)
> >> > > > > +       {
> >> > > > > +           vdst = pix_multiply (load_128_aligned (dst), via);
> >> > > > > +           save_128_aligned (dst, pix_add (vsrc, vdst));
> >> > > >
> >> > > > Instead of the above two lines, I would simply use the over function
> >> > > > in vmx, which does exactly that. So:
> >> > > >                 vdst = over(vsrc, via, load_128_aligned(dst))
> >> > > >                 save_128_aligned (dst, vdst);
> >> > > >
> >> > > > I prefer this as it reuses an existing function which helps
> >> > > > maintainability, and using it has no impact on performance.

The only difference between these variants is that my variant of the
patch explicitly moved the calculation of the negated alpha out of the
loop and is a more accurate representation of the assembly code, which
is expected to be generated. Your variant relies on the compiler to
do this job properly. As far as the readability/maintainability is
concerned, there is no significant difference. There is the same
number of lines of code. Also my variant of the patch is handling
the main loop and the leading/trailing pixels in a more uniform
way. Just compare

    via = negate (splat_alpha (vsrc));
with
    ia = ALPHA_8 (~src);

and

    vdst = pix_multiply (load_128_aligned (dst), via);
    save_128_aligned (dst, pix_add (vsrc, vdst));
with
    uint32_t d = *dst;
    UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
    *dst++ = d;

> >> > > > > +           dst += 4;
> >> > > > > +       }
> >> > > > > +
> >> > > > > +       for (i = w % 4; --i >= 0;)
> >> > > > > +       {
> >> > > > > +           uint32_t d = dst[i];
> >> > > > > +           UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> >> > > > > +           dst[i] = d;
> >> > > > > +       }
> >> > > > > +    }
> >> > > > > +}
> >> > > > > +
> >> > > > > +static void
> >> > > > >  vmx_composite_over_8888_8888 (pixman_implementation_t *imp,
> >> > > > >                                 pixman_composite_info_t *info)
> >> > > > >  {
> >> > > > > @@ -3079,6 +3131,8 @@ FAST_NEAREST_MAINLOOP (vmx_8888_8888_normal_OVER,
> >> > > > >
> >> > > > >  static const pixman_fast_path_t vmx_fast_paths[] =
> >> > > > >  {
> >> > > > > +    PIXMAN_STD_FAST_PATH (OVER, solid,    null, a8r8g8b8, vmx_composite_over_n_8888),
> >> > > > > +    PIXMAN_STD_FAST_PATH (OVER, solid,    null, x8r8g8b8, vmx_composite_over_n_8888),
> >> > > > >      PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, vmx_composite_over_8888_8888),
> >> > > > >      PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, vmx_composite_over_8888_8888),
> >> > > > >      PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, vmx_composite_over_8888_8888),
> >> > > > > --
> >> > > > > 1.7.8.6
> >> > > > >
> >
> >>
> >> I took your advice and run benchmarks with the *non* trimmed traces.
> >> It takes 144 minutes to run a benchmark on POWER8, 3.4GHz 8 Cores (a
> >> raw machine, not VM).
> >>
> >> I compared with and without this patch and I got:
> >>
> >> image          ocitysmap             659.69    -> 611.71    :  1.08x speedup
> >> image          xfce4-terminal-a1  2725.22 -> 2547.47  :  1.07x speedup
> >>
> >> So I guess we can merge this patch, because I prefer the non-trimmed
> >> results over the trimmed ones.
> >> The low-level-blt giving significant improvement is also a good sign.
> >>
> >> Therefore:
> >> Reviewed-by: Oded Gabbay <oded.gabbay at gmail.com>
> >
> > Hi,
> >
> > this is only touching pixman-vmx.c and I don't see anything to complain
> > about here, so as long as you're happy,
> > Acked-by Pekka Paalanen <pekka.paalanen at collabora.co.uk>
> >
> > Did you overrule your own comments? :-)
> 
> Oh, right.
> Now I see I had made some comments above about using existing helper function.
> 
> Siarhei,
> 
> Do you object that I will replace the lines above with call to the
> existing helper function, before pushing this patch ?

I have no objections. Both variants are perfectly fine for me and you
are the one, who is taking care of the powerpc code.

-- 
Best regards,
Siarhei Siamashka