[Pixman] [PATCH] vmx: implement fast path vmx_composite_over_n_8888

Thu Sep 10 02:27:18 PDT 2015

On Sat, Sep 5, 2015 at 10:03 PM, Oded Gabbay <oded.gabbay at gmail.com> wrote:
>
> On Fri, Sep 4, 2015 at 3:39 PM, Siarhei Siamashka
> <siarhei.siamashka at gmail.com> wrote:
> > Running "lowlevel-blt-bench over_n_8888" on Playstation3 3.2GHz,
> > Gentoo ppc (32-bit userland) gave the following results:
> >
> > before:  over_n_8888 =  L1: 147.47  L2: 205.86  M:121.07
> > after:   over_n_8888 =  L1: 287.27  L2: 261.09  M:133.48
> >
> > Signed-off-by: Siarhei Siamashka <siarhei.siamashka at gmail.com>
> > ---
> >  pixman/pixman-vmx.c |   54 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 54 insertions(+), 0 deletions(-)
> >
> > diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
> > index a9bd024..9e551b3 100644
> > --- a/pixman/pixman-vmx.c
> > +++ b/pixman/pixman-vmx.c
> > @@ -2745,6 +2745,58 @@ vmx_composite_src_x888_8888 (pixman_implementation_t *imp,
> >  }
> >
> >  static void
> > +vmx_composite_over_n_8888 (pixman_implementation_t *imp,
> > +                           pixman_composite_info_t *info)
> > +{
> > +    PIXMAN_COMPOSITE_ARGS (info);
> > +    uint32_t *dst_line, *dst;
> > +    uint32_t src, ia;
> > +    int      i, w, dst_stride;
> > +    vector unsigned int vdst, vsrc, via;
> > +
> > +    src = _pixman_image_get_solid (imp, src_image, dest_image->bits.format);
> > +
> > +    if (src == 0)
> > +       return;
> > +
> > +    PIXMAN_IMAGE_GET_LINE (
> > +       dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
> > +
> > +    vsrc = (vector unsigned int){src, src, src, src};
> > +    via = negate (splat_alpha (vsrc));
> If we use the over function (see my next comment), we need to
> remove the negate() from the above statement, because the over
> function already does it.
>
> > +    ia = ALPHA_8 (~src);
> > +
> > +    while (height--)
> > +    {
> > +       dst = dst_line;
> > +       dst_line += dst_stride;
> > +       w = width;
> > +
> > +       while (w && ((uintptr_t)dst & 15))
> > +       {
> > +           uint32_t d = *dst;
> > +           UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> > +           *dst++ = d;
> > +           w--;
> > +       }
> > +
> > +       for (i = w / 4; i > 0; i--)
> > +       {
> > +           vdst = pix_multiply (load_128_aligned (dst), via);
> > +           save_128_aligned (dst, pix_add (vsrc, vdst));
>
> Instead of the above two lines, I would simply use the over function
> in vmx, which does exactly that. So:
>                 vdst = over (vsrc, via, load_128_aligned (dst));
>                 save_128_aligned (dst, vdst);
>
> I prefer this as it reuses an existing function which helps
> maintainability, and using it has no impact on performance.
>
> > +           dst += 4;
> > +       }
> > +
> > +       for (i = w % 4; --i >= 0;)
> > +       {
> > +           uint32_t d = dst[i];
> > +           UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> > +           dst[i] = d;
> > +       }
> > +    }
> > +}
> > +
> > +static void
> >  vmx_composite_over_8888_8888 (pixman_implementation_t *imp,
> >                                 pixman_composite_info_t *info)
> >  {
> > @@ -3079,6 +3131,8 @@ FAST_NEAREST_MAINLOOP (vmx_8888_8888_normal_OVER,
> >
> >  static const pixman_fast_path_t vmx_fast_paths[] =
> >  {
> > +    PIXMAN_STD_FAST_PATH (OVER, solid,    null, a8r8g8b8, vmx_composite_over_n_8888),
> > +    PIXMAN_STD_FAST_PATH (OVER, solid,    null, x8r8g8b8, vmx_composite_over_n_8888),
> >      PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, vmx_composite_over_8888_8888),
> >      PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, vmx_composite_over_8888_8888),
> >      PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, vmx_composite_over_8888_8888),
> > --
> > 1.7.8.6
> >
>
> Indeed, this implementation is much better than what I did.
> Apparently, converting sse2 to vmx calls isn't the optimal way.
> On my POWER8 machine, I get:
>
> reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
>                before        after     change
> L1             572.29      1539.47   +169.00%
> L2            1038.08      1549.04    +49.22%
> M             1104.1       1522.22    +37.87%
> HT             447.45       676.32    +51.15%
> VT             520.82       764.82    +46.85%
> R              407.92       570.54    +39.87%
> RT             148.9        208.77    +40.21%
> Kops/s        1100         1418       +28.91%
>
> So, assuming the change above, this patch is:
>
> Reviewed-by: Oded Gabbay <oded.gabbay at gmail.com>
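
For reference, the review comment quoted above rests on the two
formulations being arithmetically identical: hoisting the negated alpha
out of the loop, or passing the plain alpha to a helper that negates it
internally. Below is a minimal scalar sketch of that equivalence in
plain C. It is a model for illustration only, not the vmx code: the
names mul_un8, add_sat_un8, over_hoisted and over_helper are
hypothetical, and it assumes (as the comment says) that the over helper
performs the negate itself.

```c
#include <stdint.h>

/* Rounded (a * b) / 255 on one 8-bit channel. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)(((t >> 8) + t) >> 8);
}

/* Saturating add on one 8-bit channel. */
static uint8_t add_sat_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a + b;
    return (uint8_t)(t > 255 ? 255 : t);
}

/* Form used in the patch: the caller precomputes ia = 255 - alpha once. */
static uint32_t over_hoisted(uint32_t src, uint8_t ia, uint32_t dst)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint8_t s = (uint8_t)(src >> shift);
        uint8_t d = (uint8_t)(dst >> shift);
        out |= (uint32_t)add_sat_un8(s, mul_un8(d, ia)) << shift;
    }
    return out;
}

/* Form suggested in review: the helper takes the plain alpha and
 * negates it internally (assumption about the helper's behaviour). */
static uint32_t over_helper(uint32_t src, uint8_t alpha, uint32_t dst)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint8_t s = (uint8_t)(src >> shift);
        uint8_t d = (uint8_t)(dst >> shift);
        uint8_t ia = (uint8_t)(255 - alpha);
        out |= (uint32_t)add_sat_un8(s, mul_un8(d, ia)) << shift;
    }
    return out;
}
```

For a premultiplied source pixel src with alpha a = src >> 24,
over_hoisted (src, 255 - a, dst) and over_helper (src, a, dst) return
the same value, which is why the negate() has to move out of the caller
when the helper is used.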

Hi Siarhei,

After I fixed my cairo setup (see
http://lists.freedesktop.org/archives/pixman/2015-September/003987.html),
I re-tested your patch with the cairo trimmed benchmark against
current pixman master.
Unfortunately, it gives a minor slowdown:

Slowdowns
=========
t-firefox-scrolling  1232.30 -> 1295.75 :  1.05x slowdown

Even if I apply your patch on top of my latest patch-set (which was
inspired by your patch), I still get a slowdown, albeit in a different
trace:

Slowdowns
=========
t-firefox-asteroids  440.01 -> 469.68:  1.07x

What's your take on this?

         Oded