[Pixman] [PATCH 1/4] vmx: optimize scaled_nearest_scanline_vmx_8888_8888_OVER

Siarhei Siamashka siarhei.siamashka at gmail.com
Thu Sep 17 09:35:41 PDT 2015


On Mon, 7 Sep 2015 14:07:19 +0300
Oded Gabbay <oded.gabbay at gmail.com> wrote:

> On Mon, Sep 7, 2015 at 2:03 PM, Pekka Paalanen <ppaalanen at gmail.com> wrote:
> > On Sun,  6 Sep 2015 18:27:08 +0300
> > Oded Gabbay <oded.gabbay at gmail.com> wrote:
> >
> >> This patch optimizes scaled_nearest_scanline_vmx_8888_8888_OVER and all
> >> the functions it calls (combine1, combine4 and
> >> core_combine_over_u_pixel_vmx).
> >>
> >> The optimization is done by removing use of expand_alpha_1x128 and
> >> expand_alpha_2x128 in favor of splat_alpha and MUL/ADD macros from
> >> pixman_combine32.h.
> >>
> >> Running "lowlevel-blt-bench -n over_8888_8888" on POWER8, 8 cores,
> >> 3.4GHz, RHEL 7.2 ppc64le gave the following results:
> >>
> >> reference memcpy speed = 24847.3MB/s (6211.8MP/s for 32bpp fills)
> >>
> >>                 Before          After           Change
> >>               --------------------------------------------
> >> L1              182.05          210.22         +15.47%
> >> L2              180.6           208.92         +15.68%
> >> M               180.52          208.22         +15.34%
> >> HT              130.17          178.97         +37.49%
> >> VT              145.82          184.22         +26.33%
> >> R               104.51          129.38         +23.80%
> >> RT              48.3            61.54          +27.41%
> >> Kops/s          430             504            +17.21%
> >>
> >> Signed-off-by: Oded Gabbay <oded.gabbay at gmail.com>
> >> ---
> >>  pixman/pixman-vmx.c | 80 ++++++++++++-----------------------------------------
> >>  1 file changed, 18 insertions(+), 62 deletions(-)
> >>
> >> diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
> >> index a9bd024..d9fc5d6 100644
> >> --- a/pixman/pixman-vmx.c
> >> +++ b/pixman/pixman-vmx.c
> >
> >> @@ -646,19 +643,10 @@ static force_inline uint32_t
> >>  combine1 (const uint32_t *ps, const uint32_t *pm)
> >>  {
> >>      uint32_t s = *ps;
> >> +    uint32_t a = ALPHA_8(*pm);
> >
> > pm is dereferenced before checked for NULL.
> >
> >>
> >>      if (pm)
> >> -    {
> >> -     vector unsigned int ms, mm;
> >> -
> >> -     mm = unpack_32_1x128 (*pm);
> >> -     mm = expand_alpha_1x128 (mm);
> >> -
> >> -     ms = unpack_32_1x128 (s);
> >> -     ms = pix_multiply (ms, mm);
> >> -
> >> -     s = pack_1x128_32 (ms);
> >> -    }
> >> +     UN8x4_MUL_UN8(s, a);
> >>
> >>      return s;
> >>  }
> >
> > Thanks,
> > pq
> 
> Thanks for catching that!

Indeed, that was a good catch.

The test suite does not detect this problem because the compiler
optimizes the memory access out. But if we disable optimizations
by setting CFLAGS to "-O0", the 'scaling-test' program segfaults.
So at least we don't have a test coverage problem here.
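
For illustration, here is a minimal standalone sketch of one way to avoid
the early dereference: read the mask alpha only inside the NULL check.
The helper mul_un8x4_by_un8 is a hypothetical stand-in for the
UN8x4_MUL_UN8 macro (its arithmetic is my assumption of what the macro
in pixman-combine32.h does), and combine1_checked is a made-up name,
not the actual follow-up patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for UN8x4_MUL_UN8: scales each of the four
 * 8-bit channels of x by a/255 with rounding, two channels at a time. */
static uint32_t
mul_un8x4_by_un8 (uint32_t x, uint32_t a)
{
    const uint32_t rb_mask = 0x00ff00ff;
    const uint32_t rb_half = 0x00800080;
    uint32_t rb, ag, t;

    /* red and blue channels */
    t = (x & rb_mask) * a + rb_half;
    rb = ((t + ((t >> 8) & rb_mask)) >> 8) & rb_mask;

    /* alpha and green channels, shifted down into the same positions */
    t = ((x >> 8) & rb_mask) * a + rb_half;
    ag = ((t + ((t >> 8) & rb_mask)) >> 8) & rb_mask;

    return rb | (ag << 8);
}

/* Sketch of combine1 with the mask read moved inside the NULL check. */
static uint32_t
combine1_checked (const uint32_t *ps, const uint32_t *pm)
{
    uint32_t s = *ps;

    if (pm)
    {
        /* Assumes alpha is the top byte, as ALPHA_8 does for a8r8g8b8. */
        uint32_t a = *pm >> 24;

        s = mul_un8x4_by_un8 (s, a);
    }

    return s;
}
```

With this shape, calling combine1_checked (&s, NULL) never touches pm,
so the result should not depend on the optimization level.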

-- 
Best regards,
Siarhei Siamashka

