[Pixman] [PATCH] vmx: implement fast path vmx_composite_over_n_8888

Oded Gabbay oded.gabbay at gmail.com
Tue Sep 29 04:24:17 PDT 2015


On Mon, Sep 28, 2015 at 8:52 AM, Siarhei Siamashka
<siarhei.siamashka at gmail.com> wrote:
> On Tue, 22 Sep 2015 16:03:22 +0300
> Oded Gabbay <oded.gabbay at gmail.com> wrote:
>
>> On Tue, Sep 22, 2015 at 3:59 PM, Pekka Paalanen <ppaalanen at gmail.com> wrote:
>> > On Mon, 21 Sep 2015 14:22:53 +0300
>> > Oded Gabbay <oded.gabbay at gmail.com> wrote:
>> >
>> >> On Thu, Sep 10, 2015 at 7:16 PM, Siarhei Siamashka
>> >> <siarhei.siamashka at gmail.com> wrote:
>> >> >
>> >> > On Thu, 10 Sep 2015 12:27:18 +0300
>> >> > Oded Gabbay <oded.gabbay at gmail.com> wrote:
>> >> >
>> >> > > On Sat, Sep 5, 2015 at 10:03 PM, Oded Gabbay <oded.gabbay at gmail.com> wrote:
>> >> > > >
>> >> > > > On Fri, Sep 4, 2015 at 3:39 PM, Siarhei Siamashka
>> >> > > > <siarhei.siamashka at gmail.com> wrote:
>> >> > > > > Running "lowlevel-blt-bench over_n_8888" on Playstation3 3.2GHz,
>> >> > > > > Gentoo ppc (32-bit userland) gave the following results:
>> >> > > > >
>> >> > > > > before:  over_n_8888 =  L1: 147.47  L2: 205.86  M:121.07
>> >> > > > > after:   over_n_8888 =  L1: 287.27  L2: 261.09  M:133.48
>> >> > > > >
>> >> > > > > Signed-off-by: Siarhei Siamashka <siarhei.siamashka at gmail.com>
>> >> > > > > ---
>> >> > > > >  pixman/pixman-vmx.c |   54 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> >> > > > >  1 files changed, 54 insertions(+), 0 deletions(-)
>> >> > > > >
>> >> > > > > diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
>> >> > > > > index a9bd024..9e551b3 100644
>> >> > > > > --- a/pixman/pixman-vmx.c
>> >> > > > > +++ b/pixman/pixman-vmx.c
>> >> > > > > @@ -2745,6 +2745,58 @@ vmx_composite_src_x888_8888 (pixman_implementation_t *imp,
>> >> > > > >  }
>> >> > > > >
>> >> > > > >  static void
>> >> > > > > +vmx_composite_over_n_8888 (pixman_implementation_t *imp,
>> >> > > > > +                           pixman_composite_info_t *info)
>> >> > > > > +{
>> >> > > > > +    PIXMAN_COMPOSITE_ARGS (info);
>> >> > > > > +    uint32_t *dst_line, *dst;
>> >> > > > > +    uint32_t src, ia;
>> >> > > > > +    int      i, w, dst_stride;
>> >> > > > > +    vector unsigned int vdst, vsrc, via;
>> >> > > > > +
>> >> > > > > +    src = _pixman_image_get_solid (imp, src_image, dest_image->bits.format);
>> >> > > > > +
>> >> > > > > +    if (src == 0)
>> >> > > > > +       return;
>> >> > > > > +
>> >> > > > > +    PIXMAN_IMAGE_GET_LINE (
>> >> > > > > +       dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
>> >> > > > > +
>> >> > > > > +    vsrc = (vector unsigned int){src, src, src, src};
>> >> > > > > +    via = negate (splat_alpha (vsrc));
>> >> > > > If we will use the over function (see my next comment), we need to
>> >> > > > remove the negate() from the above statement, as it is done in the
>> >> > > > over function.
>> >> > > >
>> >> > > > > +    ia = ALPHA_8 (~src);
>> >> > > > > +
>> >> > > > > +    while (height--)
>> >> > > > > +    {
>> >> > > > > +       dst = dst_line;
>> >> > > > > +       dst_line += dst_stride;
>> >> > > > > +       w = width;
>> >> > > > > +
>> >> > > > > +       while (w && ((uintptr_t)dst & 15))
>> >> > > > > +       {
>> >> > > > > +           uint32_t d = *dst;
>> >> > > > > +           UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
>> >> > > > > +           *dst++ = d;
>> >> > > > > +           w--;
>> >> > > > > +       }
>> >> > > > > +
>> >> > > > > +       for (i = w / 4; i > 0; i--)
>> >> > > > > +       {
>> >> > > > > +           vdst = pix_multiply (load_128_aligned (dst), via);
>> >> > > > > +           save_128_aligned (dst, pix_add (vsrc, vdst));
>> >> > > >
>> >> > > > Instead of the above two lines, I would simply use the over function
>> >> > > > in vmx, which does exactly that. So:
>> >> > > >                 vdst = over(vsrc, via, load_128_aligned(dst))
>> >> > > >                 save_128_aligned (dst, vdst);
>> >> > > >
>> >> > > > I prefer this as it reuses an existing function which helps
>> >> > > > maintainability, and using it has no impact on performance.
>
> The only difference between these variants is that my variant of the
> patch explicitly moved the calculation of the negated alpha out of the
> loop and is a more accurate representation of the assembly code, which
> is expected to be generated. Your variant relies on the compiler to
> do this job properly. As far as the readability/maintainability is
> concerned, there is no significant difference. There is the same
> number of lines of code. Also my variant of the patch is handling
> the main loop and the leading/trailing pixels in a more uniform
> way. Just compare
>
>     via = negate (splat_alpha (vsrc));
> with
>     ia = ALPHA_8 (~src);
>
> and
>
>     vdst = pix_multiply (load_128_aligned (dst), via);
>     save_128_aligned (dst, pix_add (vsrc, vdst));
> with
>     uint32_t d = *dst;
>     UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
>     *dst++ = d;
>
>> >> > > > > +           dst += 4;
>> >> > > > > +       }
>> >> > > > > +
>> >> > > > > +       for (i = w % 4; --i >= 0;)
>> >> > > > > +       {
>> >> > > > > +           uint32_t d = dst[i];
>> >> > > > > +           UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
>> >> > > > > +           dst[i] = d;
>> >> > > > > +       }
>> >> > > > > +    }
>> >> > > > > +}
>> >> > > > > +
>> >> > > > > +static void
>> >> > > > >  vmx_composite_over_8888_8888 (pixman_implementation_t *imp,
>> >> > > > >                                 pixman_composite_info_t *info)
>> >> > > > >  {
>> >> > > > > @@ -3079,6 +3131,8 @@ FAST_NEAREST_MAINLOOP (vmx_8888_8888_normal_OVER,
>> >> > > > >
>> >> > > > >  static const pixman_fast_path_t vmx_fast_paths[] =
>> >> > > > >  {
>> >> > > > > +    PIXMAN_STD_FAST_PATH (OVER, solid,    null, a8r8g8b8, vmx_composite_over_n_8888),
>> >> > > > > +    PIXMAN_STD_FAST_PATH (OVER, solid,    null, x8r8g8b8, vmx_composite_over_n_8888),
>> >> > > > >      PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, vmx_composite_over_8888_8888),
>> >> > > > >      PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, vmx_composite_over_8888_8888),
>> >> > > > >      PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, vmx_composite_over_8888_8888),
>> >> > > > > --
>> >> > > > > 1.7.8.6
>> >> > > > >
>> >
>> >>
>> >> I took your advice and run benchmarks with the *non* trimmed traces.
>> >> It takes 144 minutes to run a benchmark on POWER8, 3.4GHz 8 Cores (a
>> >> raw machine, not VM).
>> >>
>> >> I compared with and without this patch and I got:
>> >>
>> >> image          ocitysmap             659.69    -> 611.71    :  1.08x speedup
>> >> image          xfce4-terminal-a1  2725.22 -> 2547.47  :  1.07x speedup
>> >>
>> >> So I guess we can merge this patch, because I prefer the non-trimmed
>> >> results over the trimmed ones.
>> >> The low-level-blt giving significant improvement is also a good sign.
>> >>
>> >> Therefore:
>> >> Reviewed-by: Oded Gabbay <oded.gabbay at gmail.com>
>> >
>> > Hi,
>> >
>> > this is only touching pixman-vmx.c and I don't see anything to complain
>> > about here, so as long as you're happy,
>> > Acked-by Pekka Paalanen <pekka.paalanen at collabora.co.uk>
>> >
>> > Did you overrule your own comments? :-)
>>
>> Oh, right.
>> Now I see I had made some comments above about using existing helper function.
>>
>> Siarhei,
>>
>> Do you object that I will replace the lines above with call to the
>> existing helper function, before pushing this patch ?
>
> I have no objections. Both variants are perfectly fine for me and you
> are the one, who is taking care of the powerpc code.
>
> --
> Best regards,
> Siarhei Siamashka

I pushed your version, as I liked moving the negate out of the main loop:

2876d8d..90e62c0  master -> master

     Oded


More information about the Pixman mailing list