[Pixman] [PATCH] vmx: implement fast path vmx_composite_over_n_8888

Siarhei Siamashka siarhei.siamashka at gmail.com
Thu Sep 10 09:16:40 PDT 2015


On Thu, 10 Sep 2015 12:27:18 +0300
Oded Gabbay <oded.gabbay at gmail.com> wrote:

> On Sat, Sep 5, 2015 at 10:03 PM, Oded Gabbay <oded.gabbay at gmail.com> wrote:
> >
> > On Fri, Sep 4, 2015 at 3:39 PM, Siarhei Siamashka
> > <siarhei.siamashka at gmail.com> wrote:
> > > Running "lowlevel-blt-bench over_n_8888" on Playstation3 3.2GHz,
> > > Gentoo ppc (32-bit userland) gave the following results:
> > >
> > > before:  over_n_8888 =  L1: 147.47  L2: 205.86  M:121.07
> > > after:   over_n_8888 =  L1: 287.27  L2: 261.09  M:133.48
> > >
> > > Signed-off-by: Siarhei Siamashka <siarhei.siamashka at gmail.com>
> > > ---
> > >  pixman/pixman-vmx.c |   54 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 files changed, 54 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
> > > index a9bd024..9e551b3 100644
> > > --- a/pixman/pixman-vmx.c
> > > +++ b/pixman/pixman-vmx.c
> > > @@ -2745,6 +2745,58 @@ vmx_composite_src_x888_8888 (pixman_implementation_t *imp,
> > >  }
> > >
> > >  static void
> > > +vmx_composite_over_n_8888 (pixman_implementation_t *imp,
> > > +                           pixman_composite_info_t *info)
> > > +{
> > > +    PIXMAN_COMPOSITE_ARGS (info);
> > > +    uint32_t *dst_line, *dst;
> > > +    uint32_t src, ia;
> > > +    int      i, w, dst_stride;
> > > +    vector unsigned int vdst, vsrc, via;
> > > +
> > > +    src = _pixman_image_get_solid (imp, src_image, dest_image->bits.format);
> > > +
> > > +    if (src == 0)
> > > +       return;
> > > +
> > > +    PIXMAN_IMAGE_GET_LINE (
> > > +       dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
> > > +
> > > +    vsrc = (vector unsigned int){src, src, src, src};
> > > +    via = negate (splat_alpha (vsrc));
> > If we use the over function (see my next comment), we need to
> > remove the negate() from the statement above, as it is already done
> > inside the over function.
> >
> > > +    ia = ALPHA_8 (~src);
> > > +
> > > +    while (height--)
> > > +    {
> > > +       dst = dst_line;
> > > +       dst_line += dst_stride;
> > > +       w = width;
> > > +
> > > +       while (w && ((uintptr_t)dst & 15))
> > > +       {
> > > +           uint32_t d = *dst;
> > > +           UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> > > +           *dst++ = d;
> > > +           w--;
> > > +       }
> > > +
> > > +       for (i = w / 4; i > 0; i--)
> > > +       {
> > > +           vdst = pix_multiply (load_128_aligned (dst), via);
> > > +           save_128_aligned (dst, pix_add (vsrc, vdst));
> >
> > Instead of the above two lines, I would simply use the over function
> > in vmx, which does exactly that. So:
> >                 vdst = over(vsrc, via, load_128_aligned(dst))
> >                 save_128_aligned (dst, vdst);
> >
> > I prefer this as it reuses an existing function, which helps
> > maintainability, and using it has no impact on performance.
> >
> > > +           dst += 4;
> > > +       }
> > > +
> > > +       for (i = w % 4; --i >= 0;)
> > > +       {
> > > +           uint32_t d = dst[i];
> > > +           UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> > > +           dst[i] = d;
> > > +       }
> > > +    }
> > > +}
> > > +
> > > +static void
> > >  vmx_composite_over_8888_8888 (pixman_implementation_t *imp,
> > >                                 pixman_composite_info_t *info)
> > >  {
> > > @@ -3079,6 +3131,8 @@ FAST_NEAREST_MAINLOOP (vmx_8888_8888_normal_OVER,
> > >
> > >  static const pixman_fast_path_t vmx_fast_paths[] =
> > >  {
> > > +    PIXMAN_STD_FAST_PATH (OVER, solid,    null, a8r8g8b8, vmx_composite_over_n_8888),
> > > +    PIXMAN_STD_FAST_PATH (OVER, solid,    null, x8r8g8b8, vmx_composite_over_n_8888),
> > >      PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, vmx_composite_over_8888_8888),
> > >      PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, vmx_composite_over_8888_8888),
> > >      PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, vmx_composite_over_8888_8888),
> > > --
> > > 1.7.8.6
> > >
> >
> > Indeed, this implementation is much better than what I did.
> > Apparently, converting sse2 to vmx calls isn't the optimal way.
> > On my POWER8 machine, I get:
> >
> > reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
> > L1        572.29    1539.47   +169.00%
> > L2       1038.08    1549.04    +49.22%
> > M        1104.1     1522.22    +37.87%
> > HT        447.45     676.32    +51.15%
> > VT        520.82     764.82    +46.85%
> > R         407.92     570.54    +39.87%
> > RT        148.9      208.77    +40.21%
> > Kops/s   1100       1418       +28.91%
> >
> > So, assuming the change above, this patch is:
> >
> > Reviewed-by: Oded Gabbay <oded.gabbay at gmail.com>
> 
> 
> Hi Siarhei,

Hi,
 
> After I fixed my cairo setup (See
> http://lists.freedesktop.org/archives/pixman/2015-September/003987.html),

Interesting. How did it happen to be wrong in the first place? Is there
anything missing or incorrect in the test scripts or usage instructions?

> I went and re-tested your patch with cairo trimmed benchmark against
> current pixman master.
> Unfortunately, it gives a minor slowdown:
> 
> Slowdowns
> =========
> t-firefox-scrolling  1232.30 -> 1295.75 :  1.05x slowdown
> 
> even if I apply your patch over my latest patch-set (that was inspired
> by your patch), I still get a slowdown, albeit in a different trace:
> 
> Slowdowns
> =========
> t-firefox-asteroids  440.01 -> 469.68:  1.07x
> 
> What's your take on this ?

Are these results consistently reproducible across multiple runs?
You can also try to set the cut-off threshold to 1% instead of the
default 5% in the cairo-perf-diff-files tool:

   ./cairo-perf-diff-files --min-change 1% old.txt new.txt

Anyway, this looks like the measurement accuracy may not be very good.
As the readme at https://github.com/ssvb/trimmed-cairo-traces says, the
traces were trimmed so that they complete reasonably fast even on very
low end hardware (such as the Raspberry Pi); otherwise they would take
many hours to run there. Since you are benchmarking on a high end POWER8
box, it probably makes sense to try the original traces from:
    http://cgit.freedesktop.org/cairo-traces/tree/benchmark
I guess that right now each test probably finishes in just a fraction
of a second on your hardware.

Is the system really undisturbed during the test? You can also try to
pin the execution to a single core via "taskset" and check if this
changes anything.
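For example (the trace file path below is only an illustration,
substitute whatever you are actually replaying):

```shell
# Pin execution to CPU core 0 to reduce scheduler noise; works with any
# command, e.g. (the trace path here is an assumed example):
#   taskset -c 0 ./cairo-perf-trace benchmark/firefox-scrolling.trace
# A quick way to verify that pinning works at all on your system:
taskset -c 0 echo pinned
```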

This whole issue definitely needs some investigation.

Profiling the t-firefox-scrolling trace replay on my Playstation3:

# perf report
    32.10%  cairo-perf-trac  [kernel.kallsyms]                          [k] .raw_local_irq_restore
    31.77%  cairo-perf-trac  libc-2.20.so                               [.] _wordcopy_fwd_aligned
    16.01%  cairo-perf-trac  libpixman-1.so.0.33.3                      [.] vmx_composite_over_n_8888_8888_ca
     1.80%  cairo-perf-trac  [kernel.kallsyms]                          [k] .handle_mm_fault
     0.74%  cairo-perf-trac  [kernel.kallsyms]                          [k] .unmap_vmas
     0.68%  cairo-perf-trac  [kernel.kallsyms]                          [k] .do_page_fault
     0.58%  cairo-perf-trac  libcairo.so.2.11200.12                     [.] _cairo_scaled_font_glyph_device_extents
     0.52%  cairo-perf-trac  [kernel.kallsyms]                          [k] .__alloc_pages_nodemask
     0.49%  cairo-perf-trac  [kernel.kallsyms]                          [k] .do_raw_spin_lock
     0.46%  cairo-perf-trac  libpixman-1.so.0.33.3                      [.] pixman_composite_glyphs_no_mask
     0.44%  cairo-perf-trac  [kernel.kallsyms]                          [k] .get_page_from_freelist
     0.42%  cairo-perf-trac  [kernel.kallsyms]                          [k] .page_remove_rmap

# perf report -d libpixman-1.so.0.33.3
    87.37%  cairo-perf-trac  [.] vmx_composite_over_n_8888_8888_ca
     2.53%  cairo-perf-trac  [.] pixman_composite_glyphs_no_mask
     1.98%  cairo-perf-trac  [.] vmx_combine_over_u_no_mask
     1.74%  cairo-perf-trac  [.] bits_image_fetch_bilinear_affine_pad_x8r8g8b8
     1.57%  cairo-perf-trac  [.] vmx_fill
     1.04%  cairo-perf-trac  [.] lookup_glyph
     0.53%  cairo-perf-trac  [.] _pixman_image_get_solid
     0.42%  cairo-perf-trac  [.] pixman_region32_rectangles
     0.39%  cairo-perf-trac  [.] fast_composite_src_memcpy
     0.30%  cairo-perf-trac  [.] hash
     0.25%  cairo-perf-trac  [.] 00008000.got2.plt_pic32.memcpy@@GLIBC_2.0
     0.23%  cairo-perf-trac  [.] pixman_glyph_cache_lookup
     0.19%  cairo-perf-trac  [.] _pixman_image_validate
     0.12%  cairo-perf-trac  [.] 00008000.got2.plt_pic32.pixman_region32_rectangles
     0.11%  cairo-perf-trac  [.] pixman_image_create_solid_fill
     0.10%  cairo-perf-trac  [.] vmx_combine_add_u_no_mask
     0.10%  cairo-perf-trac  [.] pixman_unorm_to_float

The vmx_composite_over_n_8888 fast path is not expected to make any
measurable contribution to the results. The execution time is mostly
dominated by memcpy from glibc and vmx_composite_over_n_8888_8888_ca
from pixman.
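For reference, here is a scalar sketch (not pixman's actual code) of the
per-pixel math that such an over_n_8888 fast path computes. With a
premultiplied solid source s, the OVER operator reduces per 8-bit channel
to d' = s + d * (255 - s.alpha) / 255, which is what the
UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src) invocation in the patch does; the
helper names below are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* mul_un8 mimics the rounding used by pixman's UN8_MUL style macros:
 * add 0x80, then fold with (t + (t >> 8)) >> 8 to divide by 255. */
static uint8_t
mul_un8 (uint8_t a, uint8_t b)
{
    uint16_t t = (uint16_t) a * b + 0x80;
    return (uint8_t) ((t + (t >> 8)) >> 8);
}

/* OVER of a premultiplied solid source onto one a8r8g8b8 pixel. */
static uint32_t
over_n_8888_pixel (uint32_t src, uint32_t dst)
{
    uint8_t  ia = 255 - (uint8_t) (src >> 24); /* inverse source alpha */
    uint32_t d  = 0;
    int      shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint8_t s = (uint8_t) (src >> shift);
        uint8_t c = mul_un8 ((uint8_t) (dst >> shift), ia);
        /* channels are premultiplied, so s + c cannot exceed 255 */
        d |= (uint32_t) (uint8_t) (s + c) << shift;
    }
    return d;
}
```

An opaque source (alpha 0xff) simply replaces the destination, and a
zero source leaves the destination unchanged, which is why the src == 0
early return in the patch is safe.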

-- 
Best regards,
Siarhei Siamashka

