[Pixman] [PATCH] vmx: implement fast path vmx_composite_over_n_8888
Oded Gabbay
oded.gabbay at gmail.com
Mon Sep 21 04:22:53 PDT 2015
On Thu, Sep 10, 2015 at 7:16 PM, Siarhei Siamashka
<siarhei.siamashka at gmail.com> wrote:
>
> On Thu, 10 Sep 2015 12:27:18 +0300
> Oded Gabbay <oded.gabbay at gmail.com> wrote:
>
> > On Sat, Sep 5, 2015 at 10:03 PM, Oded Gabbay <oded.gabbay at gmail.com> wrote:
> > >
> > > On Fri, Sep 4, 2015 at 3:39 PM, Siarhei Siamashka
> > > <siarhei.siamashka at gmail.com> wrote:
> > > > Running "lowlevel-blt-bench over_n_8888" on Playstation3 3.2GHz,
> > > > Gentoo ppc (32-bit userland) gave the following results:
> > > >
> > > > before: over_n_8888 = L1: 147.47 L2: 205.86 M:121.07
> > > > after: over_n_8888 = L1: 287.27 L2: 261.09 M:133.48
> > > >
> > > > Signed-off-by: Siarhei Siamashka <siarhei.siamashka at gmail.com>
> > > > ---
> > > > pixman/pixman-vmx.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > 1 files changed, 54 insertions(+), 0 deletions(-)
> > > >
> > > > diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
> > > > index a9bd024..9e551b3 100644
> > > > --- a/pixman/pixman-vmx.c
> > > > +++ b/pixman/pixman-vmx.c
> > > > @@ -2745,6 +2745,58 @@ vmx_composite_src_x888_8888 (pixman_implementation_t *imp,
> > > > }
> > > >
> > > > static void
> > > > +vmx_composite_over_n_8888 (pixman_implementation_t *imp,
> > > > + pixman_composite_info_t *info)
> > > > +{
> > > > + PIXMAN_COMPOSITE_ARGS (info);
> > > > + uint32_t *dst_line, *dst;
> > > > + uint32_t src, ia;
> > > > + int i, w, dst_stride;
> > > > + vector unsigned int vdst, vsrc, via;
> > > > +
> > > > + src = _pixman_image_get_solid (imp, src_image, dest_image->bits.format);
> > > > +
> > > > + if (src == 0)
> > > > + return;
> > > > +
> > > > + PIXMAN_IMAGE_GET_LINE (
> > > > + dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
> > > > +
> > > > + vsrc = (vector unsigned int){src, src, src, src};
> > > > + via = negate (splat_alpha (vsrc));
> > > If we use the over function (see my next comment), we need to
> > > remove the negate() from the above statement, as it is already
> > > done inside the over function.
> > >
> > > > + ia = ALPHA_8 (~src);
> > > > +
> > > > + while (height--)
> > > > + {
> > > > + dst = dst_line;
> > > > + dst_line += dst_stride;
> > > > + w = width;
> > > > +
> > > > + while (w && ((uintptr_t)dst & 15))
> > > > + {
> > > > + uint32_t d = *dst;
> > > > + UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> > > > + *dst++ = d;
> > > > + w--;
> > > > + }
> > > > +
> > > > + for (i = w / 4; i > 0; i--)
> > > > + {
> > > > + vdst = pix_multiply (load_128_aligned (dst), via);
> > > > + save_128_aligned (dst, pix_add (vsrc, vdst));
> > >
> > > Instead of the above two lines, I would simply use the over function
> > > in vmx, which does exactly that. So:
> > > vdst = over(vsrc, via, load_128_aligned(dst))
> > > save_128_aligned (dst, vdst);
> > >
> > > I prefer this as it reuses an existing function which helps
> > > maintainability, and using it has no impact on performance.
> > >
> > > > + dst += 4;
> > > > + }
> > > > +
> > > > + for (i = w % 4; --i >= 0;)
> > > > + {
> > > > + uint32_t d = dst[i];
> > > > + UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> > > > + dst[i] = d;
> > > > + }
> > > > + }
> > > > +}
> > > > +
> > > > +static void
> > > > vmx_composite_over_8888_8888 (pixman_implementation_t *imp,
> > > > pixman_composite_info_t *info)
> > > > {
> > > > @@ -3079,6 +3131,8 @@ FAST_NEAREST_MAINLOOP (vmx_8888_8888_normal_OVER,
> > > >
> > > > static const pixman_fast_path_t vmx_fast_paths[] =
> > > > {
> > > > + PIXMAN_STD_FAST_PATH (OVER, solid, null, a8r8g8b8, vmx_composite_over_n_8888),
> > > > + PIXMAN_STD_FAST_PATH (OVER, solid, null, x8r8g8b8, vmx_composite_over_n_8888),
> > > > PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, vmx_composite_over_8888_8888),
> > > > PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, vmx_composite_over_8888_8888),
> > > > PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, vmx_composite_over_8888_8888),
> > > > --
> > > > 1.7.8.6
> > > >
> > >
> > > Indeed, this implementation is much better than what I did.
> > > Apparently, a one-to-one conversion of the sse2 calls to vmx
> > > isn't the optimal way.
> > > On my POWER8 machine, I get:
> > >
> > > reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
> > > L1 572.29 1539.47 +169.00%
> > > L2 1038.08 1549.04 +49.22%
> > > M 1104.1 1522.22 +37.87%
> > > HT 447.45 676.32 +51.15%
> > > VT 520.82 764.82 +46.85%
> > > R 407.92 570.54 +39.87%
> > > RT 148.9 208.77 +40.21%
> > > Kops/s 1100 1418 +28.91%
> > >
> > > So, assuming the change above, this patch is:
> > >
> > > Reviewed-by: Oded Gabbay <oded.gabbay at gmail.com>
> >
> >
> > Hi Siarhei,
>
> Hi,
>
> > After I fixed my cairo setup (See
> > http://lists.freedesktop.org/archives/pixman/2015-September/003987.html),
>
> Interesting. How did it happen to be wrong in the first place? Is there
> anything missing or incorrect in the test scripts or usage instructions?
>
> > I went and re-tested your patch with cairo trimmed benchmark against
> > current pixman master.
> > Unfortunately, it gives a minor slowdown:
> >
> > Slowdowns
> > =========
> > t-firefox-scrolling 1232.30 -> 1295.75 : 1.05x slowdown
> >
> > even if I apply your patch over my latest patch-set (that was inspired
> > by your patch), I still get a slowdown, albeit in a different trace:
> >
> > Slowdowns
> > =========
> > t-firefox-asteroids 440.01 -> 469.68: 1.07x
> >
> > What's your take on this ?
>
> Are these results consistently reproducible across multiple runs?
> You can also try to set the cut-off threshold to 1% instead of the
> default 5% in the cairo-perf-diff-files tool:
>
> ./cairo-perf-diff-files --min-change 1% old.txt new.txt
>
> Anyway, this looks like the measurement accuracy may not be very good.
> As the readme at https://github.com/ssvb/trimmed-cairo-traces says, the
> trimmed traces had been trimmed to run faster on very low end hardware
> (such as the Raspberry Pi). Otherwise they would take many hours to
> complete. And if you are running benchmarks on a high end POWER8 box,
> then it probably makes sense to try the original traces from:
> http://cgit.freedesktop.org/cairo-traces/tree/benchmark
> I guess, right now the reported times are probably in the ballpark
> of a fraction of a second for each test on your hardware.
>
I took your advice and ran the benchmarks with the *non*-trimmed traces.
A full run takes 144 minutes on POWER8, 3.4GHz, 8 cores (a bare-metal
machine, not a VM).
I compared with and without this patch and got:
image ocitysmap 659.69 -> 611.71 : 1.08x speedup
image xfce4-terminal-a1 2725.22 -> 2547.47 : 1.07x speedup
So I guess we can merge this patch, because I trust the non-trimmed
results over the trimmed ones.
The significant improvement in lowlevel-blt-bench is also a good sign.
Therefore:
Reviewed-by: Oded Gabbay <oded.gabbay at gmail.com>
Oded
> Is the system really undisturbed during the test? You can also try to
> pin the execution to a single core via "taskset" and check if this
> changes anything.
>
> This whole issue definitely needs some investigation.
>
> Profiling the t-firefox-scrolling trace replay on my Playstation3:
>
> # perf report
> 32.10% cairo-perf-trac [kernel.kallsyms] [k] .raw_local_irq_restore
> 31.77% cairo-perf-trac libc-2.20.so [.] _wordcopy_fwd_aligned
> 16.01% cairo-perf-trac libpixman-1.so.0.33.3 [.] vmx_composite_over_n_8888_8888_ca
> 1.80% cairo-perf-trac [kernel.kallsyms] [k] .handle_mm_fault
> 0.74% cairo-perf-trac [kernel.kallsyms] [k] .unmap_vmas
> 0.68% cairo-perf-trac [kernel.kallsyms] [k] .do_page_fault
> 0.58% cairo-perf-trac libcairo.so.2.11200.12 [.] _cairo_scaled_font_glyph_device_extents
> 0.52% cairo-perf-trac [kernel.kallsyms] [k] .__alloc_pages_nodemask
> 0.49% cairo-perf-trac [kernel.kallsyms] [k] .do_raw_spin_lock
> 0.46% cairo-perf-trac libpixman-1.so.0.33.3 [.] pixman_composite_glyphs_no_mask
> 0.44% cairo-perf-trac [kernel.kallsyms] [k] .get_page_from_freelist
> 0.42% cairo-perf-trac [kernel.kallsyms] [k] .page_remove_rmap
>
> # perf report -d libpixman-1.so.0.33.3
> 87.37% cairo-perf-trac [.] vmx_composite_over_n_8888_8888_ca
> 2.53% cairo-perf-trac [.] pixman_composite_glyphs_no_mask
> 1.98% cairo-perf-trac [.] vmx_combine_over_u_no_mask
> 1.74% cairo-perf-trac [.] bits_image_fetch_bilinear_affine_pad_x8r8g8b8
> 1.57% cairo-perf-trac [.] vmx_fill
> 1.04% cairo-perf-trac [.] lookup_glyph
> 0.53% cairo-perf-trac [.] _pixman_image_get_solid
> 0.42% cairo-perf-trac [.] pixman_region32_rectangles
> 0.39% cairo-perf-trac [.] fast_composite_src_memcpy
> 0.30% cairo-perf-trac [.] hash
> 0.25% cairo-perf-trac [.] 00008000.got2.plt_pic32.memcpy@@GLIBC_2.0
> 0.23% cairo-perf-trac [.] pixman_glyph_cache_lookup
> 0.19% cairo-perf-trac [.] _pixman_image_validate
> 0.12% cairo-perf-trac [.] 00008000.got2.plt_pic32.pixman_region32_rectangles
> 0.11% cairo-perf-trac [.] pixman_image_create_solid_fill
> 0.10% cairo-perf-trac [.] vmx_combine_add_u_no_mask
> 0.10% cairo-perf-trac [.] pixman_unorm_to_float
>
> The vmx_composite_over_n_8888 fast path is not expected to make any
> measurable contribution to the results. The execution time is mostly
> dominated by memcpy from glibc and vmx_composite_over_n_8888_8888_ca
> from pixman.
>
> --
> Best regards,
> Siarhei Siamashka