[Pixman] [PATCH 1/4] pixman-fast-path: Add over_n_8888 fast path (disabled)

Fri Sep 4 13:36:41 PDT 2015

On Tue, 25 Aug 2015 18:29:59 +0300
Oded Gabbay <oded.gabbay at gmail.com> wrote:

> On Tue, Aug 25, 2015 at 5:02 PM, Ben Avison <bavison at riscosopen.org> wrote:
> > On Tue, 25 Aug 2015 13:45:48 +0100, Oded Gabbay <oded.gabbay at gmail.com>
> > wrote:
> >>>
> >>> [exposing general_composite_rect]
> >>> I can't say that any cleaner solution has occurred to me since then.
> >>
> >>
> >> I think the more immediate solution, as Soren has suggested on IRC,
> >> is for me to implement the equivalent fast-path in VMX.
> >> I see that it is already implemented in mmx, sse2, mips-dspr2 and
> >> arm-neon. From looking at the C code, I'm guessing that it is fairly
> >> simple to implement.
> >
> >
> > Yes, it's definitely one of the simpler fast paths, with only two
> > channels to worry about (source and destination) and with one of them
> > being a constant. I wrote an arm-simd version as well, to add to your
> > list - it's just that it's still waiting to be committed :)
> >
> > I probably ought to get round to exposing general_composite_rect sooner
> > rather than later anyway - it's one of the few things from my mammoth
> > patch series last year that Søren commented on and which I haven't got
> > round to revising yet.
> >
> >>> I just had a quick look at the VMX source file, and it has hardly any
> >>> iters defined. My guess would be that what's being used is
> >>>
> >>> noop_init_solid_narrow() from pixman-noop.c
> >>> _pixman_iter_get_scanline_noop() from pixman-utils.c
> >>> combine_src_u() from pixman-combine32.c
> >>>
> >> I ran perf on lowlevel-blt-bench over_n_8888 and what I got is:
> >>
> >> -   48.71%    48.68%  lowlevel-blt-be  lowlevel-blt-bench  [.] vmx_combine_over_u_no_mask
> >>    - vmx_combine_over_u_no_mask
> >
> >
> > Sorry, my mistake - for some reason I must have thought we were dealing
> > with src_n_8888 rather than over_n_8888. If you can beat the C version
> > using a solid fetcher (which fills a temporary buffer the size of the row
> > with a constant pixel value) and an optimised OVER combiner, then you
> > should be able to do better still if you cut out the temporary buffer and
> > keep the solid colour in registers.
> >
> 
> I implemented over_n_8888 for vmx (adapted from sse2) and ran the
> lowlevel benchmark. I got a degradation in almost all the benches (on
> POWER8, ppc64le):
> 
> reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
>             before     after    change
> L1          572.29    676.6    +18.23%
> L2         1038.08    672.68   -35.20%
> M          1104.1     682.63   -38.17%
> HT          447.45    269.15   -39.85%
> VT          520.82    357.1    -31.44%
> R           407.92    259.46   -36.39%
> RT          148.9     100.25   -32.67%
> Kops/s     1100       910      -17.27%
> 
> so I'm not inclined to add this slow-path :)

If it is slower, then you are probably implementing it wrong :)

Please try http://patchwork.freedesktop.org/patch/58669/
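
For reference, the operation discussed above (OVER with a solid source
onto a8r8g8b8) can be sketched in plain scalar C as below. This is only
an illustration, assuming premultiplied alpha; it is not taken from
pixman or from the patch linked above, and the names over_solid_row and
mul_un8 are hypothetical. It shows the point Ben made: the solid colour
and its inverse alpha are computed once and kept in locals, with no
temporary row buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply two 8-bit values where 255 represents 1.0, with rounding. */
static inline uint32_t mul_un8(uint32_t a, uint32_t b)
{
    uint32_t t = a * b + 0x80;
    return (t + (t >> 8)) >> 8;
}

/*
 * Hypothetical helper (not a pixman function): composite a solid,
 * premultiplied a8r8g8b8 colour over a row of a8r8g8b8 pixels.
 * Per channel: dst = src + dst * (255 - src_alpha) / 255
 */
static void over_solid_row(uint32_t *dst, size_t width, uint32_t src)
{
    /* Loop-invariant values: computed once, kept in registers. */
    const uint32_t inv_alpha = 255 - (src >> 24);
    size_t i;
    int shift;

    if (src == 0)
        return; /* fully transparent source leaves dst unchanged */

    for (i = 0; i < width; i++) {
        uint32_t d = dst[i];
        uint32_t out = 0;

        for (shift = 0; shift < 32; shift += 8) {
            uint32_t s = (src >> shift) & 0xff;
            uint32_t c = s + mul_un8((d >> shift) & 0xff, inv_alpha);

            if (c > 255)
                c = 255; /* saturate, in case src is not premultiplied */
            out |= c << shift;
        }
        dst[i] = out;
    }
}
```

A SIMD version would do the same thing several pixels at a time, with
the source and inverse alpha replicated into vector registers outside
the loop.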

-- 
Best regards,
Siarhei Siamashka