[Pixman] [PATCH 1/4] pixman-fast-path: Add over_n_8888 fast path (disabled)
Siarhei Siamashka
siarhei.siamashka at gmail.com
Fri Sep 4 13:36:41 PDT 2015
On Tue, 25 Aug 2015 18:29:59 +0300
Oded Gabbay <oded.gabbay at gmail.com> wrote:
> On Tue, Aug 25, 2015 at 5:02 PM, Ben Avison <bavison at riscosopen.org> wrote:
> > On Tue, 25 Aug 2015 13:45:48 +0100, Oded Gabbay <oded.gabbay at gmail.com>
> > wrote:
> >>>
> >>> [exposing general_composite_rect]
> >>> I can't say that any cleaner solution has occurred to me since then.
> >>
> >>
> >> I think the more immediate solution, as Søren has suggested on IRC,
> >> is for me to implement the equivalent fast path in VMX.
> >> I see that it is already implemented in mmx, sse2, mips-dspr2 and
> >> arm-neon. From looking at the C code, I'm guessing that it is fairly
> >> simple to implement.
> >
> >
> > Yes, it's definitely one of the simpler fast paths, with only two
> > channels to worry about (source and destination) and with one of them
> > being a constant. I wrote an arm-simd version as well, to add to your
> > list - it's just that it's still waiting to be committed :)
> >
> > I probably ought to get round to exposing general_composite_rect sooner
> > rather than later anyway - it's one of the few things from my mammoth
> > patch series last year that Søren commented on and which I haven't got
> > round to revising yet.
> >
> >>> I just had a quick look at the VMX source file, and it has hardly any
> >>> iters defined. My guess would be that what's being used is
> >>>
> >>> noop_init_solid_narrow() from pixman-noop.c
> >>> _pixman_iter_get_scanline_noop() from pixman-utils.c
> >>> combine_src_u() from pixman-combine32.c
> >>>
> >> I ran perf on lowlevel-blt-bench over_n_8888 and what I got is:
> >>
> >> - 48.71% 48.68% lowlevel-blt-be lowlevel-blt-bench [.] vmx_combine_over_u_no_mask
> >> - vmx_combine_over_u_no_mask
> >
> >
> > Sorry, my mistake - for some reason I must have thought we were dealing
> > with src_n_8888 rather than over_n_8888. If you can beat the C version
> > using a solid fetcher (which fills a temporary buffer the size of the row
> > with a constant pixel value) and an optimised OVER combiner, then you
> > should be able to do better still if you cut out the temporary buffer and
> > keep the solid colour in registers.
> >
>
> I implemented over_n_8888 for vmx (adapted from sse2) and ran the
> lowlevel benchmark. I got degradation in almost all the benches (on
> POWER8, ppc64le):
>
> reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
>
>            before     after    change
> L1         572.29    676.6    +18.23%
> L2        1038.08    672.68   -35.20%
> M         1104.1     682.63   -38.17%
> HT         447.45    269.15   -39.85%
> VT         520.82    357.1    -31.44%
> R          407.92    259.46   -36.39%
> RT         148.9     100.25   -32.67%
> Kops/s    1100       910      -17.27%
>
> so I'm not inclined to add this slow-path :)
If it is slower, then you are probably implementing it wrong :)
Please try http://patchwork.freedesktop.org/patch/58669/
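
To illustrate the point Ben makes above about cutting out the temporary
buffer, here is a rough scalar C sketch. It is not pixman's actual code and
not the patch linked above: the names mul_un8, over_pixel,
over_n_8888_generic and over_n_8888_fast are made up for this example, and
pixels are assumed to be premultiplied a8r8g8b8. The first function mimics
the generic route (solid fetcher into a row-sized buffer, then an OVER
combiner); the second keeps the solid colour in a local variable.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounded x * a / 255 for one 8-bit channel. */
static uint8_t mul_un8(uint8_t x, uint8_t a)
{
    uint32_t t = (uint32_t)x * a + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Porter-Duff OVER for one premultiplied pixel:
 * result = src + dst * (255 - src_alpha) / 255, per channel, saturated. */
static uint32_t over_pixel(uint32_t src, uint32_t dst)
{
    uint8_t ia = (uint8_t)(255 - (src >> 24));
    uint32_t result = 0;

    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xff;
        uint32_t d = mul_un8((uint8_t)((dst >> shift) & 0xff), ia);
        uint32_t sum = s + d;
        if (sum > 255)
            sum = 255;
        result |= sum << shift;
    }
    return result;
}

/* Generic route: a "solid fetcher" fills a temporary row with the
 * constant colour, then a combiner reads that row back from memory. */
static void over_n_8888_generic(uint32_t src, uint32_t *dst,
                                uint32_t *tmp, size_t width)
{
    for (size_t i = 0; i < width; i++)
        tmp[i] = src;
    for (size_t i = 0; i < width; i++)
        dst[i] = over_pixel(tmp[i], dst[i]);
}

/* Dedicated route: no temporary row; the colour stays in a local
 * (ideally a register) for the whole scanline. */
static void over_n_8888_fast(uint32_t src, uint32_t *dst, size_t width)
{
    if (src == 0)   /* fully transparent source leaves dst unchanged */
        return;
    for (size_t i = 0; i < width; i++)
        dst[i] = over_pixel(src, dst[i]);
}
```

Both functions produce the same pixels; the second simply avoids writing
and re-reading the row-sized buffer. A vectorised version would presumably
also hoist the expanded source and inverse alpha out of the loop, which
this scalar sketch does not attempt to show.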
--
Best regards,
Siarhei Siamashka