[Pixman] [PATCH 1/4] pixman-fast-path: Add over_n_8888 fast path (disabled)

Oded Gabbay oded.gabbay at gmail.com
Tue Aug 25 08:29:59 PDT 2015

On Tue, Aug 25, 2015 at 5:02 PM, Ben Avison <bavison at riscosopen.org> wrote:
> On Tue, 25 Aug 2015 13:45:48 +0100, Oded Gabbay <oded.gabbay at gmail.com>
> wrote:
>>> [exposing general_composite_rect]
>>> I can't say that any cleaner solution has occurred to me since then.
>> I think the more immediate solution, as Soren has suggested on IRC,
>> is for me to implement the equivalent fast-path in VMX.
>> I see that it is already implemented in mmx, sse2, mips-dspr2 and
>> arm-neon. From looking at the C code, I'm guessing that it is fairly
>> simple to implement.
> Yes, it's definitely one of the simpler fast paths, with only two
> channels to worry about (source and destination) and with one of them
> being a constant. I wrote an arm-simd version as well, to add to your
> list - it's just that it's still waiting to be committed :)
> I probably ought to get round to exposing general_composite_rect sooner
> rather than later anyway - it's one of the few things from my mammoth
> patch series last year that Søren commented on and which I haven't got
> round to revising yet.
>>> I just had a quick look at the VMX source file, and it has hardly any
>>> iters defined. My guess would be that what's being used is
>>> noop_init_solid_narrow() from pixman-noop.c
>>> _pixman_iter_get_scanline_noop() from pixman-utils.c
>>> combine_src_u() from pixman-combine32.c
>> I ran perf on lowlevel-blt-bench over_n_8888 and what I got is:
>> -   48.71%    48.68%  lowlevel-blt-be  lowlevel-blt-bench  [.] vmx_combine_over_u_no_mask
>>    - vmx_combine_over_u_no_mask
> Sorry, my mistake - for some reason I must have thought we were dealing
> with src_n_8888 rather than over_n_8888. If you can beat the C version
> using a solid fetcher (which fills a temporary buffer the size of the row
> with a constant pixel value) and an optimised OVER combiner, then you
> should be able to do better still if you cut out the temporary buffer and
> keep the solid colour in registers.

I implemented over_n_8888 for vmx (adapted from sse2) and ran the
lowlevel benchmark. The results degraded in almost every test (on
POWER8, ppc64le):

reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
            before       after      change
L1          572.29       676.6     +18.23%
L2         1038.08      672.68     -35.20%
M           1104.1      682.63     -38.17%
HT          447.45      269.15     -39.85%
VT          520.82       357.1     -31.44%
R           407.92      259.46     -36.39%
RT           148.9      100.25     -32.67%
Kops/s        1100         910     -17.27%

So I'm not inclined to add this slow-path :)
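
For reference, here is a minimal scalar sketch of the two strategies Ben
describes: a solid fetcher that fills a row-sized temporary buffer
followed by a generic OVER combiner, versus a loop that keeps the solid
colour in a local variable. This is not pixman's code and not the VMX
patch; the function names (mul_un8, over_pixel, over_n_8888_via_buffer,
over_n_8888_direct) are made up for illustration, and it assumes
premultiplied a8r8g8b8 pixels. Both routines should produce identical
output; the second one only skips the extra pass over memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply two 8-bit values treated as fractions of 255, with rounding. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* OVER for one premultiplied a8r8g8b8 pixel:
 * result = src + dst * (255 - src_alpha) / 255, per channel. */
static uint32_t over_pixel(uint32_t src, uint32_t dst)
{
    uint8_t inv_alpha = (uint8_t)(255 - (src >> 24));
    uint32_t out = 0;

    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xff;
        uint32_t d = mul_un8((uint8_t)((dst >> shift) & 0xff), inv_alpha);
        uint32_t sum = s + d;

        if (sum > 255)
            sum = 255; /* saturate; only reachable with non-premultiplied input */
        out |= sum << shift;
    }
    return out;
}

/* Strategy 1 (hypothetical): fill a temporary row with the solid colour,
 * then run a generic combiner that reads the source from that row. */
static void over_n_8888_via_buffer(uint32_t solid, uint32_t *dst,
                                   uint32_t *tmp, size_t width)
{
    assert(tmp != NULL);

    for (size_t i = 0; i < width; i++)
        tmp[i] = solid;
    for (size_t i = 0; i < width; i++)
        dst[i] = over_pixel(tmp[i], dst[i]);
}

/* Strategy 2 (hypothetical): no temporary buffer; the solid colour stays
 * in a local, which a compiler can keep in a register. */
static void over_n_8888_direct(uint32_t solid, uint32_t *dst, size_t width)
{
    for (size_t i = 0; i < width; i++)
        dst[i] = over_pixel(solid, dst[i]);
}
```

A real SIMD version would process several pixels per iteration and would
normally special-case fully opaque and fully transparent sources, which
this sketch leaves out.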


>>> Presumably for patch 3 of this series (over_n_0565) you wouldn't see
>>> the same effect, as that can't be achieved using memcpy().
>> Where is that patch? I didn't see it on the mailing list.
> My bad again - in my mind, the patches for over_n_8888 and over_n_0565 in
> C and ARMv6 assembly were a group of four and I overlooked the fact that
> when Pekka split them in order to make the benchmarks more robust, he
> only reposted the over_n_8888 ones. My original over_n_0565 patches are
> here:
> http://patchwork.freedesktop.org/patch/49902/
> http://patchwork.freedesktop.org/patch/49903/
> Ben

