[Pixman] [PATCH 1/4] pixman-fast-path: Add over_n_8888 fast path (disabled)
Oded Gabbay
oded.gabbay at gmail.com
Tue Aug 25 08:29:59 PDT 2015
On Tue, Aug 25, 2015 at 5:02 PM, Ben Avison <bavison at riscosopen.org> wrote:
> On Tue, 25 Aug 2015 13:45:48 +0100, Oded Gabbay <oded.gabbay at gmail.com>
> wrote:
>>>
>>> [exposing general_composite_rect]
>>> I can't say that any cleaner solution has occurred to me since then.
>>
>>
>> I think the more immediate solution, as Soren has suggested on IRC,
>> is for me to implement the equivalent fast-path in VMX.
>> I see that it is already implemented in mmx, sse2, mips-dspr2 and
>> arm-neon. From looking at the C code, I'm guessing that it is fairly
>> simple to implement.
>
>
> Yes, it's definitely one of the simpler fast paths, with only two
> channels to worry about (source and destination) and with one of them
> being a constant. I wrote an arm-simd version as well, to add to your
> list - it's just that it's still waiting to be committed :)
>
> I probably ought to get round to exposing general_composite_rect sooner
> rather than later anyway - it's one of the few things from my mammoth
> patch series last year that Søren commented on and which I haven't got
> round to revising yet.
>
>>> I just had a quick look at the VMX source file, and it has hardly any
>>> iters defined. My guess would be that what's being used is
>>>
>>> noop_init_solid_narrow() from pixman-noop.c
>>> _pixman_iter_get_scanline_noop() from pixman-utils.c
>>> combine_src_u() from pixman-combine32.c
>>>
>> I ran perf on lowlevel-blt-bench over_n_8888 and what I got is:
>>
>> - 48.71% 48.68% lowlevel-blt-be lowlevel-blt-bench [.] vmx_combine_over_u_no_mask
>> - vmx_combine_over_u_no_mask
>
>
> Sorry, my mistake - for some reason I must have thought we were dealing
> with src_n_8888 rather than over_n_8888. If you can beat the C version
> using a solid fetcher (which fills a temporary buffer the size of the row
> with a constant pixel value) and an optimised OVER combiner, then you
> should be able to do better still if you cut out the temporary buffer and
> keep the solid colour in registers.
>
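To make the two routes described above concrete, here is a minimal scalar sketch in C. It is not pixman's code: the names mul_un8, over_pixel, fetch_solid, combine_over, over_n_8888_general and over_n_8888_direct are made up for illustration, and the pixel format is assumed to be premultiplied a8r8g8b8. The "general" route fills a scratch row with the solid colour and then runs an OVER combiner over it; the "direct" route skips the scratch row and keeps the colour in a local variable, which is what a SIMD fast path would hold in registers.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounded (a * b) / 255 for two 8-bit values. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* OVER for one premultiplied a8r8g8b8 pixel:
 * result = src + dst * (255 - src_alpha) / 255, per channel, saturated. */
static uint32_t over_pixel(uint32_t src, uint32_t dst)
{
    uint8_t inv_alpha = (uint8_t)(255 - (src >> 24));
    uint32_t result = 0;

    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xff;
        uint32_t d = mul_un8((uint8_t)((dst >> shift) & 0xff), inv_alpha);
        uint32_t sum = s + d;
        if (sum > 255)
            sum = 255;
        result |= sum << shift;
    }
    return result;
}

/* Hypothetical solid fetcher: fill a row-sized buffer with one colour. */
static void fetch_solid(uint32_t *buf, uint32_t color, size_t width)
{
    for (size_t i = 0; i < width; i++)
        buf[i] = color;
}

/* Hypothetical OVER combiner: reads a source row, updates dst in place. */
static void combine_over(uint32_t *dst, const uint32_t *src, size_t width)
{
    for (size_t i = 0; i < width; i++)
        dst[i] = over_pixel(src[i], dst[i]);
}

/* General route: scratch row plus combiner (extra memory traffic). */
static void over_n_8888_general(uint32_t *dst, uint32_t color, size_t width,
                                uint32_t *scratch)
{
    fetch_solid(scratch, color, width);
    combine_over(dst, scratch, width);
}

/* Direct route: no scratch row; the colour never leaves a local. */
static void over_n_8888_direct(uint32_t *dst, uint32_t color, size_t width)
{
    for (size_t i = 0; i < width; i++)
        dst[i] = over_pixel(color, dst[i]);
}
```

Both routes produce identical pixels; the only difference is whether the constant source is written to and re-read from memory for every row. Whether the direct route actually wins on a given CPU still has to be measured.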
I implemented over_n_8888 for vmx (adapted from sse2) and ran
lowlevel-blt-bench. Performance degraded in every test except L1 (on
POWER8, ppc64le):
reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)

            Before     After     Change
L1          572.29     676.6    +18.23%
L2         1038.08    672.68    -35.20%
M          1104.1     682.63    -38.17%
HT          447.45    269.15    -39.85%
VT          520.82    357.1     -31.44%
R           407.92    259.46    -36.39%
RT          148.9     100.25    -32.67%
Kops/s     1100       910       -17.27%

so I'm not inclined to add this slow-path :)
Oded
>>> Presumably for patch 3 of this series (over_n_0565) you wouldn't see
>>> the same effect, as that can't be achieved using memcpy().
>>
>>
>> Where is that patch? I didn't see it on the mailing list.
>
>
> My bad again - in my mind, the patches for over_n_8888 and over_n_0565 in
> C and ARMv6 assembly were a group of four and I overlooked the fact that
> when Pekka split them in order to make the benchmarks more robust, he
> only reposted the over_n_8888 ones. My original over_n_0565 patches are
> here:
>
> http://patchwork.freedesktop.org/patch/49902/
> http://patchwork.freedesktop.org/patch/49903/
>
> Ben