[Pixman] [PATCH 1/4] pixman-fast-path: Add over_n_8888 fast path (disabled)

Oded Gabbay oded.gabbay at gmail.com
Sat Sep 5 13:34:03 PDT 2015


On Fri, Sep 4, 2015 at 11:36 PM, Siarhei Siamashka
<siarhei.siamashka at gmail.com> wrote:
> On Tue, 25 Aug 2015 18:29:59 +0300
> Oded Gabbay <oded.gabbay at gmail.com> wrote:
>
>> On Tue, Aug 25, 2015 at 5:02 PM, Ben Avison <bavison at riscosopen.org> wrote:
>> > On Tue, 25 Aug 2015 13:45:48 +0100, Oded Gabbay <oded.gabbay at gmail.com>
>> > wrote:
>> >>>
>> >>> [exposing general_composite_rect]
>> >>> I can't say that any cleaner solution has occurred to me since then.
>> >>
>> >>
>> >> I think the more immediate solution, as Søren has suggested on IRC,
>> >> is for me to implement the equivalent fast-path in VMX.
>> >> I see that it is already implemented in mmx, sse2, mips-dspr2 and
>> >> arm-neon. From looking at the C code, I'm guessing that it is fairly
>> >> simple to implement.
>> >
>> >
>> > Yes, it's definitely one of the simpler fast paths, with only two
>> > channels to worry about (source and destination) and with one of them
>> > being a constant. I wrote an arm-simd version as well, to add to your
>> > list - it's just that it's still waiting to be committed :)
>> >
>> > I probably ought to get round to exposing general_composite_rect sooner
>> > rather than later anyway - it's one of the few things from my mammoth
>> > patch series last year that Søren commented on and which I haven't got
>> > round to revising yet.
>> >
>> >>> I just had a quick look at the VMX source file, and it has hardly any
>> >>> iters defined. My guess would be that what's being used is
>> >>>
>> >>> noop_init_solid_narrow() from pixman-noop.c
>> >>> _pixman_iter_get_scanline_noop() from pixman-utils.c
>> >>> combine_src_u() from pixman-combine32.c
>> >>>
>> >> I ran perf on lowlevel-blt-bench over_n_8888, and this is what I got:
>> >>
>> >> -   48.71%    48.68%  lowlevel-blt-be  lowlevel-blt-bench  [.]
>> >> vmx_combine_over_u_no_mask
>> >>    - vmx_combine_over_u_no_mask
>> >
>> >
>> > Sorry, my mistake - for some reason I must have thought we were dealing
>> > with src_n_8888 rather than over_n_8888. If you can beat the C version
>> > using a solid fetcher (which fills a temporary buffer the size of the row
>> > with a constant pixel value) and an optimised OVER combiner, then you
>> > should be able to do better still if you cut out the temporary buffer and
>> > keep the solid colour in registers.
>> >
>>
>> I implemented over_n_8888 for vmx (adapted from sse2) and ran the
>> lowlevel benchmark. I got a degradation in almost all of the benchmarks
>> (on POWER8, ppc64le):
>>
>> reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
>> L1       572.29    676.6     +18.23%
>> L2      1038.08    672.68    -35.20%
>> M       1104.1     682.63    -38.17%
>> HT       447.45    269.15    -39.85%
>> VT       520.82    357.1     -31.44%
>> R        407.92    259.46    -36.39%
>> RT       148.9     100.25    -32.67%
>> Kops/s  1100       910       -17.27%
>>
>> so I'm not inclined to add this slow-path :)
>
> If it is slower, then you are probably implementing it wrong :)
>
> Please try http://patchwork.freedesktop.org/patch/58669/
>
> --
> Best regards,
> Siarhei Siamashka

Hi Siarhei,

Your remark was absolutely correct!
I did a little investigation, comparing your patch, my version and the
sse2 code, and now I understand why my version was much slower.

Because I converted the functions from sse2 to vmx, I used their
"expand_alpha_1x128" function to extract the alpha. However, it
extracts both the alpha and the color component that comes right after
the alpha, and the negation then has to get rid of that color component
by XORing with a mask.
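
If I reconstruct it correctly, the sse2 pattern I copied looks roughly
like this (a sketch from memory, not the verbatim pixman-sse2.c code):

    /* sse2 works on pixels unpacked to 16 bits per channel, so the alpha
     * is broadcast as a full 16-bit word per channel, and negating it
     * needs an extra XOR with a 0x00ff mask. */
    #include <emmintrin.h>

    static inline __m128i
    expand_alpha_sketch (__m128i data)   /* data: one pixel, 16 bits/channel */
    {
        return _mm_shufflehi_epi16 (
            _mm_shufflelo_epi16 (data, _MM_SHUFFLE (3, 3, 3, 3)),
            _MM_SHUFFLE (3, 3, 3, 3));
    }

    static inline __m128i
    negate_1x128_sketch (__m128i data)
    {
        /* ~alpha, restricted to the low byte of each 16-bit channel */
        return _mm_xor_si128 (data, _mm_set1_epi16 (0x00ff));
    }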

This is less efficient than what you did, which is to use the
splat_alpha function and negate the result with a single nor.
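
Based on my reading of your patch, that approach looks roughly like the
following (again only a sketch; the real helpers live in pixman-vmx.c
and may be named differently):

    #include <altivec.h>

    /* Splat the alpha byte of each packed a8r8g8b8 pixel into all four
     * bytes of that pixel (little-endian byte order assumed), then negate
     * it with a single nor.  Everything stays 8 bits per channel in one
     * vector: no 16-bit expansion, no extra mask. */
    static inline vector unsigned int
    splat_alpha_sketch (vector unsigned int pix)
    {
        const vector unsigned char sel = {
            0x03, 0x03, 0x03, 0x03, 0x07, 0x07, 0x07, 0x07,
            0x0B, 0x0B, 0x0B, 0x0B, 0x0F, 0x0F, 0x0F, 0x0F
        };
        return vec_perm (pix, pix, sel);
    }

    static inline vector unsigned int
    negate_sketch (vector unsigned int v)
    {
        return vec_nor (v, v);    /* ~v */
    }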

In addition (and much worse), the above caused further overhead:
because the alpha took 2 bytes instead of 1, I needed to use 2 vectors
for the result. This added more instructions, all of which are
unnecessary.

Moreover, using vmx for the odd cases (before dst is aligned to 16
bytes and at the end of the line) has more overhead than using the
optimized C macros.
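
In other words, I think the structure of the function should roughly be
the following (a sketch only, not the actual patch; vsrc, vsrca and
vmx_over are placeholders here, and over() stands for pixman's existing
scalar OVER helper):

    /* The solid source and its splatted, negated alpha stay in registers
     * for the whole scanline.  The unaligned head and the short tail go
     * through the per-pixel C path; VMX is used only for the 16-byte
     * aligned middle of the line, 4 pixels at a time. */
    while (w && ((uintptr_t) dst & 15))
    {
        *dst = over (src, *dst);
        dst++;
        w--;
    }

    while (w >= 4)
    {
        vector unsigned int vdst = vec_ld (0, dst);
        vec_st (vmx_over (vsrc, vsrca, vdst), 0, dst);
        dst += 4;
        w -= 4;
    }

    while (w--)
    {
        *dst = over (src, *dst);
        dst++;
    }

This also keeps the solid colour in registers the whole time, which is
what Ben suggested above.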

I will need to go over all of my functions and optimize them based on these findings.

Thanks again!

      Oded

