[Pixman] [PATCH 1/4] pixman-fast-path: Add over_n_8888 fast path (disabled)

Tue Aug 25 05:45:48 PDT 2015

On Mon, Aug 24, 2015 at 7:04 PM, Ben Avison <bavison at riscosopen.org> wrote:
>
> On Mon, 24 Aug 2015 15:20:22 +0100, Oded Gabbay <oded.gabbay at gmail.com> wrote:
>>
>> I tested the patch on POWER8, ppc64le.
>> make check passes, but when I benchmarked the performance using
>> lowlevel-blt-bench over_n_8888, I got far worse results than without
>> this patch (see numbers below).
>> Apparently, the new C fast-path takes precedence over some vmx combine
>> fast-paths, thus making performance worse instead of better.
>
>
> If that's true, it wouldn't be the first time that this issue has arisen.
> Pixman is designed to scan all fast paths (i.e. routines which take
> source, mask and destination images and perform the whole operation
> themselves) irrespective of how well tuned they are for a given platform,
> before it starts looking at iter routines (each of which reads or writes
> only one of source, mask or destination).
>
> Previously, in response to my attempt to work around a case where this
> was happening, Søren wrote:
>>
>> [...] the longstanding problem that pixman sometimes won't
>> pick the fastest code for some operations. In this case the general
>> iter based code will be faster then the C code in pixman-fast-path.c
>> because the iter code will use assembly fetchers.
>> As a result you end up with a bunch of partial reimplementations of
>> general_composite_rect() inside pixman-arm-simd.c.
>> Maybe we need to admit failure and make general_composite_rect()
>> available for other implementations to use. Essentially we would
>> officially provide a way for implementations to say: My iterators are
>> faster than pixman-fast-path.c for these specific operations, so just
>> go directly to general_composite_rect().
>> It's definitely not a pleasant thing to do, but given that nobody is
>> likely to have the time/skill combination required to do the necessary
>> redesign of pixman's fast path system, it might still be preferable to
>> to do this instead of duplicating the code like these patches do.
>> With a setup like that, we could also fix the same issue for the
>> bilinear SSSE3 fetchers and possibly other cases.
>
>
> (ref http://lists.freedesktop.org/archives/pixman/2014-October/003457.html)
>
> I can't say that any cleaner solution has occurred to me since then.

I think the more immediate solution, as Soren have suggested on IRC,
is for me to implement the equivalent fast-path in VMX.
I see that it is already implemented in mmx, sse2, mips-dspr2 and arm-neon.
>From looking at the C code, I'm guessing that it is fairly simple to implement.

>
> I just had a quick look at the VMX source file, and it has hardly any
> iters defined. My guess would be that what's being used is
>
> noop_init_solid_narrow() from pixman-noop.c
> _pixman_iter_get_scanline_noop() from pixman-utils.c
> combine_src_u() from pixman-combine32.c
>
> this last one would be responsible for the the bulk of the runtime - and
> it palms most of the work off on memcpy(). Presumably the PPC memcpy is
> super-optimised?

I run perf on lowlevel-blt-bench over_n_8888 and what I got is:

-   48.71%    48.68%  lowlevel-blt-be  lowlevel-blt-bench  [.]
vmx_combine_over_u_no_mask
   - vmx_combine_over_u_no_mask
      - pixman_image_composite32
         - 98.87% pixman_image_composite_wrapper
            + 35.32% bench_L
            + 30.58% bench_M
            + 11.71% bench_RT
            + 8.77% bench_R
            + 7.48% bench_HT
            + 6.13% bench_VT

Next major function was:
16.78%    16.72%  lowlevel-blt-be  libc-2.17.so        [.] __memcpy_power7

>
>> Without the patch:
>>
>> reference memcpy speed = 25711.7MB/s (6427.9MP/s for 32bpp fills)
>> ---
>> over_n_8888: PIXMAN_OP_OVER, src a8r8g8b8 solid, mask null, dst a8r8g8b8
>> ---
>>              over_n_8888 =  L1: 572.29  L2:1038.08  M:1104.10 (
>> 17.18%)  HT:447.45  VT:520.82  R:407.92  RT:148.90 (1100Kops/s)
>
>
> There's something a bit odd about that - it's slower when working within
> the caches (especially the L1 cache) than when operating on main memory.
> I'd hazard a guess that memcpy() is using some hardware acceleration that
> lives between the L1 and L2 caches.
>
> Presumably for patch 3 of this series (over_n_0565) you wouldn't see the
> same effect, as that can't be achieved using mempcy().
>
> Ben

Where is that patch ? I didn't see it in the mailing list.

         Oded