[Pixman] [PATCH 1/4] pixman-fast-path: Add over_n_8888 fast path (disabled)
Oded Gabbay
oded.gabbay at gmail.com
Tue Aug 25 05:45:48 PDT 2015
On Mon, Aug 24, 2015 at 7:04 PM, Ben Avison <bavison at riscosopen.org> wrote:
>
> On Mon, 24 Aug 2015 15:20:22 +0100, Oded Gabbay <oded.gabbay at gmail.com> wrote:
>>
>> I tested the patch on POWER8, ppc64le.
>> make check passes, but when I benchmarked the performance using
>> lowlevel-blt-bench over_n_8888, I got far worse results than without
>> this patch (see numbers below).
>> Apparently, the new C fast-path takes precedence over some vmx combine
>> fast-paths, thus making performance worse instead of better.
>
>
> If that's true, it wouldn't be the first time that this issue has arisen.
> Pixman is designed to scan all fast paths (i.e. routines which take
> source, mask and destination images and perform the whole operation
> themselves) irrespective of how well tuned they are for a given platform,
> before it starts looking at iter routines (each of which reads or writes
> only one of source, mask or destination).
>
> Previously, in response to my attempt to work around a case where this
> was happening, Søren wrote:
>>
>> [...] the longstanding problem that pixman sometimes won't
>> pick the fastest code for some operations. In this case the general
>> iter based code will be faster then the C code in pixman-fast-path.c
>> because the iter code will use assembly fetchers.
>> As a result you end up with a bunch of partial reimplementations of
>> general_composite_rect() inside pixman-arm-simd.c.
>> Maybe we need to admit failure and make general_composite_rect()
>> available for other implementations to use. Essentially we would
>> officially provide a way for implementations to say: My iterators are
>> faster than pixman-fast-path.c for these specific operations, so just
>> go directly to general_composite_rect().
>> It's definitely not a pleasant thing to do, but given that nobody is
>> likely to have the time/skill combination required to do the necessary
>> redesign of pixman's fast path system, it might still be preferable to
>> to do this instead of duplicating the code like these patches do.
>> With a setup like that, we could also fix the same issue for the
>> bilinear SSSE3 fetchers and possibly other cases.
>
>
> (ref http://lists.freedesktop.org/archives/pixman/2014-October/003457.html)
>
> I can't say that any cleaner solution has occurred to me since then.
I think the more immediate solution, as Soren have suggested on IRC,
is for me to implement the equivalent fast-path in VMX.
I see that it is already implemented in mmx, sse2, mips-dspr2 and arm-neon.
>From looking at the C code, I'm guessing that it is fairly simple to implement.
>
> I just had a quick look at the VMX source file, and it has hardly any
> iters defined. My guess would be that what's being used is
>
> noop_init_solid_narrow() from pixman-noop.c
> _pixman_iter_get_scanline_noop() from pixman-utils.c
> combine_src_u() from pixman-combine32.c
>
> this last one would be responsible for the the bulk of the runtime - and
> it palms most of the work off on memcpy(). Presumably the PPC memcpy is
> super-optimised?
I run perf on lowlevel-blt-bench over_n_8888 and what I got is:
- 48.71% 48.68% lowlevel-blt-be lowlevel-blt-bench [.]
vmx_combine_over_u_no_mask
- vmx_combine_over_u_no_mask
- pixman_image_composite32
- 98.87% pixman_image_composite_wrapper
+ 35.32% bench_L
+ 30.58% bench_M
+ 11.71% bench_RT
+ 8.77% bench_R
+ 7.48% bench_HT
+ 6.13% bench_VT
Next major function was:
16.78% 16.72% lowlevel-blt-be libc-2.17.so [.] __memcpy_power7
>
>> Without the patch:
>>
>> reference memcpy speed = 25711.7MB/s (6427.9MP/s for 32bpp fills)
>> ---
>> over_n_8888: PIXMAN_OP_OVER, src a8r8g8b8 solid, mask null, dst a8r8g8b8
>> ---
>> over_n_8888 = L1: 572.29 L2:1038.08 M:1104.10 (
>> 17.18%) HT:447.45 VT:520.82 R:407.92 RT:148.90 (1100Kops/s)
>
>
> There's something a bit odd about that - it's slower when working within
> the caches (especially the L1 cache) than when operating on main memory.
> I'd hazard a guess that memcpy() is using some hardware acceleration that
> lives between the L1 and L2 caches.
>
> Presumably for patch 3 of this series (over_n_0565) you wouldn't see the
> same effect, as that can't be achieved using mempcy().
>
> Ben
Where is that patch ? I didn't see it in the mailing list.
Oded
More information about the Pixman
mailing list