[Pixman] [PATCH 1/4] pixman-fast-path: Add over_n_8888 fast path (disabled)

Mon Aug 24 09:04:25 PDT 2015

On Mon, 24 Aug 2015 15:20:22 +0100, Oded Gabbay <oded.gabbay at gmail.com> wrote:
> I tested the patch on POWER8, ppc64le.
> make check passes, but when I benchmarked the performance using
> lowlevel-blt-bench over_n_8888, I got far worse results than without
> this patch (see numbers below).
> Apparently, the new C fast-path takes precedence over some vmx combine
> fast-paths, thus making performance worse instead of better.

If that's true, it wouldn't be the first time that this issue has arisen.
Pixman is designed to scan all fast paths (i.e. routines which take
source, mask and destination images and perform the whole operation
themselves) irrespective of how well tuned they are for a given platform,
before it starts looking at iter routines (each of which reads or writes
only one of source, mask or destination).

Previously, in response to my attempt to work around a case where this
was happening, Søren wrote:
> [...] the longstanding problem that pixman sometimes won't
> pick the fastest code for some operations. In this case the general
> iter based code will be faster than the C code in pixman-fast-path.c
> because the iter code will use assembly fetchers.
> As a result you end up with a bunch of partial reimplementations of
> general_composite_rect() inside pixman-arm-simd.c.
> Maybe we need to admit failure and make general_composite_rect()
> available for other implementations to use. Essentially we would
> officially provide a way for implementations to say: My iterators are
> faster than pixman-fast-path.c for these specific operations, so just
> go directly to general_composite_rect().
> It's definitely not a pleasant thing to do, but given that nobody is
> likely to have the time/skill combination required to do the necessary
> redesign of pixman's fast path system, it might still be preferable
> to do this instead of duplicating the code like these patches do.
> With a setup like that, we could also fix the same issue for the
> bilinear SSSE3 fetchers and possibly other cases.

(ref http://lists.freedesktop.org/archives/pixman/2014-October/003457.html)

I can't say that any cleaner solution has occurred to me since then.

I just had a quick look at the VMX source file, and it has hardly any
iters defined. My guess would be that what's being used is

noop_init_solid_narrow() from pixman-noop.c
_pixman_iter_get_scanline_noop() from pixman-utils.c
combine_src_u() from pixman-combine32.c

This last one would be responsible for the bulk of the runtime - and
it palms most of the work off on memcpy(). Presumably the PPC memcpy is
super-optimised?

> Without the patch:
>
> reference memcpy speed = 25711.7MB/s (6427.9MP/s for 32bpp fills)
> ---
> over_n_8888: PIXMAN_OP_OVER, src a8r8g8b8 solid, mask null, dst a8r8g8b8
> ---
>              over_n_8888 =  L1: 572.29  L2:1038.08  M:1104.10 ( 17.18%)  HT:447.45  VT:520.82  R:407.92  RT:148.90 (1100Kops/s)

There's something a bit odd about that - it's slower when working within
the caches (especially the L1 cache) than when operating on main memory.
I'd hazard a guess that memcpy() is using some hardware acceleration that
lives between the L1 and L2 caches.

Presumably for patch 3 of this series (over_n_0565) you wouldn't see the
same effect, as that can't be achieved using memcpy().

Ben