[Pixman] [PATCH] vmx: implement fast path vmx_composite_over_n_8888
Oded Gabbay
oded.gabbay at gmail.com
Sat Sep 5 12:03:38 PDT 2015
On Fri, Sep 4, 2015 at 3:39 PM, Siarhei Siamashka
<siarhei.siamashka at gmail.com> wrote:
> Running "lowlevel-blt-bench over_n_8888" on Playstation3 3.2GHz,
> Gentoo ppc (32-bit userland) gave the following results:
>
> before: over_n_8888 = L1: 147.47 L2: 205.86 M:121.07
> after: over_n_8888 = L1: 287.27 L2: 261.09 M:133.48
>
> Signed-off-by: Siarhei Siamashka <siarhei.siamashka at gmail.com>
> ---
> pixman/pixman-vmx.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 54 insertions(+), 0 deletions(-)
>
> diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
> index a9bd024..9e551b3 100644
> --- a/pixman/pixman-vmx.c
> +++ b/pixman/pixman-vmx.c
> @@ -2745,6 +2745,58 @@ vmx_composite_src_x888_8888 (pixman_implementation_t *imp,
> }
>
> static void
> +vmx_composite_over_n_8888 (pixman_implementation_t *imp,
> + pixman_composite_info_t *info)
> +{
> + PIXMAN_COMPOSITE_ARGS (info);
> + uint32_t *dst_line, *dst;
> + uint32_t src, ia;
> + int i, w, dst_stride;
> + vector unsigned int vdst, vsrc, via;
> +
> + src = _pixman_image_get_solid (imp, src_image, dest_image->bits.format);
> +
> + if (src == 0)
> + return;
> +
> + PIXMAN_IMAGE_GET_LINE (
> + dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
> +
> + vsrc = (vector unsigned int){src, src, src, src};
> + via = negate (splat_alpha (vsrc));
If we will use the over function (see my next comment), we need to
remove the negate() from the above statement, as it is done in the
over function.
> + ia = ALPHA_8 (~src);
> +
> + while (height--)
> + {
> + dst = dst_line;
> + dst_line += dst_stride;
> + w = width;
> +
> + while (w && ((uintptr_t)dst & 15))
> + {
> + uint32_t d = *dst;
> + UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> + *dst++ = d;
> + w--;
> + }
> +
> + for (i = w / 4; i > 0; i--)
> + {
> + vdst = pix_multiply (load_128_aligned (dst), via);
> + save_128_aligned (dst, pix_add (vsrc, vdst));
Instead of the above two lines, I would simply use the over function
in vmx, which does exactly that. So:
vdst = over(vsrc, via, load_128_aligned(dst))
save_128_aligned (dst, vdst);
I prefer this as it reuses an existing function which helps
maintainability, and using it has no impact on performance.
> + dst += 4;
> + }
> +
> + for (i = w % 4; --i >= 0;)
> + {
> + uint32_t d = dst[i];
> + UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> + dst[i] = d;
> + }
> + }
> +}
> +
> +static void
> vmx_composite_over_8888_8888 (pixman_implementation_t *imp,
> pixman_composite_info_t *info)
> {
> @@ -3079,6 +3131,8 @@ FAST_NEAREST_MAINLOOP (vmx_8888_8888_normal_OVER,
>
> static const pixman_fast_path_t vmx_fast_paths[] =
> {
> + PIXMAN_STD_FAST_PATH (OVER, solid, null, a8r8g8b8, vmx_composite_over_n_8888),
> + PIXMAN_STD_FAST_PATH (OVER, solid, null, x8r8g8b8, vmx_composite_over_n_8888),
> PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, vmx_composite_over_8888_8888),
> PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, vmx_composite_over_8888_8888),
> PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, vmx_composite_over_8888_8888),
> --
> 1.7.8.6
>
Indeed, this implementation is much better than what I did.
Apparently, converting sse2 to vmx calls isn't the optimal way.
On my POWER8 machine, I get:
reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
L1 572.29 1539.47 +169.00%
L2 1038.08 1549.04 +49.22%
M 1104.1 1522.22 +37.87%
HT 447.45 676.32 +51.15%
VT 520.82 764.82 +46.85%
R 407.92 570.54 +39.87%
RT 148.9 208.77 +40.21%
Kops/s 1100 1418 +28.91%
So, assuming the change above, this patch is:
Reviewed-by: Oded Gabbay <oded.gabbay at gmail.com>
More information about the Pixman
mailing list