[Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Siarhei Siamashka siarhei.siamashka at gmail.com
Fri Aug 13 10:53:59 PDT 2010


On Wednesday 11 August 2010 09:00:54 Liu Xinyun wrote:
> Hi,
> 
> piman-access.c: fetch_scanline_x8r8g8b8() is mainly memcpy and 'or"
> operations. With ssse3_memcpy, the performance is increased a little.
> 
> Reference: http://bugs.meego.com/show_bug.cgi?id=5012
> 
> Quote:
> > After optimization, we can find following speed up:
> > 1.    fetch_scanline_x8r8g8b8()'s cycles is reduced 84% during clips run.
> > It now consumes 0.8% of whole system, comparing with 2.3% w/o
> > optimization. 2.    system C0 is reduced around 1.5%
> > 3.    Global CPU utilization is reduced around 1.5% of whole system
> 
> Here is patch, please review it, thanks!

Thanks for the patch. This performance problem has been known for some time
already and there is a bug about it (with some useful comments):
https://bugs.freedesktop.org/show_bug.cgi?id=20709

Out of curiosity, how much of the performance gain is provided by using 
MOVAPS+PALIGNR from SSSE3 over just a plain SSE2 implementation via MOVUPS
on Intel Atom? Is it also beneficial on other processors supporting SSSE3
such as Intel Core i7?

That said, I surely appreciate the intent of getting the absolutely best
performance whenever it is possible. I also tried to find the best possible
memory access pattern myself for ARM NEON on OMAP3430 SoC, and then generalize
the code to support many types of different compositing operations (in
'pixman-arm-neon-asm.*').

Now back to your patch. It is not a compete review, I just wanted to comment a
few things which stick out at the moment.

As I see, the biggest problems in it are:

1. This patch introduces pixman build failure on 64-bit systems, because
'pixman-access-ssse3.S' is not 64-bit aware and makes a lot of 32-bit 
assumptions.

2. Runtime detection of SSSE3 should be used instead of #ifdefs in the code. 
Right now, applying this patch will just make default pixman build unusable on 
the processors which do not support SSSE3.

3. Linking 'pixman-access-ssse3.o' object file with the others enforces the
use of executable stack, which is a bad thing. More details can be found here 
for example:
http://www.gentoo.org/proj/en/hardened/gnu-stack.xml

Less important things spotted by reading the code are below.

> From 30514afe4d0af3a476ec8dcea494f7e216f0fc9d Mon Sep 17 00:00:00 2001
> From: Liu Xinyun <xinyun.liu at intel.com>
> Date: Wed, 11 Aug 2010 16:31:47 +0800
> Subject: [PATCH] Optimize fetch_scanline_x8r8g8b8 with SSSE3 instruction
> 
> Signed-off-by: Liu Xinyun <xinyun.liu at intel.com>
> Signed-off-by: Xu, Samuel <samuel.xu at intel.com>
> Signed-off-by: Ma, ling <ling.ma at intel.com>
> ---
>  configure.ac                 |   56 +++
>  pixman/Makefile.am           |   14 +
>  pixman/pixman-access-ssse3.S | 1119
> ++++++++++++++++++++++++++++++++++++++++++ pixman/pixman-access.c       | 
>   6 +
>  4 files changed, 1195 insertions(+), 0 deletions(-)
>  create mode 100644 pixman/pixman-access-ssse3.S

[...]
 
> +	.section .text.ssse3,"ax", at progbits
> +ENTRY (MEMCPY_OR)
> +	ENTRANCE
> +	movl	LEN(%esp), %ecx
> +	movl	SRC(%esp), %eax
> +	movl	DEST(%esp), %edx
> +
> +	cmp	$48, %ecx
> +	jae	L(48bytesormore)
> +
> +	cmp	%dl, %al
> +	jb	L(bk_write)
> +	add	%ecx, %edx
> +	add	%ecx, %eax
> +	BRANCH_TO_JMPTBL_ENTRY (L(table_48bytes_fwd), %ecx, 4)
> +L(bk_write):
> +	BRANCH_TO_JMPTBL_ENTRY (L(table_48_bytes_bwd), %ecx, 4)

Backwards copy is not strictly needed for 'fetch_scanline_x8r8g8b8' at the 
moment. And probably will never be because scanlines get fetched into a 
temporary buffer.

> +
> +	ALIGN (4)
> +/* ECX > 32 and EDX is 4 byte aligned.  */
> +L(48bytesormore):
> +	movdqu	(%eax), %xmm0
> +	PUSH (%edi)
> +	mov	$0xff000000, %edi
> +	movd	%edi, %xmm6
> +	movl	%edx, %edi
> +	and	$-16, %edx
> +	PUSH (%esi)
> +	add	$16, %edx
> +	movl	%edi, %esi
> +	sub	%edx, %edi
> +	add	%edi, %ecx
> +	sub	%edi, %eax
> +
> +	mov	%esi, %edi
> +	pshufd	$0, %xmm6, %xmm6

> +	and	$3, %edi
> +	por	%xmm6, %xmm0
> +	jz	L(aligned4bytes)
> +	cmp	$3, %edi
> +	psrldq	$1, %xmm6
> +	jz	L(aligned4bytes)
> +	cmp	$2, %edi
> +	psrldq	$1, %xmm6
> +	jz	L(aligned4bytes)
> +	psrldq	$1, %xmm6

Is there any way for the value in EDI register to be not a multiple of 4 here?
Both source and destination are guaranteed to be 4 bytes aligned because they 
are pointers to uint32_t.

This seems to be a bit redundant here (it looks like a leftover from 
memcpy/memmove implementation which was used as a templete for this code). The 
negative impact is a slight runtime overhead (especially for short scanlines) 
and bigger code size.

[...]

> L(shl_1):

[...]

> L(shl_1_loop):

[...]

> L(shl_2):

[...]

> L(shl_2_loop):

[...]

The rest of code contains a lot of repeatable patterns (the main loop is 
replicated 16 times for different alignments). IMHO they could be
simplified a lot by using macros, making the code size less scary ;-)

-- 
Best regards,
Siarhei Siamashka
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part.
URL: <http://lists.freedesktop.org/archives/pixman/attachments/20100813/5b493085/attachment-0001.pgp>


More information about the Pixman mailing list