[Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Xu, Samuel samuel.xu at intel.com
Mon Aug 16 01:24:44 PDT 2010


Hi, Siamashka:
	Thanks for the kind comments! It is good to see that bug #20709 also highlights a similar performance issue.
	So, let's discuss how to make this patch better. Here are some answers from me; Ma Ling will follow up with comments on the 32/64-bit question and the assembly details:

1) Regarding MOVAPS+PALIGNR from SSSE3: referring to chapter 12 of the <Intel(r) 64 and IA-32 Architectures Optimization Reference Manual> at http://www.intel.com/Assets/PDF/manual/248966.pdf, MOVAPS+PALIGNR is preferred on Atom: "Assembly/Compiler Coding Rule 13. (MH impact, M generality) For Intel Atom processors, prefer a sequence MOVAPS+PALIGN over MOVUPS. Similarly, MOVDQA+PALIGNR is preferred over MOVDQU." As far as I know, this code pattern is friendly to Atom and won't hurt Core i7.
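For reference, the idiom the manual recommends looks roughly like this (an illustrative sketch for a source that is 4 bytes past a 16-byte boundary; AT&T syntax, not taken from the patch):

```
	/* Instead of one unaligned load:
	 *	movdqu	4(%eax), %xmm0
	 * do two aligned loads and recombine with PALIGNR: */
	movdqa	(%eax), %xmm1		/* bytes  0..15 of the source */
	movdqa	16(%eax), %xmm0		/* bytes 16..31 of the source */
	palignr	$4, %xmm1, %xmm0	/* %xmm0 = source bytes 4..19 */
```

On Atom the two aligned loads plus the shuffle avoid the penalty of MOVDQU crossing a cache-line boundary; the immediate shift count is why the patch needs a separate loop per alignment.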
2) Regarding #ifdefs vs. dynamically checking for SSSE3: the patch exactly mimics the existing "SSE2" support in the current pixman code (#ifdefs), which checks the toolchain when the makefiles are generated and applies USE_SSSE3 to the subsequent build. Of course, a CPUID-based runtime check would give better compatibility, and it is fine for us to add one. I would appreciate direct comments on the expected shape of this patch with SSSE3, e.g. some sample code for the intended pixman-cpu.c and pixman-access.c changes.
3) Regarding the executable stack caused by the assembly file: it should be solvable by adding a ".section .note.GNU-stack" directive, according to https://www.redhat.com/archives/fedora-devel-list/2005-March/msg00460.html. In fact, the same solution already exists in pixman-arm-neon-asm.S in the current pixman code.
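The directive in question looks like this (a sketch along the lines of the existing ARM file; note that on x86 the section type is spelled @progbits, whereas the ARM port uses %progbits because @ starts a comment there):

```
#if defined(__linux__) && defined(__ELF__)
	.section .note.GNU-stack,"",@progbits
#endif
```

An empty .note.GNU-stack section without the "x" flag tells the linker this object does not require an executable stack, so linking it in no longer forces one on the whole library.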

Thanks!
Samuel

-----Original Message-----
From: Siarhei Siamashka [mailto:siarhei.siamashka at gmail.com] 
Sent: Saturday, August 14, 2010 1:54 AM
To: pixman at lists.freedesktop.org
Cc: Liu, Xinyun; Ma, Ling; Xu, Samuel
Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

On Wednesday 11 August 2010 09:00:54 Liu Xinyun wrote:
> Hi,
> 
> pixman-access.c: fetch_scanline_x8r8g8b8() is mainly memcpy and 'or'
> operations. With ssse3_memcpy, the performance is increased a little.
> 
> Reference: http://bugs.meego.com/show_bug.cgi?id=5012
> 
> Quote:
> > After optimization, we can see the following speedups:
> > 1.    fetch_scanline_x8r8g8b8()'s cycle count is reduced by 84% while
> >       clips run. It now consumes 0.8% of the whole system, compared
> >       with 2.3% without optimization.
> > 2.    System C0 residency is reduced by around 1.5%.
> > 3.    Global CPU utilization is reduced by around 1.5% of the whole system.
> 
> Here is patch, please review it, thanks!

Thanks for the patch. This performance problem has been known for some time already and there is a bug about it (with some useful comments):
https://bugs.freedesktop.org/show_bug.cgi?id=20709

Out of curiosity, how much of the performance gain comes from using MOVAPS+PALIGNR from SSSE3 over a plain SSE2 implementation via MOVAPS+MOVUPS on Intel Atom? Is it also beneficial on other processors supporting SSSE3, such as Intel Core i7?

That said, I surely appreciate the intent of getting the absolutely best performance whenever it is possible. I also tried to find the best possible memory access pattern myself for ARM NEON on OMAP3430 SoC, and then generalize the code to support many types of different compositing operations (in 'pixman-arm-neon-asm.*').

Now back to your patch. This is not a complete review; I just wanted to comment on a few things which stick out at the moment.

As I see, the biggest problems in it are:

1. This patch introduces pixman build failure on 64-bit systems, because 'pixman-access-ssse3.S' is not 64-bit aware and makes a lot of 32-bit assumptions.

2. Runtime detection of SSSE3 should be used instead of #ifdefs in the code. 
Right now, applying this patch will just make default pixman build unusable on the processors which do not support SSSE3.

3. Linking the 'pixman-access-ssse3.o' object file with the others enforces the use of an executable stack, which is a bad thing. More details can be found here, for example:
http://www.gentoo.org/proj/en/hardened/gnu-stack.xml

Less important things spotted by reading the code are below.

> From 30514afe4d0af3a476ec8dcea494f7e216f0fc9d Mon Sep 17 00:00:00 2001
> From: Liu Xinyun <xinyun.liu at intel.com>
> Date: Wed, 11 Aug 2010 16:31:47 +0800
> Subject: [PATCH] Optimize fetch_scanline_x8r8g8b8 with SSSE3 
> instruction
> 
> Signed-off-by: Liu Xinyun <xinyun.liu at intel.com>
> Signed-off-by: Xu, Samuel <samuel.xu at intel.com>
> Signed-off-by: Ma, ling <ling.ma at intel.com>
> ---
>  configure.ac                 |   56 +++
>  pixman/Makefile.am           |   14 +
>  pixman/pixman-access-ssse3.S | 1119 ++++++++++++++++++++++++++++++++++++++++++
>  pixman/pixman-access.c       |    6 +
>  4 files changed, 1195 insertions(+), 0 deletions(-)
>  create mode 100644 pixman/pixman-access-ssse3.S

[...]
 
> +	.section .text.ssse3,"ax",@progbits
> +ENTRY (MEMCPY_OR)
> +	ENTRANCE
> +	movl	LEN(%esp), %ecx
> +	movl	SRC(%esp), %eax
> +	movl	DEST(%esp), %edx
> +
> +	cmp	$48, %ecx
> +	jae	L(48bytesormore)
> +
> +	cmp	%dl, %al
> +	jb	L(bk_write)
> +	add	%ecx, %edx
> +	add	%ecx, %eax
> +	BRANCH_TO_JMPTBL_ENTRY (L(table_48bytes_fwd), %ecx, 4)
> +L(bk_write):
> +	BRANCH_TO_JMPTBL_ENTRY (L(table_48_bytes_bwd), %ecx, 4)

A backwards copy is not strictly needed for 'fetch_scanline_x8r8g8b8' at the moment, and probably never will be, because scanlines get fetched into a temporary buffer.

> +
> +	ALIGN (4)
> +/* ECX > 32 and EDX is 4 byte aligned.  */
> +L(48bytesormore):
> +	movdqu	(%eax), %xmm0
> +	PUSH (%edi)
> +	mov	$0xff000000, %edi
> +	movd	%edi, %xmm6
> +	movl	%edx, %edi
> +	and	$-16, %edx
> +	PUSH (%esi)
> +	add	$16, %edx
> +	movl	%edi, %esi
> +	sub	%edx, %edi
> +	add	%edi, %ecx
> +	sub	%edi, %eax
> +
> +	mov	%esi, %edi
> +	pshufd	$0, %xmm6, %xmm6

> +	and	$3, %edi
> +	por	%xmm6, %xmm0
> +	jz	L(aligned4bytes)
> +	cmp	$3, %edi
> +	psrldq	$1, %xmm6
> +	jz	L(aligned4bytes)
> +	cmp	$2, %edi
> +	psrldq	$1, %xmm6
> +	jz	L(aligned4bytes)
> +	psrldq	$1, %xmm6

Is there any way for the value in EDI register to be not a multiple of 4 here?
Both source and destination are guaranteed to be 4 bytes aligned because they are pointers to uint32_t.

This seems to be a bit redundant here (it looks like a leftover from the memcpy/memmove implementation which was used as a template for this code). The negative impact is a slight runtime overhead (especially for short scanlines) and a bigger code size.

[...]

> L(shl_1):

[...]

> L(shl_1_loop):

[...]

> L(shl_2):

[...]

> L(shl_2_loop):

[...]

The rest of the code contains a lot of repeated patterns (the main loop is replicated 16 times for different alignments). IMHO they could be simplified a lot by using macros, making the code size less scary ;-)
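One possible shape for such a macro (a hypothetical, untested sketch using a GAS .macro; register roles follow the patch, but the loop-control details are simplified):

```
	/* Emit one realignment loop per source misalignment \shift
	 * instead of 16 hand-written copies.  Assumed roles as in the
	 * patch: %eax = aligned src, %edx = aligned dst, %ecx = bytes
	 * remaining, %xmm1 = previously loaded aligned source chunk. */
	.macro	SHL_LOOP shift
L(shl_\shift\()_loop):
	movdqa	16(%eax), %xmm2		/* next aligned source chunk         */
	sub	$16, %ecx
	movdqa	%xmm2, %xmm3		/* keep a copy for the next round    */
	palignr	$\shift, %xmm1, %xmm2	/* recover 16 unaligned source bytes */
	movdqa	%xmm3, %xmm1
	movdqa	%xmm2, (%edx)		/* aligned store to the destination  */
	lea	16(%eax), %eax
	lea	16(%edx), %edx
	jae	L(shl_\shift\()_loop)
	.endm

	SHL_LOOP 1
	SHL_LOOP 2
	/* ... and so on, up to SHL_LOOP 15 */
```

PALIGNR takes its shift count as an immediate, so the 16 loop bodies cannot be merged at runtime, but a macro at least keeps them from being maintained by hand.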

--
Best regards,
Siarhei Siamashka

