[Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Thu Aug 26 19:59:00 PDT 2010

Hi Siarhei Siamashka,

Here is a new patch, can you review it? Thank you!
It address following suggestions:
1: SSSE3 file is split to a new file. Comparing with to duplicate every content from SSE2 file, I added a way to merge new fast path function pointer table runtimely by CPUID check, to avoid huge code expansion and a bug you pointed out.(Without runtime merging, I have to duplicate every line from SSE2 file to SSSE3 file, it is really not graceful and redundant.
2: Copy right information added
3: Makefile fix
4: author name fix

For detect_cpu_features(), we possibly can remove the TWO lines asm code to copy edx to "result", to avoid SSE2 and MMX checking, since most preparing code still need. If we keep this 2 lines, it still give us some possibility to check other edx information of CPUID in the future. It is only executed once when pixman is loaded.

For ASM code of 'bk_write' and 'table_48_bytes_bwd', here is the reason of "why we use backward copy mode in the patch ", from Ling:
All read operations after allocation stage can run speculatively, all write operation will run in program order, and if addresses are different read may run before older write operation, otherwise wait until write commit. However CPU don't check each address bit, so read could fail to recognize different address even theyare in different page.
For example if rsi is 0xf004, rdi is 0xe008,in following operation there will generate big performance latency.
1. movq (%rsi),	%rax
2. movq %rax,		(%rdi)
3. movq 8(%rsi),	%rax
4. movq %rax,		8(%rdi)

If %rsi and rdi were in really the same meory page, there are TRUE read-after-write dependence because instruction 2 write 0x008 and instruction 3 read 0x00c, the two address are overlap partially. Actually there are in different page and no any issues, but without checking each address bit CPU could think they are in the same page, and instruction 3 have to wait for instruction 2 to write data into cache from write buffer, then load data from cache, the cost time read spent is equal to mfence instruction. 
We may avoid it by tuning operation sequence as follow.
1. movq 8(%rsi), %rax
2. movq %rax,	8(%rdi)
3. movq (%rsi),	%rax
4. movq %rax,	(%rdi)
Instruction 3 read 0x004, instruction 2 write address 0x010, no any dependence.
In this patch we first handle small size(less 48bytes) by this way, and the approach help us improve up to 2X on Atom.

Regards,
Liu, Xinyun

-----Original Message-----
From: Siarhei Siamashka [mailto:siarhei.siamashka at gmail.com] 
Sent: Sunday, August 22, 2010 1:49 AM
To: Liu, Xinyun
Cc: pixman at lists.freedesktop.org; Ma, Ling; Xu, Samuel
Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

On Friday 20 August 2010 18:39:47 Liu, Xinyun wrote:
> Hi Siarhei Siamashka,
> 
> Here is a new patch, can you review it? Thank you!

Sure, thanks for the updated patch. Some comments follow.

> From 9783651899a2763d7fcca2960fc354bd1f541980 Mon Sep 17 00:00:00 2001
> From: root <root at d501.localdomain>

A minor nitpick here. The 'From:' header seems to be wrong, it does not look like a real patch author name.

> +dnl Check for SSSE3
> +
> +if test "x$SSSE3_CFLAGS" = "x" ; then
> +   if test "x$SUNCC" = "xyes"; then
> +      # SSSE3 is enabled by default in the Sun Studio 64-bit 
> +environment

Is it a valid statement with the regards to Sun Studio? Looks like it is a direct copy/paste from SSE2. Can we be sure that there is nothing redundant or just plain wrong here? 

> +#if defined(__GNUC__) && (__GNUC__ < 4 || (__GNUC__ == 4 && 
> +__GNUC_MINOR__ <
3))
> +#   if !defined(__amd64__) && !defined(__x86_64__)

Is this 'defined(__amd64__)' check needed here? Looks like it is again a copy/paste from SSE2 code, but SSE2 is guaranteed to be supported for x86-64 systems.

> +#      error "Need GCC >= 4.3 for SSSE3 intrinsics on x86"
> +#   endif
> +#endif

> diff --git a/pixman/pixman-access-ssse3_x86-64.S 
> b/pixman/pixman-access-
ssse3_x86-64.S
> new file mode 100644
> index 0000000..fec93df
> --- /dev/null
> +++ b/pixman/pixman-access-ssse3_x86-64.S

Copyright notice seems to be still missing in the newly added source files.

> +	cmp	%dil, %sil
> +	jb	L(bk_write)
> +	add	%rdx, %rsi
> +	add	%rdx, %rdi
> +	BRANCH_TO_JMPTBL_ENTRY (L(table_48bytes_fwd), %rdx, 4)
> +L(bk_write):
> +	BRANCH_TO_JMPTBL_ENTRY (L(table_48_bytes_bwd), %rdx, 4)

Would it make sense to also purge this 'bk_write' and 'table_48_bytes_bwd' 
stuff?

> diff --git a/pixman/pixman-cpu.c b/pixman/pixman-cpu.c index 
> e4fb1e4..ab2d69a 100644

[...]

> +#else         /* end dt_cpu32() */
> +
> +static unsigned int
> +detect_cpu_features (void)
> +{
> +    unsigned int features = 0;
> +    unsigned int result = 0;
> +    unsigned int result_c = 0;

This looks like a duplication of the code to make a new 64-bit version of 'detect_cpu_features' function. Could we assume that all 64-bit processors support CPUID instruction (so that the check for CPUID support can be omitted), and also support MMX and SSE2? In this case SSSE3 check will be just a tiny block of inline assembly using CPUID instruction.

> diff --git a/pixman/pixman-sse2.c b/pixman/pixman-sse2.c index 
> 3dd7967..0cfd554 100644
> --- a/pixman/pixman-sse2.c
> +++ b/pixman/pixman-sse2.c
> @@ -3520,6 +3520,56 @@ sse2_composite_over_8888_n_8888
(pixman_implementation_t *imp,
>  /*---------------------------------------------------------------------
>   * composite_over_8888_n_8888
>   */
> +#if defined(USE_SSSE3) && !defined(_MSC_VER)
> +
> +extern void *composite_src_x888_8888_ssse3_fast_path( uint32_t * 
> +dest,
uint32_t *src,  int32_t count);
> +extern pixman_bool_t choose_ssse3_fast_patch; static void
> +ssse3_composite_src_x888_8888 (pixman_implementation_t *imp,

The addition of 'ssse3_composite_src_x888_8888' into 'pixman-sse2.c' does not look like a good idea. Adding a new 'pixman-ssse3.c' file for SSSE3 optimizations would have been a better choice.

Especially considering the following part of code:

> +#if defined(USE_SSSE3) && !defined(_MSC_VER)
> +    PIXMAN_STD_FAST_PATH (SRC, x8r8g8b8, null, a8r8g8b8,
ssse3_composite_src_x888_8888),
> +    PIXMAN_STD_FAST_PATH (SRC, x8b8g8r8, null, a8b8g8r8,
ssse3_composite_src_x888_8888),
> +#else
>      PIXMAN_STD_FAST_PATH (SRC, x8r8g8b8, null, a8r8g8b8,
sse2_composite_src_x888_8888),
>      PIXMAN_STD_FAST_PATH (SRC, x8b8g8r8, null, a8b8g8r8,
sse2_composite_src_x888_8888),
> +#endif

By using "#if defined(USE_SSSE3)" here, you effectively disable 'sse2_composite_src_x888_8888' fast path for the processors which do not have
SSSE3 support. If the runtime detection of SSSE3 fails, the code fallbacks to C implementation and we have no chance to benefit from SSE2 optimizations.

--
Best regards,
Siarhei Siamashka
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Add-ssse3_composite_src_x888_8888.patch
Type: application/octet-stream
Size: 45419 bytes
Desc: 0001-Add-ssse3_composite_src_x888_8888.patch
URL: <http://lists.freedesktop.org/archives/pixman/attachments/20100827/f6b95a5d/attachment-0001.obj>