[Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Mon Nov 29 10:59:52 PST 2010

On Wednesday 17 November 2010 07:47:39 Xu, Samuel wrote:
> Hi, Soeren Sandmann and Siarhei Siamashka:
> 	Glad to send out this refreshed patch to address points we discussed 
> for this SSSE3 patch.

Thanks. And sorry for a rather late reply.

> In this new patch, we merged 32 bit and 64 bit asm code
> into unified code (Macros are moved to header file, and make them more
> reusable and easily to be read).

Actually I'm a bit concerned about how the code looks now. There are still 
similar looking parts in 32-bit and 64-bit implementation which could be
shared:

+/* the macro definition of shift copy in stage one . 32bytes.
+ * Two 16-bytes XMM register will be packed into another 16-byte by using 
palignr
+ * instruction. */
+.macro SHL_COPY_STAGE_ONE x
+       movaps  16(%rsi), %xmm2
+       sub     $32, %rdx
+       movaps  32(%rsi), %xmm3
+       lea     32(%rsi), %rsi
+       movdqa  %xmm3, %xmm4
+       palignr \x, %xmm2, %xmm3
+       lea     32(%rdi), %rdi
+       palignr \x, %xmm1, %xmm2
+       por     %xmm6, %xmm2
+       movaps  %xmm2, -32(%rdi)
+       por     %xmm6, %xmm3
+       movaps  %xmm3, -16(%rdi)
+.endm

and

+/* the macro definition of shift copy in stage one . 32bytes */
+.macro SHL_COPY_STAGE_ONE x
+       movaps  16(%eax), %xmm2
+       sub     $32, %ecx
+       movaps  32(%eax), %xmm3
+       lea     32(%eax), %eax
+       movdqa  %xmm3, %xmm4
+       palignr \x, %xmm2, %xmm3
+       lea     32(%edx), %edx
+       palignr \x, %xmm1, %xmm2
+       por     %xmm6, %xmm2
+       movaps  %xmm2, -32(%edx)
+       por     %xmm6, %xmm3
+       movaps  %xmm3, -16(%edx)
+.endm

The whole point of using macros is to make the code more maintainable, readable
and easy to extend. Just splitting the code in chunks, putting them into macros
with just the primary goal to reduce the total number of lines may actually
have an opposite effect and result in a rather obfuscated sources instead.

I don't see an obvious and clean way how to extend this code to support
more compositing operations.

After all, the implemented code is just only *one* of a huge variety of
compositing operations supported by pixman. And this particular operation
even should not be normally used much on properly written code. Basically, this
is a simple conversion from x8r8g8b8 -> a8r8g8b8 format (by explicitly setting
alpha channel value to 255, instead of having it undefined for x8r8g8b8). If it
is done as an intermediate step somewhere in the rendering pipeline, then it's
just a waste of cpu cycles (no matter if it is SSSE3 optimized or not, it
should preferably be just eliminated completely). If some other compositing
function expects a8r8g8b8 input, but is given x8r8g8b8, then this function can
be trivially extended to also support x8r8g8b8, at worst by adding just a
single OR instruction in it. And at best, it may even become faster after this
change (only 3 color components need to be processed instead of 4). I already
have mentioned it earlier regarding your use case [1]. 

Anyway, if the goal is to seriously improve pixman performance on Intel Atom,
then a lot more different operations need to be optimized too. Another simple
operation which is very similar to 'src_x888_8888' and still somewhat
practically useful is 'add_8888_8888' as mentioned earlier in [2]. Your code
does not seem to be ready to support it without major modifications.

I have attached some simple example of providing unified support for both
32-bit and 64-bit systems and implementing multiple operations. Surely, this
example is missing non-executable stack markers, cfe directives and does not
really utilize SIMD to get good performance. But probably something like this
may be used as a starting point? With all the bells and whistles added,
eventually it might start to resemble current pixman ARM NEON code a lot. 

That said, I actually don't want to advocate any particular way of implementing
this stuff (prefer dynamic JIT code generation or assembly macros, AT&T or
Intel syntax, etc). It's just my vision about what features it should 
eventually have. But if your code works correctly, does not add maintenance
headaches (the part responsible for interacting with the rest of pixman must
be clean) and provides some measurable performance improvement, I would be fine
with it too. After all, it could be always safely scrapped and replaced with
something better if needed. But more about this later.

> We also fixed issue like CPU detection/git log/make file checking/file
> name...etc.

That would be nice to have as a separate patch. It has much higher chances
to be accepted and committed first.

> For MOVD, we simplified the backward copy code since pervious code is too
> long and not gain obvious performance,

And this is what I'm worried about. First you proposed this part of code
(backwards copy) without providing any initial explanation or benchmark
numbers. And now you abandon it, also without providing any benchmark
numbers or description of your test. But because you still have x86
instructions there instead of MOVD, store forwarding aliasing surely may
affect performance, as demonstrated by my simple test program posted
earlier [3] as part of the discussion. 

Can we be sure that this SSSE3 patch actually provides any practical 
performance improvement over the current pixman code? Considering that
compared to your initial profiling data, now SSE2 optimized code is used
for this operation and also problematic software prefetch which was almost
halving [4] effective memory bandwidth for this operation is now
eliminated [5].

Surely, using 'test/lowlevel-blt-bench' microbenchmark it showed good
improvement for L1 cached data. But on the other hand, performance of small 
random operations has dropped (shown as 'R' statistics). Maybe that's happening
because of the scanline function call overhead and this fwd/bwd special 
handling for small buffer sizes? So a final verification with some real
practical use case would be interesting.

As for the pixman code which is really worth optimizing on Intel Atom,
it is 'over_8888_8888/over_8888_x8888' operation (function 
'sse2_composite_over_8888_8888'). This is one of the most commonly
used functions. And the microbenchmark numbers measured in millions
of pixels per second when processing data in L1 cache, in L2 cache and
in memory by using 'lowlevel-blt-bench' test program are the following:

== Intel Atom N450 @1667MHz, DDR2-667 (64-bit) ==

           add_8888_8888 =  L1: 607.08  L2: 375.34  M:259.53
          over_8888_x888 =  L1: 123.73  L2: 117.10  M:113.56
          over_8888_0565 =  L1: 106.11  L2:  98.91  M: 99.07

== TI OMAP3430/3530, ARM Cortex-A8 @500MHz, LPDDR @166MHz (32-bit) ==

    default build:
           add_8888_8888 =  L1: 227.26  L2:  84.71  M: 44.54
          over_8888_x888 =  L1: 161.06  L2:  88.20  M: 44.86
          over_8888_0565 =  L1: 127.02  L2:  93.99  M: 61.25

    software prefetch disabled (*):
           add_8888_8888 =  L1: 351.44  L2:  97.29  M: 25.35
          over_8888_x888 =  L1: 168.72  L2:  95.04  M: 24.81
          over_8888_0565 =  L1: 128.06  L2:  98.96  M: 32.16

    (*) ARM code uses relatively complex cross-scanline software prefetch
        intended to maximize the use of the available memory bandwidth (it's
        normally expected to be about as fast as the best available memcpy
        implementation, though it was tuned for early OMAP3 chips and may need
        to be updated for optimal performance on the other devices). This code
        is typically executed in a separate pipeline simultaneously with NEON
        instructions and has practically no overhead, but still becomes visible
        for extremely simple operations such as 'add'.

It is interesting here that 'add_8888_8888' and 'over_8888_x888' are accessing
memory in exactly the same way (read 32-bit pixel from both source and
destination, write result to destination), while 'add' is much less
computationally intensive than 'over'.

On ARM, both 'add' and 'over' operations perform exactly the same when working
with the data in memory, which shows that they are memory bandwidth limited.
While on Atom 'over' is clearly limited by CPU performance and this is
something which can be possibly improved. Surely Atom still wins against ARM in
absolute numbers for 'over' operation, but it probably can do a lot better than
this.

Another note: this microbenchmark also shows that there may be little practical
gain in optimizing 'add_8888_8888' operation as it is already memory bandwidth
limited.

1. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00357.html
2. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00378.html
3. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00415.html
4. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00404.html
5. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00563.html

-- 
Best regards,
Siarhei Siamashka
-------------- next part --------------
.intel_syntax noprefix
.altmacro

/* x86/x86-64 abstraction (linux only) */
#ifdef __amd64__
    .macro ENTRANCE
	.set DST_PTR, rdi
	.set SRC_PTR, rsi
	.set PIXEL_COUNTER, rdx
    .endm
#else
    .macro ENTRANCE
	mov	eax, [esp + 4]
	mov	ecx, [esp + 8]
	mov	edx, [esp + 12]
	.set DST_PTR, eax
	.set SRC_PTR, ecx
	.set PIXEL_COUNTER, edx
    .endm
#endif

/* scanline function generator macro */
.macro generate_scanline_function function_name, prepare, composite, need_dst=0
    .global function_name
    function_name:
	ENTRANCE

	lea	SRC_PTR, [SRC_PTR + PIXEL_COUNTER * 4]
	lea	DST_PTR, [DST_PTR + PIXEL_COUNTER * 4]
	neg	PIXEL_COUNTER

	prepare
    1:
	/* load source pixel */
	movd	xmm0, dword ptr [SRC_PTR + PIXEL_COUNTER * 4]
	/* load destination pixel if needed */
	.if need_dst
	movd	xmm1, dword ptr [DST_PTR + PIXEL_COUNTER * 4]
	.endif

	/* do something */
	composite

	/* write back the result */
	movd	dword ptr [DST_PTR + PIXEL_COUNTER * 4], xmm0

	add	PIXEL_COUNTER, 1
	js	1b

	ret
.endm

/*
 * instantiate "void simple_copy32_or(int *dst, int *src, int number_of_pixels)" function
 * (performs '*dst = *src | 0xFF000000' operation on pixels)
 */

cff000000:
.long	0xFF000000
.macro prepare_simple_copy32_or
	movd	xmm2, cff000000
.endm
.macro composite_simple_copy32_or
	por	xmm0, xmm2
.endm
generate_scanline_function simple_copy32_or, prepare_simple_copy32_or, composite_simple_copy32_or

/*
 * instantiate "void simple_copy32_add(int *dst, int *src, int number_of_pixels)" function
 * (performs '*dst = unsigned_saturated_8bit_vector_add(*dst, *src)' operation on pixels)
 */

.macro prepare_simple_copy32_add
.endm
.macro composite_simple_copy32_add
	paddusb xmm0, xmm1
.endm
generate_scanline_function simple_copy32_add, prepare_simple_copy32_add, composite_simple_copy32_add, 1
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part.
URL: <http://lists.freedesktop.org/archives/pixman/attachments/20101129/c40abba6/attachment.pgp>