[Pixman] [ssse3] Optimization for fetch_scanline_x8r8g8b8

Fri Sep 3 03:28:10 PDT 2010

On Friday 03 September 2010 11:53:47 Xu, Samuel wrote:
> >* Store forwarding
> >
> >  - We need some comments in the assembly about the store forwarding
> >    that Ma Ling described.
> 
> How about this comments added to asm code:
> "CPU doesn't check each address bit for src and dest, so read could fail to
> recognize different address even they are in different page, read
> operation have to force previous write operation to commit data from store
> buffer to cache, the process impact performance seriously. This is work
> around to avoid this cpu limitation."
> 
> >  - Siarhei's questions about it should be answered, preferably in
> >    those comments:
> >     So it is basically a store forwarding aliasing problem which is
> >     described in "12.3.3.1 Store Forwarding" section from "Intel(r) 64
> >     and IA-32 Architectures Optimization Reference Manual", right?
> 
> The behavior doesn't belong to store forward which is described in
> "12.3.3.1 Store Forwarding" section from "Intel(r) 64 and IA-32
> Architectures Optimization Reference Manual". That is another HW
> optimization.

Could you elaborate, preferably with a reference to the relevant section of the
optimization manual? To me it looks exactly like the store forwarding address
aliasing issue.

My understanding is that the Intel Atom processor has some special logic for
fast handling of a quite common read-after-write memory access pattern (for
example, some operations on local variables on the stack). Unfortunately, only
the lowest 12 bits of the address are used for the initial selection of this
fast path. If the match turns out to be a false positive later in the pipeline,
there is a performance penalty for handling the situation correctly.
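To make the condition concrete, here is a minimal sketch in C of the check as I
understand it. The helper name low12_alias is made up for illustration, and the
exact hardware comparison may well differ from this simple model:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of pixman): returns true when a store
 * address and a later load address have identical lowest 12 bits, which
 * is the case assumed here to trigger the false positive. */
static bool low12_alias(uintptr_t store_addr, uintptr_t load_addr)
{
    /* XOR leaves a 1 in every bit position where the addresses differ;
     * masking with 0xFFF keeps only the lowest 12 bits of that result. */
    return ((store_addr ^ load_addr) & 0xFFFu) == 0;
}
```

Under this model a load from buffer+4096 aliases with a store to buffer, while
a load from buffer+4096+64 does not.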

Here is a simple test program which can be used for benchmarking:

/*******************************************************/
.intel_syntax noprefix
.global main

.data

.balign 16
buffer:
.rept 8192
.byte 0
.endr

.text

main:
    pusha

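    lea    edi, buffer          /* edi = store address */
#ifdef ALIASING
    lea    esi, [edi+4096]      /* load address, same lowest 12 bits as edi */
#else
    lea    esi, [edi+4096+64]   /* load address, different lowest 12 bits */
#endif
    mov    ecx, 1660000000      /* iteration count, matches a 1.66GHz clock */
    jmp    1f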

    /* main loop */
    .balign 16
1:
#ifdef SSE
    movd   dword ptr [edi], xmm0
    movd   dword ptr [edi+4], xmm1
    movd   xmm0, dword ptr [esi]
    movd   xmm1, dword ptr [esi+4]
#else
    mov    dword ptr [edi], eax
    mov    dword ptr [edi+4], ebx
    mov    eax, dword ptr [esi]
    mov    ebx, dword ptr [esi+4]
#endif
    dec    ecx
    jnz    1b

    popa
    ret
/*******************************************************/

$ gcc -m32 -o bench-x86-noaliasing bench-storefw.S 
$ gcc -m32 -DALIASING -o bench-x86-aliasing bench-storefw.S 
$ gcc -m32 -DSSE -o bench-sse-noaliasing bench-storefw.S 
$ gcc -m32 -DSSE -DALIASING -o bench-sse-aliasing bench-storefw.S 

$ time ./bench-x86-noaliasing 

real    0m4.057s
user    0m4.032s
sys     0m0.000s

$ time ./bench-x86-aliasing 

real    0m11.059s
user    0m11.041s
sys     0m0.008s

$ time ./bench-sse-noaliasing 

real    0m4.048s
user    0m4.036s
sys     0m0.004s

$ time ./bench-sse-aliasing 

real    0m4.046s
user    0m4.036s
sys     0m0.004s

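So each loop iteration takes 4 cycles (assuming the 1.66GHz clock that the
iteration count is based on), except when the standard x86 mov instruction is
used and the lowest 12 bits of the store and load addresses alias. In that case
an iteration takes about 11 cycles. The SSE instructions movd/movss do not have
any aliasing problems.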
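The cycle counts follow directly from the timings. A small sketch of the
arithmetic in C, assuming the CPU really runs at a constant 1.66GHz (the
function name is hypothetical):

```c
/* Hypothetical helper: converts a measured run time into cycles per loop
 * iteration. Assumes a fixed clock rate with no frequency scaling. */
static double cycles_per_iteration(double seconds, double clock_hz,
                                   double iterations)
{
    /* total cycles spent = seconds * clock_hz */
    return seconds * clock_hz / iterations;
}
```

With 1660000000 iterations at 1.66e9 Hz the result is numerically equal to the
run time in seconds, so the 4.032s and 11.041s user times correspond to roughly
4 and 11 cycles per iteration.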

> >     Wouldn't just the use of MOVD/MOVSS instructions here also solve
> >     this problem?  Store forwarding does not seem to be used for SIMD
> >     according to the manual. I haven't benchmarked anything yet
> >     though.
> >     
> >    http://lists.cairographics.org/archives/pixman/2010-August/000425.html

I still think this needs to be clarified. Using movd or movss instructions
there has the additional benefit that you can use the same operation on pixel
data in all cases.

Currently you use "por %xmm6, %xmm0" in the main loop and "or 40(%rsi), %ecx"
for the trailing pixels. You could easily use just the "por" instruction
everywhere. It's not a big deal here, just inconsistent. But with your approach
and more complex compositing operations, you would need to maintain both sse2
and x86 implementations separately for the main loop and for the trailing
pixels. This can and IMHO should be avoided.
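As a rough illustration of what I mean, here is a sketch using SSE2 intrinsics
in C rather than the assembly from the patch. The function name and the
0xff000000 alpha mask are my assumptions about what the fetcher does, and this
is not the actual pixman code:

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical sketch: sets the alpha byte of x8r8g8b8 pixels to 0xff,
 * using the same SSE2 OR operation for the main loop and the tail. */
static void fetch_x8r8g8b8_sketch(const uint32_t *src, uint32_t *dst, int n)
{
    const __m128i alpha = _mm_set1_epi32((int)0xff000000u);
    int i = 0;

    /* main loop: 4 pixels at a time (movdqu + por) */
    for (; i + 4 <= n; i += 4) {
        __m128i p = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_or_si128(p, alpha));
    }

    /* trailing pixels: 1 pixel at a time (movd + por), no x86 "or" */
    for (; i < n; i++) {
        __m128i p = _mm_cvtsi32_si128((int)src[i]);
        dst[i] = (uint32_t)_mm_cvtsi128_si32(_mm_or_si128(p, alpha));
    }
}
```

The pixel operation is written once as _mm_or_si128 and only the load/store
width differs between the two loops.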

-- 
Best regards,
Siarhei Siamashka