[Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Tue Sep 7 04:03:52 PDT 2010

Hi Siarhei Siamashka

> Could you elaborate? Preferably with a reference to the relevant section of the
> optimization manual. Because it looks like exactly the store forwarding address
> aliasing issue to me.
[Ma Ling]: It is not described in our current optimization manual; it will be published in a new version soon.

> My understanding is that the Intel Atom processor has some special logic for fast
> handling of a quite common read-after-write memory access pattern (some
> operations on local variables on the stack, for example). Unfortunately, only
> the lowest 12 bits of the address are used for the initial selection of this fast
> path. If the match later turns out to be a false positive in the pipeline,
> there is a performance penalty for handling the situation correctly.

[Ma Ling]: Intel does have this HW optimization, store forwarding, as you mentioned, but it has
some constraints, and the feature does not cover the cases we described before.
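
To make the 12-bit aliasing condition from the quoted description concrete, here is a minimal C sketch. It is not from the original thread: the helper name low12_alias is hypothetical, and the 0xFFF mask simply encodes the "lowest 12 bits" assumption stated in the quote above, not a documented hardware interface.

```c
#include <stdint.h>

/* Assumption taken from the quoted description: only the lowest 12 bits
   of the address are used for the initial store-forwarding match. */
#define ALIAS_MASK ((uintptr_t)0xFFF)

/* Hypothetical helper: returns 1 when the store and load addresses differ
   but agree in their lowest 12 bits (a potential false positive),
   and 0 otherwise. */
int low12_alias(uintptr_t store_addr, uintptr_t load_addr)
{
    if (store_addr == load_addr)
        return 0;  /* same address: genuine forwarding, not a false match */
    return ((store_addr ^ load_addr) & ALIAS_MASK) == 0;
}
```

With the offsets used in the test program quoted below, low12_alias(p, p + 4096) is 1 (the ALIASING build) and low12_alias(p, p + 4096 + 64) is 0 (the default build).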

> Here is a simple test program which can be used for benchmarking:
> 
> /*******************************************************/
> .intel_syntax noprefix
> .global main
> 
> .data
> 
> .balign 16
> buffer:
> .rept 8192
> .byte 0
> .endr
> 
> .text
> 
> main:
>     pusha
> 
>     lea    edi, buffer
> #ifdef ALIASING
>     lea    esi, [edi+4096]
> #else
>     lea    esi, [edi+4096+64]
> #endif
>     mov    ecx, 1660000000  /* 1.66GHz */
>     jmp    1f
> 
>     /* main loop */
>     .balign 16
> 1:
> #ifdef SSE
>     movd   dword ptr [edi], xmm0
>     movd   dword ptr [edi+4], xmm1
>     movd   xmm0, dword ptr [esi]
>     movd   xmm1, dword ptr [esi+4]
> #else
>     mov    dword ptr [edi], eax
>     mov    dword ptr [edi+4], ebx
>     mov    eax, dword ptr [esi]
>     mov    ebx, dword ptr [esi+4]
> #endif
>     dec    ecx
>     jnz    1b
> 
>     popa
>     ret
> /*******************************************************/
> 
> $ gcc -m32 -o bench-x86-noaliasing bench-storefw.S
> $ gcc -m32 -DALIASING -o bench-x86-aliasing bench-storefw.S
> $ gcc -m32 -DSSE -o bench-sse-noaliasing bench-storefw.S
> $ gcc -m32 -DSSE -DALIASING -o bench-sse-aliasing bench-storefw.S
> 
> $ time ./bench-x86-noaliasing
> 
> real    0m4.057s
> user    0m4.032s
> sys     0m0.000s
> 
> $ time ./bench-x86-aliasing
> 
> real    0m11.059s
> user    0m11.041s
> sys     0m0.008s
> 
> $ time ./bench-sse-noaliasing
> 
> real    0m4.048s
> user    0m4.036s
> sys     0m0.004s
> 
> $ time ./bench-sse-aliasing
> 
> real    0m4.046s
> user    0m4.036s
> sys     0m0.004s
> 
> So each loop iteration always takes 4 cycles, except when the standard x86
> mov instruction is used with aliasing of the lowest 12 bits of the address. SSE
> instructions movd/movss do not have any aliasing problems.
> 
> > >     Wouldn't just the use of MOVD/MOVSS instructions here also solve
> > >     this problem?  Store forwarding does not seem to be used for SIMD
> > >     according to the manual. I haven't benchmarked anything yet
> > >     though.
> > >
> > >
> > > http://lists.cairographics.org/archives/pixman/2010-August/000425.html
> 
> I still think this needs to be clarified. Using movd or movss instructions there
> has an additional benefit that you can use the same operation on pixel data in
> all cases.

[Ma Ling]: movd is not a good choice on Atom because it uses the AGU:
" * Integer-FP/SIMD transfer: Instructions that transfer integer data to the
FP/SIMD side of the machine also uses AGU. Examples of these instructions
include MOVD, PINSRW. If one of the source register of these instructions
depends on the result of an execution unit, this dependency will also cause a
delay of 3 cycles." (Section 12.3.2.2, "Address Generation")

Best Regards
Ling