[Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
Ma, Ling
ling.ma at intel.com
Tue Sep 7 04:03:52 PDT 2010
Hi Siarhei Siamashka
> Could you elaborate? Preferably with a reference to the relevant section of the
> optimization manual. Because it looks like exactly the store forwarding address
> aliasing issue to me.
[Ma Ling]: It is not described in our optimization manual yet; it will be published in a new version soon.
> My understanding is that Intel Atom processor has some special logic for fast
> handling of a quite common read-after-write memory access pattern (some
> operations with the local variables on stack for example). Unfortunately only
> the lowest 12 bits of the address are used for the initial selection of this fast
> path. And in the case if it turns out to be a false positive later in the pipeline,
> there is some performance penalty to handle this situation correctly.
[Ma Ling]: Intel does have this HW optimization, store forwarding, as you mentioned, but it comes with
some constraints: the feature doesn't cover the cases we described before.
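
To make the aliasing condition under discussion concrete, here is a minimal sketch in C. It is not taken from pixman or from the Intel manual. The helper name low12_alias and the mask constant are hypothetical, and the 12-bit width is simply the figure quoted above. The helper reports whether a load address matches a different store address in its lowest 12 bits, which is the situation the test program below sets up with a 4096-byte offset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption from the discussion above: only the lowest 12 address bits
 * are compared for the initial fast-path selection. */
#define LOW12_MASK ((uintptr_t)0xFFF)

/* Hypothetical helper: true when `load` and `store` are different
 * addresses that nevertheless agree in their lowest 12 bits. */
static bool low12_alias(const void *store, const void *load)
{
    uintptr_t s = (uintptr_t)store;
    uintptr_t l = (uintptr_t)load;

    return s != l && ((s ^ l) & LOW12_MASK) == 0;
}
```

With a buffer laid out as in the test program, low12_alias(buffer, buffer + 4096) is true, while low12_alias(buffer, buffer + 4096 + 64) is false.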
> Here is a simple test program which can be used for benchmarking:
>
> /*******************************************************/
> .intel_syntax noprefix
> .global main
>
> .data
>
> .balign 16
> buffer:
> .rept 8192
> .byte 0
> .endr
>
> .text
>
> main:
> pusha
>
> lea edi, buffer
> #ifdef ALIASING
> lea esi, [edi+4096]
> #else
> lea esi, [edi+4096+64]
> #endif
> mov ecx, 1660000000 /* 1.66GHz */
> jmp 1f
>
> /* main loop */
> .balign 16
> 1:
> #ifdef SSE
> movd dword ptr [edi], xmm0
> movd dword ptr [edi+4], xmm1
> movd xmm0, dword ptr [esi]
> movd xmm1, dword ptr [esi+4]
> #else
> mov dword ptr [edi], eax
> mov dword ptr [edi+4], ebx
> mov eax, dword ptr [esi]
> mov ebx, dword ptr [esi+4]
> #endif
> dec ecx
> jnz 1b
>
> popa
> ret
> /*******************************************************/
>
> $ gcc -m32 -o bench-x86-noaliasing bench-storefw.S
> $ gcc -m32 -DALIASING -o bench-x86-aliasing bench-storefw.S
> $ gcc -m32 -DSSE -o bench-sse-noaliasing bench-storefw.S
> $ gcc -m32 -DSSE -DALIASING -o bench-sse-aliasing bench-storefw.S
>
> $ time ./bench-x86-noaliasing
>
> real 0m4.057s
> user 0m4.032s
> sys 0m0.000s
>
> $ time ./bench-x86-aliasing
>
> real 0m11.059s
> user 0m11.041s
> sys 0m0.008s
>
> $ time ./bench-sse-noaliasing
>
> real 0m4.048s
> user 0m4.036s
> sys 0m0.004s
>
> $ time ./bench-sse-aliasing
>
> real 0m4.046s
> user 0m4.036s
> sys 0m0.004s
>
> So each loop iteration always takes 4 cycles except when using standard x86
> mov instruction with the aliasing of the lowest 12 bits in the address. SSE
> instructions movd/movss do not have any aliasing problems.
>
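
The cycle counts quoted above follow from simple arithmetic, sketched here in C. The function name cycles_per_iteration is hypothetical, and the calculation assumes the core really runs at a constant 1.66 GHz, as the comment in the test program implies. Since the loop count equals the assumed clock rate, the run time in seconds is numerically the cycle count per iteration.

```c
/* Cycles spent per loop iteration, given the measured run time in seconds,
 * the assumed core clock in Hz, and the number of iterations executed. */
static double cycles_per_iteration(double seconds, double clock_hz,
                                   double iterations)
{
    return seconds * clock_hz / iterations;
}
```

For the timings above, cycles_per_iteration(4.057, 1.66e9, 1.66e9) gives about 4.06 cycles, and cycles_per_iteration(11.059, 1.66e9, 1.66e9) gives about 11.06 cycles for the aliasing x86 mov case.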
> > > Wouldn't just the use of MOVD/MOVSS instructions here also solve
> > > this problem? Store forwarding does not seem to be used for SIMD
> > > according to the manual. I haven't benchmarked anything yet
> > > though.
> > >
> > >
> > > http://lists.cairographics.org/archives/pixman/2010-August/000425.html
>
> I still think this needs to be clarified. Using movd or movss instructions there
> has an additional benefit that you can use the same operation on pixel data in
> all cases.
[Ma Ling]: movd is not a good choice on Atom because it uses the AGU. Section 12.3.2.2, Address Generation, says:
" * Integer-FP/SIMD transfer: Instructions that transfer integer data to the
FP/SIMD side of the machine also uses AGU. Examples of these instructions
include MOVD, PINSRW. If one of the source register of these instructions
depends on the result of an execution unit, this dependency will also cause a
delay of 3 cycles."
Best Regards
Ling