[Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
siarhei.siamashka at gmail.com
Fri Sep 3 03:28:10 PDT 2010
On Friday 03 September 2010 11:53:47 Xu, Samuel wrote:
> >* Store forwarding
> > - We need some comments in the assembly about the store forwarding
> > that Ma Ling described.
> How about this comments added to asm code:
> "CPU doesn't check each address bit for src and dest, so read could fail to
> recognize different address even they are in different page, read
> operation have to force previous write operation to commit data from store
> buffer to cache, the process impact performance seriously. This is work
> around to avoid this cpu limitation."
> > - Siarhei's questions about it should be answered, preferably in
> > those comments:
> > So it is basically a store forwarding aliasing problem which is
> > described in "Store Forwarding" section from "Intel(r) 64
> > and IA-32 Architectures Optimization Reference Manual", right?
> The behavior doesn't belong to store forward which is described in
> "Store Forwarding" section from "Intel(r) 64 and IA-32
> Architectures Optimization Reference Manual". That is another HW
Could you elaborate? Preferably with a reference to the relevant section of the
optimization manual. Because it looks like exactly the store forwarding
address aliasing issue to me.
My understanding is that the Intel Atom processor has some special logic for
fast handling of a quite common read-after-write memory access pattern (for
example, operations on local variables on the stack). Unfortunately, only the
lowest 12 bits of the address are used for the initial selection of this fast
path. If the match turns out to be a false positive later in the pipeline,
there is some performance penalty for handling this situation correctly.
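As a rough illustration of the 12-bit comparison described above, here is a small C sketch. The helper name low12_bits_alias is hypothetical (it is not part of pixman or of any Intel documentation), and it only models the address check, not the hardware behaviour or its timing.

```c
#include <stdint.h>

/*
 * Hypothetical helper: returns nonzero when two addresses share the same
 * lowest 12 bits, i.e. when they sit at the same offset within a 4 KiB
 * window. This only models the partial address comparison discussed above.
 */
static int low12_bits_alias(const void *store_addr, const void *load_addr)
{
    uintptr_t diff = (uintptr_t)store_addr ^ (uintptr_t)load_addr;

    /* Only bits 0..11 take part in the comparison. */
    return (diff & 0xFFFu) == 0;
}
```

For a buffer pointer p, low12_bits_alias(p, p + 4096) is nonzero, while low12_bits_alias(p, p + 4096 + 64) is zero. These are the two source offsets discussed in this message.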
Here is a simple test program which can be used for benchmarking:
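    lea edi, buffer
#ifdef ALIASING
    lea esi, [edi+4096]       /* same lowest 12 address bits as edi */
#else
    lea esi, [edi+4096+64]    /* lowest 12 address bits differ from edi */
#endif
    mov ecx, 1660000000 /* 1.66GHz */
1:  /* main loop */
#ifdef SSE
    movd dword ptr [edi], xmm0
    movd dword ptr [edi+4], xmm1
    movd xmm0, dword ptr [esi]
    movd xmm1, dword ptr [esi+4]
#else
    mov dword ptr [edi], eax
    mov dword ptr [edi+4], ebx
    mov eax, dword ptr [esi]
    mov ebx, dword ptr [esi+4]
#endif
    /* loop control (assumed; not preserved in the archived listing) */
    dec ecx
    jnz 1b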
$ gcc -m32 -o bench-x86-noaliasing bench-storefw.S
$ gcc -m32 -DALIASING -o bench-x86-aliasing bench-storefw.S
$ gcc -m32 -DSSE -o bench-sse-noaliasing bench-storefw.S
$ gcc -m32 -DSSE -DALIASING -o bench-sse-aliasing bench-storefw.S
$ time ./bench-x86-noaliasing
$ time ./bench-x86-aliasing
$ time ./bench-sse-noaliasing
$ time ./bench-sse-aliasing
So each loop iteration always takes 4 cycles, except when the standard x86 mov
instruction is used and the lowest 12 bits of the store and load addresses
alias. The SSE instructions movd/movss do not have any aliasing problems.
> > Wouldn't just the use of MOVD/MOVSS instructions here also solve
> > this problem? Store forwarding does not seem to be used for SIMD
> > according to the manual. I haven't benchmarked anything yet
> > though.
> > http://lists.cairographics.org/archives/pixman/2010-August/000425.html
I still think this needs to be clarified. Using movd or movss instructions
there has the additional benefit that you can use the same operation on pixel
data in all cases.
Currently you use "por %xmm6, %xmm0" in the main loop and "or 40(%rsi), %ecx"
for the trailing pixels. You could easily use just the "por" instruction
everywhere. It's not a big deal here, just inconsistent. But with your approach
and more complex compositing operations, you would need to maintain both sse2
and x86 implementations separately for the main loop and for the trailing
pixels. This can and IMHO should be avoided.
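To make the point about using one operation everywhere more concrete, here is a minimal C sketch with SSE2 intrinsics. It is not the code from the patch under review: the function name fetch_x8r8g8b8_sketch is hypothetical, unaligned loads and stores are assumed for simplicity, and the compiler decides which instructions are actually emitted. The trailing pixels go through _mm_cvtsi32_si128 / _mm_cvtsi128_si32 (the movd intrinsics), so the same _mm_or_si128 (por) is applied in both loops.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: copy n x8r8g8b8 pixels from src to dst and force
 * the alpha byte to 0xff. The same SSE2 OR is used for the 4-pixel main
 * loop and for the trailing pixels.
 */
static void fetch_x8r8g8b8_sketch(uint32_t *dst, const uint32_t *src, int n)
{
    const __m128i alpha = _mm_set1_epi32((int)0xff000000u);
    int i = 0;

    /* Main loop: 4 pixels per iteration, unaligned access assumed. */
    for (; i + 4 <= n; i += 4) {
        __m128i p = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_or_si128(p, alpha));
    }

    /* Trailing pixels: one pixel at a time through an SSE register. */
    for (; i < n; i++) {
        int32_t v;
        __m128i p;

        memcpy(&v, src + i, sizeof v);
        p = _mm_cvtsi32_si128(v);                 /* movd into xmm */
        p = _mm_or_si128(p, alpha);               /* same por as above */
        v = _mm_cvtsi128_si32(p);                 /* movd back out */
        memcpy(dst + i, &v, sizeof v);
    }
}
```

With a more complex compositing operation, only the body applied to p would change, and both loops would keep sharing it.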