[Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
Siarhei Siamashka
siarhei.siamashka at gmail.com
Fri Sep 3 03:28:10 PDT 2010
On Friday 03 September 2010 11:53:47 Xu, Samuel wrote:
> >* Store forwarding
> >
> > - We need some comments in the assembly about the store forwarding
> >
> > that Ma Ling described.
>
> How about this comments added to asm code:
> "CPU doesn't check each address bit for src and dest, so read could fail to
> recognize different address even they are in different page, read
> operation have to force previous write operation to commit data from store
> buffer to cache, the process impact performance seriously. This is work
> around to avoid this cpu limitation."
>
> > - Siarhei's questions about it should be answered, preferably in
> >
> > those comments:
> > So it is basically a store forwarding aliasing problem which is
> > described in "12.3.3.1 Store Forwarding" section from "Intel(r) 64
> > and IA-32 Architectures Optimization Reference Manual", right?
>
> The behavior doesn't belong to store forward which is described in
> "12.3.3.1 Store Forwarding" section from "Intel(r) 64 and IA-32
> Architectures Optimization Reference Manual". That is another HW
> optimization.
Could you elaborate, preferably with a reference to the relevant section of the
optimization manual? To me it looks like exactly the store forwarding address
aliasing issue.
My understanding is that the Intel Atom processor has special logic for fast
handling of a quite common read-after-write memory access pattern (for example,
operations on local variables on the stack). Unfortunately, only the lowest
12 bits of the address are used for the initial selection of this fast path.
If the match turns out to be a false positive later in the pipeline, there is
a performance penalty for handling the situation correctly.
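
To make the suspected condition concrete, here is a minimal C sketch of the
check as I understand it. The 12-bit mask and the function name are my own
assumptions for illustration, not something taken from the Intel manual: a
load is a false alias when it shares the lowest 12 address bits with the
preceding store but targets a different address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: only address bits 0..11 take part in the initial check. */
#define LOW12_MASK ((uintptr_t)0xFFF)

/* Hypothetical helper: true when the load address differs from the store
 * address but matches it in the lowest 12 bits (a false positive). */
bool false_alias_low12(uintptr_t store_addr, uintptr_t load_addr)
{
    return store_addr != load_addr &&
           (store_addr & LOW12_MASK) == (load_addr & LOW12_MASK);
}

/* The two load offsets that the test program in this message compares. */
void check_benchmark_offsets(void)
{
    static unsigned char buffer[8192];
    uintptr_t store = (uintptr_t)buffer;

    assert(false_alias_low12(store, store + 4096));       /* aliasing case */
    assert(!false_alias_low12(store, store + 4096 + 64)); /* no aliasing   */
}
```

An offset of exactly 4096 bytes leaves the lowest 12 bits unchanged, while
adding another 64 bytes changes them, which is why these two offsets are
enough to separate the two cases.
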
Here is a simple test program which can be used for benchmarking:
/*******************************************************/
        .intel_syntax noprefix
        .global main
        .data
        .balign 16
buffer:                                 /* 8 KiB of zeroed scratch memory */
        .rept 8192
        .byte 0
        .endr
        .text
main:
        pusha
        lea     edi, buffer             /* edi = store address */
#ifdef ALIASING
        lea     esi, [edi+4096]         /* load address, same lowest 12 bits */
#else
        lea     esi, [edi+4096+64]      /* load address, different lowest 12 bits */
#endif
        /* 1.66e9 iterations: on a 1.66GHz CPU, seconds = cycles per iteration */
        mov     ecx, 1660000000
        jmp     1f
        /* main loop: two stores to [edi], then two loads from [esi] */
        .balign 16
1:
#ifdef SSE
        movd    dword ptr [edi], xmm0
        movd    dword ptr [edi+4], xmm1
        movd    xmm0, dword ptr [esi]
        movd    xmm1, dword ptr [esi+4]
#else
        mov     dword ptr [edi], eax
        mov     dword ptr [edi+4], ebx
        mov     eax, dword ptr [esi]
        mov     ebx, dword ptr [esi+4]
#endif
        dec     ecx
        jnz     1b
        popa
        ret
/*******************************************************/
$ gcc -m32 -o bench-x86-noaliasing bench-storefw.S
$ gcc -m32 -DALIASING -o bench-x86-aliasing bench-storefw.S
$ gcc -m32 -DSSE -o bench-sse-noaliasing bench-storefw.S
$ gcc -m32 -DSSE -DALIASING -o bench-sse-aliasing bench-storefw.S
$ time ./bench-x86-noaliasing
real 0m4.057s
user 0m4.032s
sys 0m0.000s
$ time ./bench-x86-aliasing
real 0m11.059s
user 0m11.041s
sys 0m0.008s
$ time ./bench-sse-noaliasing
real 0m4.048s
user 0m4.036s
sys 0m0.004s
$ time ./bench-sse-aliasing
real 0m4.046s
user 0m4.036s
sys 0m0.004s
So each loop iteration takes 4 cycles in every case except one: the standard
x86 mov instruction with the lowest 12 bits of the store and load addresses
aliasing, which takes about 11 cycles per iteration. SSE instructions
movd/movss do not have any aliasing problems (only movd is measured here).
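
For reference, the conversion from the timings above to cycles per iteration
can be sketched in C as follows. It assumes the CPU really runs at a constant
1.66GHz during the test, as the comment in the test program implies, and the
function name is just illustrative.

```c
#include <assert.h>

/* cycles per iteration = run time in seconds * clock rate / iteration count */
double cycles_per_iteration(double user_seconds, double clock_hz,
                            double iterations)
{
    assert(iterations > 0.0);
    return user_seconds * clock_hz / iterations;
}
```

Because the iteration count equals the assumed clock rate (1.66e9), the
"user" time in seconds can be read directly as cycles per iteration: roughly
4 for the three fast cases and roughly 11 for x86 mov with aliasing.
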
> > Wouldn't just the use of MOVD/MOVSS instructions here also solve
> > this problem? Store forwarding does not seem to be used for SIMD
> > according to the manual. I haven't benchmarked anything yet
> > though.
> >
> > http://lists.cairographics.org/archives/pixman/2010-August/000425.html
I still think this needs to be clarified. Using the movd or movss instructions
there has the additional benefit that the same operation can be applied to
pixel data in all cases.
Currently you use "por %xmm6, %xmm0" in the main loop and "or 40(%rsi), %ecx"
for the trailing pixels. You could easily use just the "por" instruction
everywhere. It's not a big deal here, just inconsistent. But with your approach
and more complex compositing operations, you would need to maintain separate
sse2 and x86 implementations for the main loop and for the trailing pixels.
This can and IMHO should be avoided.
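
To illustrate what I mean, here is a rough sketch using SSE2 intrinsics in C
rather than the assembly from the patch. The function name is hypothetical,
and I am assuming that the constant being OR-ed in is the 0xff000000 alpha
mask for x8r8g8b8. The trailing pixels go through the same "por" as the main
loop, with _mm_cvtsi32_si128/_mm_cvtsi128_si32 normally compiling to movd
(the exact instructions chosen are up to the compiler).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not the actual pixman code: copy n pixels from src
 * to dst and force the alpha byte of each pixel to 0xff. */
void fetch_x8r8g8b8_sketch(uint32_t *dst, const uint32_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    /* Assumed mask: opaque alpha in every 32-bit lane. */
    const __m128i alpha = _mm_set1_epi32((int)0xff000000u);
    size_t i = 0;

    /* Main loop: four pixels per iteration, unaligned load and store. */
    for (; i + 4 <= n; i += 4) {
        __m128i px = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_or_si128(px, alpha));
    }

    /* Trailing pixels: one pixel in, the same OR, one pixel out. */
    for (; i < n; i++) {
        __m128i px = _mm_cvtsi32_si128((int)src[i]);
        dst[i] = (uint32_t)_mm_cvtsi128_si32(_mm_or_si128(px, alpha));
    }
}
```

With a more complex compositing operation, only the expression in the middle
would change, and it would be the same expression in both loops.
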
--
Best regards,
Siarhei Siamashka