[Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Tue Sep 7 12:39:29 PDT 2010

On Tuesday 07 September 2010 14:03:52 Ma, Ling wrote:
> > > >     Wouldn't just the use of MOVD/MOVSS instructions here also solve
> > > >     this problem?  Store forwarding does not seem to be used for SIMD
> > > >     according to the manual. I haven't benchmarked anything yet
> > > >     though.

> [Ma Ling] movd is not good for Atom because it will use the AGU:
> " * Integer-FP/SIMD transfer: Instructions that transfer integer data to
> the FP/SIMD side of the machine also uses AGU. Examples of these
> instructions include MOVD, PINSRW. If one of the source register of these
> instructions depends on the result of an execution unit, this dependency
> will also cause a delay of 3 cycles." 12.3.2.2 Address Generation

Well, a simple benchmark shows that MOVD performs at least as well as MOV if we
compare the following loops:

 80483e0:       66 0f 6e 06             movd   (%esi),%xmm0
 80483e4:       66 0f eb c1             por    %xmm1,%xmm0
 80483e8:       66 0f 7e 07             movd   %xmm0,(%edi)
 80483ec:       49                      dec    %ecx
 80483ed:       75 f1                   jne    80483e0 <main+0x20>

vs.

 80483e0:       8b 06                   mov    (%esi),%eax
 80483e2:       09 d8                   or     %ebx,%eax
 80483e4:       89 07                   mov    %eax,(%edi)
 80483e6:       49                      dec    %ecx
 80483e7:       75 f7                   jne    80483e0 <main+0x20>

Both of these loops take 3 cycles per iteration, as can easily be measured with
a simple modification of the benchmark program that I posted earlier.
The only advantage of the non-SSE code here is that it is smaller. But you are
giving up this code size advantage anyway by keeping both forward and backward
copy variants.

The part of the optimization manual which you quoted is about moving data
between x86 general-purpose registers and SSE registers, which is indeed a bad
idea. But there is no need to do anything like this. You can just load data
directly from memory into SSE registers, do some operations on it (using SSE
instructions) and store the result to memory.
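
For illustration, the load / operate / store pattern described above could
look roughly like this in C with SSE2 intrinsics. This is only a sketch under
assumptions: the function name fetch_x888_sketch is hypothetical (it is not
pixman code and not the patch under review), the operation is assumed to be
"force the alpha byte to 0xff", and unaligned loads and stores are used for
simplicity. The data never passes through a general-purpose register in the
SIMD loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: copy n x8r8g8b8 pixels, forcing alpha to 0xff. */
static void fetch_x888_sketch(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    /* 0xff000000 in every 32-bit lane, built entirely on the SIMD side. */
    const __m128i alpha = _mm_slli_epi32(_mm_set1_epi32(0xff), 24);

    for (; i + 4 <= n; i += 4) {
        /* memory -> XMM register */
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        /* operate with an SSE instruction (POR) */
        v = _mm_or_si128(v, alpha);
        /* XMM register -> memory */
        _mm_storeu_si128((__m128i *)(dst + i), v);
    }
#endif
    /* Scalar tail (and fallback when SSE2 is not available). */
    for (; i < n; i++)
        dst[i] = src[i] | 0xff000000u;
}
```

Whether a compiler turns this into exactly the instruction sequence from the
loops above depends on the compiler and its flags, so cycle counts would need
to be measured separately.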

> > Your code still can be simplified a lot. I'm just not quite sure whether it
> > would be more practical to commit something first and then refactor it with
> > the follow up commits. Or attempt to make a "perfect" patch before
> > committing.

> [Ma Ling] Yes, I agree with you, let us commit it first, then strengthen it,
> for example by adding non-temporal instructions for large data copies that
> exceed the L1 cache size.

Actually it's not up to me to decide. What matters much more is whether
Søren Sandmann agrees to this or not. I'm just afraid that by trying to reach
perfection, we may get stuck with no further progress at some point, because
some of the things which are clear and simple for me may be too unusual or
unfamiliar for you. Or possibly the other way around.

But in any case, the other issues pointed out by Søren still can and need to
be addressed.

Just one last question to clarify things. Did you write this SSSE3 assembly
code yourself or was the output of some C compiler (at least partially) used
for it?

-- 
Best regards,
Siarhei Siamashka