[Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Xu, Samuel samuel.xu at intel.com
Tue Sep 7 18:44:07 PDT 2010


Hi, Soeren Sandmann and Siarhei Siamashka:

To wrap up the current discussion and combine the comments from both of you: can we assume the new SSSE3 patch will be OK?
The new patch would contain:
1. Fix 64 bit CPU detection issue for MMX and SSE2
2. Add more comments for git commit log
3. Change the SSSE3 intrinsics check to an SSSE3 asm check in the makefile
4. Remove #include "pixman-combine32.h" and composite_over_8888_n_8888 in pixman-ssse3.c
5. ASM file changes:
	1) change asm file name to pixman-ssse3-x86-32-asm.S and pixman-ssse3-x86-64-asm.S
	2) change the asm function name to "composite_line_src_x888_8888_ssse3"
	3) remove "defined(_M_AMD64))"
	4) Comments in the assembly about the store forwarding
We know there is still some discussion about the asm code, e.g. using MOVD and unifying the 32-bit and 64-bit code. Since these points do not introduce real defects, we can put further changes in the next wave, to avoid getting stuck now.

I am not sure about the Sun Studio issue. Sun Studio claims GNU asm compatibility, but I am not sure whether it is 100% compatible. If the issue is caused by Sun Studio itself, can we add an #ifdef to disable the SSSE3 patch for Sun Studio? In that case, how do we detect Sun Studio?
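On the detection question, a minimal sketch: Sun Studio's C compiler predefines __SUNPRO_C and its C++ compiler predefines __SUNPRO_CC, so an #ifdef on those macros is one way to skip the SSSE3 asm path. The names USE_SSSE3_ASM, SKETCH_SUN_STUDIO, SKETCH_HAVE_SSSE3_ASM and sketch_have_ssse3_asm() are hypothetical and are not taken from the patch or from pixman.

```c
/* Sun Studio predefines __SUNPRO_C (C) and __SUNPRO_CC (C++). */
#if defined(__SUNPRO_C) || defined(__SUNPRO_CC)
#  define SKETCH_SUN_STUDIO 1
#else
#  define SKETCH_SUN_STUDIO 0
#endif

/*
 * USE_SSSE3_ASM is a hypothetical macro that the build system would
 * define when its SSSE3 asm check succeeds.  The asm fast path is
 * enabled only if that check passed and the compiler is not Sun Studio.
 */
#if defined(USE_SSSE3_ASM) && !SKETCH_SUN_STUDIO
#  define SKETCH_HAVE_SSSE3_ASM 1
#else
#  define SKETCH_HAVE_SSSE3_ASM 0
#endif

/* Hypothetical runtime-visible switch, so callers can pick a fallback. */
int sketch_have_ssse3_asm(void)
{
    return SKETCH_HAVE_SSSE3_ASM;
}
```

Whether excluding Sun Studio is needed at all depends on whether the failure really comes from its assembler, which is still an open question here.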

Thanks!
Samuel

-----Original Message-----
From: Siarhei Siamashka [mailto:siarhei.siamashka at gmail.com] 
Sent: Wednesday, September 08, 2010 3:39 AM
To: Ma, Ling
Cc: Xu, Samuel; Soeren Sandmann; pixman at lists.freedesktop.org; Liu, Xinyun
Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

On Tuesday 07 September 2010 14:03:52 Ma, Ling wrote:
> > > >     Wouldn't just the use of MOVD/MOVSS instructions here also solve
> > > >     this problem?  Store forwarding does not seem to be used for SIMD
> > > >     according to the manual. I haven't benchmarked anything yet
> > > >     though.

> [Ma Ling] movd is not good for Atom because it will use the AGU: " * 
> Integer-FP/SIMD transfer: Instructions that transfer integer data to 
> the FP/SIMD side of the machine also uses AGU. Examples of these 
> instructions include MOVD, PINSRW. If one of the source register of 
> these instructions depends on the result of an execution unit, this 
> dependency will also cause a delay of 3 cycles." 12.3.2.2 Address 
> Generation

Well, a simple benchmark shows that MOVD performs at least as well as MOV if we compare the following loops:

 80483e0:       66 0f 6e 06             movd   (%esi),%xmm0
 80483e4:       66 0f eb c1             por    %xmm1,%xmm0
 80483e8:       66 0f 7e 07             movd   %xmm0,(%edi)
 80483ec:       49                      dec    %ecx
 80483ed:       75 f1                   jne    80483e0 <main+0x20>

vs.

 80483e0:       8b 06                   mov    (%esi),%eax
 80483e2:       09 d8                   or     %ebx,%eax
 80483e4:       89 07                   mov    %eax,(%edi)
 80483e6:       49                      dec    %ecx
 80483e7:       75 f7                   jne    80483e0 <main+0x20>

Both of these loops use 3 cycles per iteration as can be easily measured by a simple modification of the benchmark program that I posted earlier.
The only advantage of the non-SSE code here is that it is smaller. But you are killing this code size advantage anyway by keeping both forward and backward copy variants.

The part of the optimization manual which you quoted is about moving data between x86 general-purpose registers and SSE registers, which is indeed a bad idea. But there is no need to do anything like this. You can just load data directly from memory into SSE registers, do some operations on it (using SSE instructions) and store the result to memory.
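The memory -> SSE register -> memory pattern described above can be sketched in C with SSE2 intrinsics. This is only an illustration and not the code from the patch: the function name sketch_x888_to_8888 is hypothetical, it processes four pixels per step rather than one as in the MOVD loop, and it assumes that the x8r8g8b8 to a8r8g8b8 conversion amounts to ORing 0xff000000 into every pixel, like the POR/OR in the loops above. A plain C fallback is used when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#  include <emmintrin.h>
#  define SKETCH_USE_SSE2 1
#else
#  define SKETCH_USE_SSE2 0
#endif

/* Hypothetical helper: copy n pixels, forcing the alpha byte to 0xff. */
void sketch_x888_to_8888(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL);

#if SKETCH_USE_SSE2
    {
        /* Alpha mask lives in an SSE register for the whole loop. */
        const __m128i alpha = _mm_set1_epi32((int)0xff000000u);

        for (; i + 4 <= n; i += 4) {
            /* Load straight from memory into an SSE register ... */
            __m128i px = _mm_loadu_si128((const __m128i *)(src + i));
            /* ... operate on it with an SSE instruction (POR) ... */
            px = _mm_or_si128(px, alpha);
            /* ... and store it straight back to memory. */
            _mm_storeu_si128((__m128i *)(dst + i), px);
        }
    }
#endif

    /* Leftover pixels (or everything, without SSE2) use plain integer code. */
    for (; i < n; i++)
        dst[i] = src[i] | 0xff000000u;
}
```

No value passes through a general-purpose register in the vector loop, so the integer-to-SIMD transfer penalty from the quoted manual section should not come into play there.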

> > Your code still can be simplified a lot. I'm just not quite sure 
> > whether it would be more practical to commit something first and 
> > then refactor it with the follow up commits. Or attempt to make a 
> > "perfect" patch before committing.

> [Ma Ling] Yes, I agree with you, let us commit it first, then strengthen 
> it, such as by adding non-temporal instructions for large data copies 
> which exceed the L1 cache size.

Actually it's not up to me to decide. Much more important is whether Søren Sandmann agrees to this or not. I'm just afraid that in trying to reach perfection, we may get stuck with no further progress at some point, because, for example, some of the things which are clear and simple for me may be too unusual or unfamiliar for you. Or possibly the other way around.

But in any case, the other issues pointed out by Søren still can and need to be addressed.

Just one last question to clarify things. Did you write this SSSE3 assembly code yourself or was the output of some C compiler (at least partially) used for it?

--
Best regards,
Siarhei Siamashka

