[Pixman] [PATCH] sse2: Using MMX and SSE 4.1

Wed May 9 10:24:37 PDT 2012

On 2012-05-09, at 12:57 PM, Søren Sandmann wrote:

> Matt Turner <mattst88 at gmail.com> writes:
> 
>> I started porting my src_8888_0565 MMX function to SSE2, and in the
>> process started thinking about using SSE3+. The useful instructions
>> added post SSE2 that I see are
>> 	SSE3:	lddqu - for unaligned loads across cache lines
> 
> I don't really understand that instruction. Isn't it identical to
> movdqu?  Or is the idea that lddqu is faster than movdqu for cache line
> splits, but slower for plain old, non-cache split unaligned loads?

"The instructions movdqu, movups, movupd and lddqu are all able to read unaligned vectors. lddqu is faster than the alternatives on P4E and PM processors, but requires the SSE3 instruction set. The unaligned read instructions are relatively slow on older processors, but faster on Nehalem, Sandy Bridge and on future AMD and Intel processors."

From http://www.agner.org/optimize/optimizing_assembly.pdf
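
For reference, a minimal sketch of how the two loads look from C intrinsics. This is not pixman code: load_unaligned_16 is a hypothetical helper made up for illustration. It assumes GCC/Clang-style __SSE3__ / __SSE2__ feature macros, and falls back to memcpy when neither is defined. Both intrinsics return the same 16 bytes; any difference is only in how the hardware handles the unaligned access.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE3__)
#include <pmmintrin.h>  /* SSE3: _mm_lddqu_si128 (lddqu) */
#elif defined(__SSE2__)
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128 (movdqu) */
#endif

/* Hypothetical helper (not part of pixman): copy 16 bytes from a
 * possibly unaligned address into out[]. */
void load_unaligned_16(const void *src, uint8_t out[16])
{
    assert(src != NULL && out != NULL);
#if defined(__SSE3__)
    /* lddqu: unaligned load, available only with SSE3. */
    __m128i v = _mm_lddqu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)out, v);
#elif defined(__SSE2__)
    /* movdqu: the plain SSE2 unaligned load. */
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)out, v);
#else
    /* No SSE available at compile time: portable fallback. */
    memcpy(out, src, 16);
#endif
}
```

Built with -msse3 this takes the lddqu path; with SSE2 only it takes the movdqu path. Since the results are identical, picking one over the other is purely a performance question, and per the quote above it depends on the processor generation.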

-Jeff