[Pixman] [PATCH] sse2: Using MMX and SSE 4.1

Wed May 9 09:57:39 PDT 2012

Matt Turner <mattst88 at gmail.com> writes:

> I started porting my src_8888_0565 MMX function to SSE2, and in the
> process started thinking about using SSE3+. The useful instructions
> added post SSE2 that I see are
> 	SSE3:	lddqu - for unaligned loads across cache lines

I don't really understand that instruction. Isn't it identical to
movdqu?  Or is the idea that lddqu is faster than movdqu for cache line
splits, but slower for plain old, non-cache split unaligned loads?

> 	SSSE3:	palignr - for unaligned loads (but requires software
> 			  pipelining...)
> 		pmaddubsw - maybe?

pmaddubsw would be very useful for bilinear interpolation if we drop
coordinate precision to 7 bits instead of the current 8. One way to use
it is to put 7-bit values of (1 - x, x, 1 - x, x) in one register, and
interleave the top-left and top-right pixels in another. pmaddubsw on
those two registers will then produce a linear interpolation between the
two top pixels. A similar thing can be done for the bottom pixels, and
then the intermediate results can be interleaved and combined using
pmaddwd.
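
To make that concrete, here is a minimal sketch in C with intrinsics.
The function name and signature are hypothetical, not pixman code. One
detail is an assumption of the sketch and not part of the description
above: the weight 128 - x can reach 128, which does not fit the signed
operand of pmaddubsw, so the sketch puts the weights in the unsigned
operand, biases the pixels by 128, and adds the bias back afterwards.
The target attribute is a GCC/Clang extension that lets the block
compile without -mssse3.

```c
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_maddubs_epi16 (pmaddubsw) */

/* Hypothetical helper: bilinear blend of four 8888 pixels.
 * wx and wy are 7-bit weights in [0, 127]. */
__attribute__ ((target ("ssse3")))
uint32_t
bilinear_7bit_ssse3 (uint32_t tl, uint32_t tr,
                     uint32_t bl, uint32_t br,
                     int wx, int wy)
{
    /* Bytes: tl0 tr0 tl1 tr1 ... | bl0 br0 bl1 br1 ... */
    __m128i top = _mm_unpacklo_epi8 (_mm_cvtsi32_si128 ((int) tl),
                                     _mm_cvtsi32_si128 ((int) tr));
    __m128i bot = _mm_unpacklo_epi8 (_mm_cvtsi32_si128 ((int) bl),
                                     _mm_cvtsi32_si128 ((int) br));
    __m128i pix = _mm_unpacklo_epi64 (top, bot);

    /* Byte pairs (128 - wx, wx), used as the unsigned operand. */
    __m128i wgt_x = _mm_set1_epi16 ((short) ((wx << 8) | (128 - wx)));

    /* Bias pixels to signed range: p ^ 0x80 == p - 128. */
    __m128i biased = _mm_xor_si128 (pix, _mm_set1_epi8 (-128));

    /* Each 16-bit lane: (128 - wx) * (l - 128) + wx * (r - 128). */
    __m128i h = _mm_maddubs_epi16 (wgt_x, biased);

    /* Undo the bias: the weights sum to 128, so add 128 * 128. */
    h = _mm_add_epi16 (h, _mm_set1_epi16 (128 * 128));

    /* Interleave top and bottom results: t0 b0 t1 b1 t2 b2 t3 b3. */
    __m128i tb = _mm_unpacklo_epi16 (h, _mm_srli_si128 (h, 8));

    /* pmaddwd with 16-bit pairs (128 - wy, wy) blends vertically. */
    __m128i wgt_y = _mm_set1_epi32 ((wy << 16) | (128 - wy));
    __m128i v = _mm_madd_epi16 (tb, wgt_y);

    /* Result is scaled by 128 * 128 = 1 << 14; round and shift. */
    v = _mm_srli_epi32 (_mm_add_epi32 (v, _mm_set1_epi32 (1 << 13)), 14);

    /* Narrow 4 x 32 bit -> 4 x 8 bit and return as one pixel. */
    v = _mm_packs_epi32 (v, v);
    v = _mm_packus_epi16 (v, v);
    return (uint32_t) _mm_cvtsi128_si32 (v);
}
```

With 7-bit weights the horizontal sums stay at or below 255 * 128 =
32640, so pmaddubsw's signed 16-bit saturation never kicks in; with
8-bit weights it would.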

> 	SSE4.1: pextr*, pinsr* pcmpeqq, ptest packusdw - for 888 -> 565
> 	packing
>
> I first wrote a basic src_8888_0565 for SSE2 and discovered that the
> performance was worse than MMX (which we've been saying has no use in
> modern systems -- oops!). I figured the cool pmadd algorithm of MMX was
> the cause, but I wondered if 16-byte SSE chunks are too large
> sometimes.
>
> I added an 8-byte MMX loop before and after the main 16-byte SSE loop
> and got a nice improvement.
>

I still think MMX has no use on modern systems. The SSE2 implementation
used to have such MMX loops, but they were removed in
f88ae14c15040345a12ff0488c7b23d25639e49b because there were issues with
compilers that would miscompile the emms instruction.

Can't the MMX loop you added be done with SSE registers and instructions
as well?
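
As a rough illustration of what I mean, here is a sketch that handles
one 8-byte chunk (two pixels) using only XMM registers, so no emms is
needed. The function name is hypothetical, and it uses plain
shift-and-mask rather than the pmadd algorithm, purely to show the
8-byte load and the narrow store.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert two 8888 pixels (8 bytes) to 0565. */
void
src_8888_0565_2px_sse2 (uint16_t *dst, const uint32_t *src)
{
    /* 8-byte load into the low half; the upper half is zeroed. */
    __m128i s = _mm_loadl_epi64 ((const __m128i *) src);

    /* Move each channel's top bits into its 565 position. */
    __m128i r = _mm_and_si128 (_mm_srli_epi32 (s, 8),
                               _mm_set1_epi32 (0xF800));
    __m128i g = _mm_and_si128 (_mm_srli_epi32 (s, 5),
                               _mm_set1_epi32 (0x07E0));
    __m128i b = _mm_and_si128 (_mm_srli_epi32 (s, 3),
                               _mm_set1_epi32 (0x001F));
    __m128i p = _mm_or_si128 (_mm_or_si128 (r, g), b);

    /* Words are now (p0, 0, p1, 0, ...); bring p1 next to p0. */
    p = _mm_shufflelo_epi16 (p, _MM_SHUFFLE (3, 1, 2, 0));

    /* Store the low 4 bytes (two 0565 pixels). */
    int out = _mm_cvtsi128_si32 (p);
    memcpy (dst, &out, sizeof out);
}
```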

> Porting the pmadd algorithm to SSE4.1 gave another (very large)
> improvement.
>
> fast:	src_8888_0565 = L1: 655.18  L2: 675.94  M:642.31  ( 23.44%) HT:403.00  VT:286.45  R:307.61  RT:150.59 (1675Kops/s)
> mmx:	src_8888_0565 = L1:2050.45  L2:1988.97  M:1586.16 ( 57.34%) HT:529.12  VT:374.28  R:412.09  RT:177.35 (1913Kops/s)
> sse2:	src_8888_0565 = L1:1518.61  L2:1493.10  M:1279.18 ( 46.24%) HT:433.65  VT:314.48  R:349.14  RT:151.84 (1685Kops/s)
> sse2mmx:src_8888_0565 = L1:1544.91  L2:1520.83  M:1307.79 ( 47.01%) HT:447.82  VT:326.81  R:379.60  RT:174.07 (1878Kops/s)
> sse4:	src_8888_0565 = L1:4654.11  L2:4202.98  M:1885.01 ( 69.35%) HT:540.65  VT:421.04  R:427.73  RT:161.45 (1773Kops/s)
> sse4mmx:src_8888_0565 = L1:4786.27  L2:4255.13  M:1920.18 ( 69.93%) HT:581.42  VT:447.99  R:482.27  RT:193.15 (2049Kops/s)
>
> I'd like to isolate exactly what the performance improvement given by
> the only SSE4.1 instruction (i.e., _mm_packus_epi32) is before declaring
> SSE4.1 a fantastic improvement. If you can come up with a reasonable way
> to pack the two xmm registers together in pack_565_2packedx128_128,
> please tell me. Shuffle only works on hi/lo 8-bytes, so it'd be a pain.

Would it work to subtract 0x8000, then use packssdw, then add 0x8000?
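
Something along these lines, as an untested-in-pixman sketch with
hypothetical function names. It assumes every 32-bit input is a small
non-negative value, as it would be in 565 packing; inputs close to
INT_MIN would wrap in the subtraction and come out wrong.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical SSE2 stand-in for _mm_packus_epi32 (packusdw). */
static __m128i
pack_u32_to_u16_sse2 (__m128i lo, __m128i hi)
{
    const __m128i bias32 = _mm_set1_epi32 (0x8000);
    const __m128i bias16 = _mm_set1_epi16 (-32768);  /* 0x8000 */

    /* Shift [0, 65535] down to [-32768, 32767] so that the signed
     * saturation of packssdw leaves in-range values intact. */
    lo = _mm_sub_epi32 (lo, bias32);
    hi = _mm_sub_epi32 (hi, bias32);

    /* Pack, then undo the bias with wrapping 16-bit addition. */
    return _mm_add_epi16 (_mm_packs_epi32 (lo, hi), bias16);
}

/* Convenience wrapper for trying it out on plain arrays. */
void
pack8_u32_to_u16 (const uint32_t in[8], uint16_t out[8])
{
    __m128i lo = _mm_loadu_si128 ((const __m128i *) in);
    __m128i hi = _mm_loadu_si128 ((const __m128i *) (in + 4));

    _mm_storeu_si128 ((__m128i *) out, pack_u32_to_u16_sse2 (lo, hi));
}
```

Values above 65535 saturate to 32767 in packssdw and end up as 0xffff
after the bias is added back, which matches what packusdw does.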

> I'd rather not duplicate a bunch of code from pixman-mmx.c, and I'd
> rather not add #ifdef USE_SSE41 to pixman-sse2.c and make it a
> compile-time option (or recompile the whole file to get a few
> improvements from SSE4.1).
>
> It seems like we need a generic solution that would say for each
> compositing function
> 	- this is what you do for 1-byte;
> 	- this is what you do for 8-bytes if you have MMX;
> 	- this is what you do for 16-bytes if you have SSE2;
> 	- this is what you do for 16-bytes if you have SSE3;
> 	- this is what you do for 16-bytes if you have SSE4.1.
> and then construct the functions for generic/MMX/SSE2/SSE4 at build time.
>
> Does this seem like a reasonable approach? *How* to do it -- suggestions
> welcome.

I think ideally we would generate this code at runtime. It's just not
feasible to generate code for all combinations of instruction sets at
build time and libpixman.so is already rather large. Generating the code
at runtime has the additional advantages that it is not limited to a
fixed set of fast paths and that it can make use of more details of the
operation such as the precise alignment for palignr generation.

There are various ways to go about this, ranging from simple-minded
stitching-together of pre-written snippets to a full shader compiler. A
full shader compiler is obviously a big project, but maybe a simple
stitch-together kind of thing wouldn't actually be that hard using
something like this:

http://cgit.freedesktop.org/~sandmann/genrender/tree/codex86.h?h=graph
http://cgit.freedesktop.org/~sandmann/genrender/tree/codex86.c?h=graph

That said, runtime code-generation is still a big project, and it does
make sense to make use of some of the newer instruction sets.

We do have support for fallbacks, so as Makoto-san says, just adding new
pixman-ssse3.c and pixman-sse41.c files with duplicated code for the
particular operations that benefit from ssse3 and sse4.1 might be the
simplest way to proceed.
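
For anyone unfamiliar with the fallback idea, here is a toy model of it.
All type and function names below are made up for illustration and are
not pixman's actual API: each implementation level either provides a
fast path for an operation or leaves it NULL, and lookup walks down the
chain until it finds one.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*convert_fn) (uint16_t *dst, const uint32_t *src, int n);

/* Hypothetical implementation level with a single operation slot. */
typedef struct impl impl_t;
struct impl
{
    const char   *name;
    const impl_t *fallback;       /* next, less specialized level */
    convert_fn    src_8888_0565;  /* NULL: no fast path at this level */
};

/* Plain C version, always available at the bottom of the chain. */
static void
general_src_8888_0565 (uint16_t *dst, const uint32_t *src, int n)
{
    int i;

    for (i = 0; i < n; i++)
    {
        uint32_t s = src[i];

        dst[i] = (uint16_t) (((s >> 8) & 0xF800) |
                             ((s >> 5) & 0x07E0) |
                             ((s >> 3) & 0x001F));
    }
}

const impl_t general_impl = { "general", NULL, general_src_8888_0565 };

/* A level that adds nothing for this operation falls through. */
const impl_t sse2_impl = { "sse2", &general_impl, NULL };

/* Walk the chain from the most specialized level downwards. */
convert_fn
lookup_src_8888_0565 (const impl_t *top)
{
    for (; top != NULL; top = top->fallback)
    {
        if (top->src_8888_0565 != NULL)
            return top->src_8888_0565;
    }
    return NULL;
}
```

A new sse41 level would then only need to fill in the slots for the
operations it actually speeds up and point its fallback at the level
below.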

Soren