[Pixman] [PATCH] sse2: Using MMX and SSE 4.1

Thu Jun 14 06:45:46 PDT 2012

Sorry it's taken so long to get back to this.

On Wed, May 9, 2012 at 12:57 PM, Søren Sandmann <sandmann at cs.au.dk> wrote:
> Matt Turner <mattst88 at gmail.com> writes:
> I still think MMX has no use on modern systems. The SSE2 implementation
> used to have such MMX loops, but they were removed in
> f88ae14c15040345a12ff0488c7b23d25639e49b because there were issues with
> compilers that would miscompile the emms instruction.
>
> Can't the MMX loop you added be done with SSE registers and instructions
> as well?

The registers -- yes. As for the 8-byte aligned loads and stores, I'm
not sure. Can you do 8-byte loads and stores to/from SSE registers?
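
For reference, SSE2 does provide 64-bit moves between memory and the low
half of an XMM register (movq), exposed as the _mm_loadl_epi64 and
_mm_storel_epi64 intrinsics. A minimal sketch follows; the helper name
copy_two_pixels_sse2 is made up for illustration and is not from the
patch:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper (not from pixman): move two 32-bit pixels, i.e.
 * 8 bytes, through an XMM register without touching MMX state. */
static void
copy_two_pixels_sse2 (uint32_t *dst, const uint32_t *src)
{
    /* movq load: fills the low 64 bits, zeroes the upper 64 bits. */
    __m128i v = _mm_loadl_epi64 ((const __m128i *) src);

    /* movq store: writes only the low 64 bits. */
    _mm_storel_epi64 ((__m128i *) dst, v);
}
```

Neither intrinsic requires the address to be aligned, and since no MMX
register is involved, no emms is needed afterwards.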

>> Porting the pmadd algorithm to SSE4.1 gave another (very large)
>> improvement.
>>
>> fast: src_8888_0565 = L1: 655.18  L2: 675.94  M:642.31  ( 23.44%) HT:403.00  VT:286.45  R:307.61  RT:150.59 (1675Kops/s)
>> mmx:  src_8888_0565 = L1:2050.45  L2:1988.97  M:1586.16 ( 57.34%) HT:529.12  VT:374.28  R:412.09  RT:177.35 (1913Kops/s)
>> sse2: src_8888_0565 = L1:1518.61  L2:1493.10  M:1279.18 ( 46.24%) HT:433.65  VT:314.48  R:349.14  RT:151.84 (1685Kops/s)
>> sse2mmx:src_8888_0565 = L1:1544.91  L2:1520.83  M:1307.79 ( 47.01%) HT:447.82  VT:326.81  R:379.60  RT:174.07 (1878Kops/s)
>> sse4: src_8888_0565 = L1:4654.11  L2:4202.98  M:1885.01 ( 69.35%) HT:540.65  VT:421.04  R:427.73  RT:161.45 (1773Kops/s)
>> sse4mmx:src_8888_0565 = L1:4786.27  L2:4255.13  M:1920.18 ( 69.93%) HT:581.42  VT:447.99  R:482.27  RT:193.15 (2049Kops/s)
>>
>> I'd like to isolate exactly what the performance improvement given by
>> the only SSE4.1 instruction (i.e., _mm_packus_epi32) is before declaring
>> SSE4.1 a fantastic improvement. If you can come up with a reasonable way
>> to pack the two xmm registers together in pack_565_2packedx128_128,
>> please tell me. Shuffle only works on hi/lo 8-bytes, so it'd be a pain.
>
> Would it work to subtract 0x8000, then use packssdw, then add 0x8000?

I couldn't make that work, but I asked on StackOverflow and got a nice
solution: http://stackoverflow.com/questions/11024652/simulating-packusdw-functionality-with-sse2
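
As an illustration only (this is not necessarily the approach from the
linked answer), one SSE2-only way to get the effect of packusdw when
every 32-bit lane is already known to lie in [0, 0xFFFF] is to
sign-extend the low 16 bits of each lane and then use the signed pack.
The helper name pack_u32_to_u16_sse2 is hypothetical:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: pack eight 32-bit lanes into eight 16-bit lanes
 * using SSE2 only.
 *
 * Assumption: every input lane is already in [0, 0xFFFF]. Values outside
 * that range are truncated to their low 16 bits, whereas the SSE4.1
 * packusdw instruction would saturate them. */
static __m128i
pack_u32_to_u16_sse2 (__m128i lo, __m128i hi)
{
    /* Shift left then arithmetic-shift right: each lane now holds its low
     * 16 bits sign-extended to 32 bits, so it fits in int16_t. */
    lo = _mm_srai_epi32 (_mm_slli_epi32 (lo, 16), 16);
    hi = _mm_srai_epi32 (_mm_slli_epi32 (hi, 16), 16);

    /* packssdw cannot saturate here, so the low 16 bits pass through. */
    return _mm_packs_epi32 (lo, hi);
}
```

The range assumption would need to hold at the call site (for example
because the lanes were masked beforehand); otherwise the result differs
from what _mm_packus_epi32 returns.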

>> I'd rather not duplicate a bunch of code from pixman-mmx.c, and I'd
>> rather not add #ifdef USE_SSE41 to pixman-sse2.c and make it a
>> compile-time option (or recompile the whole file to get a few
>> improvements from SSE4.1).
>>
>> It seems like we need a generic solution that would say for each
>> compositing function
>>       - this is what you do for 1-byte;
>>       - this is what you do for 8-bytes if you have MMX;
>>       - this is what you do for 16-bytes if you have SSE2;
>>       - this is what you do for 16-bytes if you have SSE3;
>>       - this is what you do for 16-bytes if you have SSE4.1.
>> and then construct the functions for generic/MMX/SSE2/SSE4 at build time.
>>
>> Does this seem like a reasonable approach? *How* to do it -- suggestions
>> welcome.
>
> I think ideally we would generate this code at runtime. It's just not
> feasible to generate code for all combinations of instruction sets at
> build time and libpixman.so is already rather large. Generating the code
> at runtime has the additional advantages that it is not limited to a
> fixed set of fast paths and that it can make use of more details of the
> operation such as the precise alignment for palignr generation.
>
> There are various ways to go about this, ranging from simple-minded
> stitching-together of pre-written snippets to a full shader compiler. A
> full shader compiler is obviously a big project, but maybe a simple
> stitch-together kind of thing wouldn't actually be that hard using
> something like this:
>
> http://cgit.freedesktop.org/~sandmann/genrender/tree/codex86.h?h=graph
> http://cgit.freedesktop.org/~sandmann/genrender/tree/codex86.c?h=graph
>
> That said, runtime code-generation is still a big project, and it does
> make sense to make use of some of the newer instruction sets.
>
> We do have support for fallbacks, so as Makoto-san says, just adding new
> pixman-ssse3.c and pixman-sse41.c files with duplicated code for the
> particular operations that benefit from ssse3 and sse4.1 might be the
> simplest way to proceed.

Indeed, runtime generation would be great. Something like LLVM or orc
would be interesting options. I'm not sure I'm up to that kind of
project yet/now though.

I think adding pixman-sse*.c files is a reasonable measure for now.
Do you think it's okay to split the static inline support functions
from pixman-sse2.c out into a header to be shared with the other
pixman-sse*.c files?
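
To make the fallback idea concrete, here is a rough sketch of how
per-instruction-set files could chain to one another. All names in it
(impl_t, find_src_8888_0565, the *_path functions) are invented for
illustration and are not pixman's actual types or API:

```c
#include <stddef.h>

typedef void (*composite_func_t) (void);

/* Hypothetical implementation record: each level provides fast paths for
 * some operations and defers everything else to a less specialized level. */
typedef struct impl impl_t;
struct impl
{
    const char      *name;
    composite_func_t src_8888_0565;  /* NULL if this level has no fast path */
    const impl_t    *fallback;       /* next level to try, NULL at the end  */
};

/* Placeholder fast paths; real ones would do the actual compositing. */
static void general_path (void) { }
static void sse2_path (void) { }

/* Walk from the most specialized level downwards and return the first
 * level that implements the operation. */
static const impl_t *
find_src_8888_0565 (const impl_t *top)
{
    for (const impl_t *i = top; i != NULL; i = i->fallback)
    {
        if (i->src_8888_0565 != NULL)
            return i;
    }
    return NULL;
}
```

With a chain like this, an SSE4.1 file only needs to carry the few
operations that actually benefit; everything it leaves out resolves to
the SSE2 or general version.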

Also, are we planning to change the bilinear scaling algorithm for
0.28 so that we can use pmaddubsw?
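
For context on why the algorithm would have to change: pmaddubsw
(_mm_maddubs_epi16) multiplies unsigned bytes from one operand by signed
bytes from the other and adds adjacent products with signed 16-bit
saturation, so interpolation weights have to fit in a signed byte. The
scalar model below shows a single output lane. The 7-bit weight scheme
in lerp_7bit is an assumption made for this sketch, not something
decided in this thread:

```c
#include <stdint.h>

/* Scalar model of one 16-bit output lane of pmaddubsw: a0, a1 are
 * unsigned bytes, b0, b1 are signed bytes, and the sum of the two
 * products saturates to the int16_t range. */
static int16_t
maddubs_lane (uint8_t a0, uint8_t a1, int8_t b0, int8_t b1)
{
    int32_t sum = (int32_t) a0 * b0 + (int32_t) a1 * b1;

    if (sum > INT16_MAX)
        sum = INT16_MAX;
    if (sum < INT16_MIN)
        sum = INT16_MIN;
    return (int16_t) sum;
}

/* Hypothetical blend of two channel values with 7-bit weights.
 * Assumption: w is in 1..127, so both w and 128 - w fit in int8_t and
 * the sum is at most 255 * 128 = 32640, which never saturates. */
static uint8_t
lerp_7bit (uint8_t left, uint8_t right, int w)
{
    int16_t sum = maddubs_lane (left, right, (int8_t) (128 - w), (int8_t) w);

    return (uint8_t) (sum >> 7);
}
```

With 8-bit weights the second operand could no longer be held in signed
bytes, and the sum could exceed the signed 16-bit range.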

Thanks,
Matt