[Pixman] [PATCH] sse2: Using MMX and SSE 4.1
Matt Turner
mattst88 at gmail.com
Thu Jun 14 06:45:46 PDT 2012
Sorry it's taken so long to get back to this.
On Wed, May 9, 2012 at 12:57 PM, Søren Sandmann <sandmann at cs.au.dk> wrote:
> Matt Turner <mattst88 at gmail.com> writes:
> I still think MMX has no use on modern systems. The SSE2 implementation
> used to have such MMX loops, but they were removed in
> f88ae14c15040345a12ff0488c7b23d25639e49b because there were issues with
> compilers that would miscompile the emms instruction.
>
> Can't the MMX loop you added be done with SSE registers and instructions
> as well?
The registers -- yes. I'm not sure about the 8-byte aligned loads and
stores, though. Can you do 8-byte aligned loads and stores to/from SSE
registers?
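For what it's worth, the SSE2 intrinsics _mm_loadl_epi64 and
_mm_storel_epi64 (both movq) look like candidates: they move 8 bytes
between memory and the low half of an XMM register. A minimal sketch
of that idea follows. The helper names load_64, store_64 and
add_one_4x16 are made up for illustration and are not from pixman.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Load 8 bytes into the low half of an XMM register; the upper half is
 * zeroed. _mm_loadl_epi64 has no alignment requirement. */
static __m128i load_64(const void *p)
{
    return _mm_loadl_epi64((const __m128i *)p);
}

/* Store only the low 8 bytes of an XMM register. */
static void store_64(void *p, __m128i v)
{
    _mm_storel_epi64((__m128i *)p, v);
}

/* Hypothetical example: add 1 to four 16-bit values, touching exactly
 * 8 bytes of memory on each side and using no MMX registers. */
static void add_one_4x16(uint16_t *dst, const uint16_t *src)
{
    __m128i v = load_64(src);
    v = _mm_add_epi16(v, _mm_set1_epi16(1));
    store_64(dst, v);
}
```

Since no MMX register is involved, no emms is needed afterwards.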
>> Porting the pmadd algorithm to SSE4.1 gave another (very large)
>> improvement.
>>
>> fast: src_8888_0565 = L1: 655.18 L2: 675.94 M:642.31 ( 23.44%) HT:403.00 VT:286.45 R:307.61 RT:150.59 (1675Kops/s)
>> mmx: src_8888_0565 = L1:2050.45 L2:1988.97 M:1586.16 ( 57.34%) HT:529.12 VT:374.28 R:412.09 RT:177.35 (1913Kops/s)
>> sse2: src_8888_0565 = L1:1518.61 L2:1493.10 M:1279.18 ( 46.24%) HT:433.65 VT:314.48 R:349.14 RT:151.84 (1685Kops/s)
>> sse2mmx:src_8888_0565 = L1:1544.91 L2:1520.83 M:1307.79 ( 47.01%) HT:447.82 VT:326.81 R:379.60 RT:174.07 (1878Kops/s)
>> sse4: src_8888_0565 = L1:4654.11 L2:4202.98 M:1885.01 ( 69.35%) HT:540.65 VT:421.04 R:427.73 RT:161.45 (1773Kops/s)
>> sse4mmx:src_8888_0565 = L1:4786.27 L2:4255.13 M:1920.18 ( 69.93%) HT:581.42 VT:447.99 R:482.27 RT:193.15 (2049Kops/s)
>>
>> I'd like to isolate exactly what the performance improvement given by
>> the only SSE4.1 instruction (i.e., _mm_packus_epi32) is before declaring
>> SSE4.1 a fantastic improvement. If you can come up with a reasonable way
>> to pack the two xmm registers together in pack_565_2packedx128_128,
>> please tell me. Shuffle only works on hi/lo 8-bytes, so it'd be a pain.
>
> Would it work to subtract 0x8000, then use packssdw, then add 0x8000?
I couldn't make that work, but I asked on StackOverflow and got a nice
solution: http://stackoverflow.com/questions/11024652/simulating-packusdw-functionality-with-sse2
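One SSE2-only way to get the effect of packusdw, sketched below, is to
sign-extend the low 16 bits of each 32-bit lane so that the signed
saturation in packssdw leaves the bits unchanged. This assumes every
lane already fits in 0..65535 (true for 565 pixels), so unlike the
real _mm_packus_epi32 it does not saturate out-of-range input. The
names packus_epi32_sse2 and pack8 are illustrative only, and this is
not necessarily identical to the answer linked above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Pack eight 32-bit lanes (four in lo, four in hi) into eight 16-bit
 * lanes. Assumption: each lane holds a value in 0..65535. */
static __m128i packus_epi32_sse2(__m128i lo, __m128i hi)
{
    /* Shift left then arithmetic-shift right: the low 16 bits are kept
     * and bit 15 is replicated into the upper half, so each lane lies
     * in -32768..32767 and packssdw will not saturate it. */
    lo = _mm_srai_epi32(_mm_slli_epi32(lo, 16), 16);
    hi = _mm_srai_epi32(_mm_slli_epi32(hi, 16), 16);
    return _mm_packs_epi32(lo, hi);
}

/* Hypothetical wrapper for trying the function on plain arrays. */
static void pack8(uint16_t out[8], const uint32_t in[8])
{
    __m128i lo = _mm_loadu_si128((const __m128i *)in);
    __m128i hi = _mm_loadu_si128((const __m128i *)(in + 4));
    _mm_storeu_si128((__m128i *)out, packus_epi32_sse2(lo, hi));
}
```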
>> I'd rather not duplicate a bunch of code from pixman-mmx.c, and I'd
>> rather not add #ifdef USE_SSE41 to pixman-sse2.c and make it a
>> compile-time option (or recompile the whole file to get a few
>> improvements from SSE4.1).
>>
>> It seems like we need a generic solution that would say for each
>> compositing function
>> - this is what you do for 1-byte;
>> - this is what you do for 8-bytes if you have MMX;
>> - this is what you do for 16-bytes if you have SSE2;
>> - this is what you do for 16-bytes if you have SSE3;
>> - this is what you do for 16-bytes if you have SSE4.1.
>> and then construct the functions for generic/MMX/SSE2/SSE4 at build time.
>>
>> Does this seem like a reasonable approach? *How* to do it -- suggestions
>> welcome.
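To make the per-width idea above a bit more concrete, here is a rough
sketch of what build-time construction could look like with a plain C
macro: one "step" kernel per width, and a macro that stitches a wide
loop and a one-pixel tail loop into a full function. Everything here
(DEFINE_SRC_8888_0565, step_1, step_4_sse2, the function names) is
hypothetical and not existing pixman code, and alignment handling is
left out by using unaligned loads.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* What you do for one pixel: scalar x8r8g8b8 -> r5g6b5. */
static inline void step_1(uint16_t *d, const uint32_t *s)
{
    d[0] = (uint16_t)(((s[0] >> 8) & 0xF800) |
                      ((s[0] >> 5) & 0x07E0) |
                      ((s[0] >> 3) & 0x001F));
}

/* What you do for 16 bytes (four pixels) if you have SSE2. */
static inline void step_4_sse2(uint16_t *d, const uint32_t *s)
{
    __m128i p = _mm_loadu_si128((const __m128i *)s);
    __m128i r = _mm_and_si128(_mm_srli_epi32(p, 8), _mm_set1_epi32(0xF800));
    __m128i g = _mm_and_si128(_mm_srli_epi32(p, 5), _mm_set1_epi32(0x07E0));
    __m128i b = _mm_and_si128(_mm_srli_epi32(p, 3), _mm_set1_epi32(0x001F));
    __m128i v = _mm_or_si128(_mm_or_si128(r, g), b);
    /* Sign-extend the low 16 bits so packssdw keeps them unchanged. */
    v = _mm_srai_epi32(_mm_slli_epi32(v, 16), 16);
    v = _mm_packs_epi32(v, v);
    _mm_storel_epi64((__m128i *)d, v);  /* four 16-bit pixels = 8 bytes */
}

/* Stitch a wide loop and a one-pixel tail loop into one function. */
#define DEFINE_SRC_8888_0565(name, WIDE, wide_step)                 \
static void name(uint16_t *dst, const uint32_t *src, int n)         \
{                                                                   \
    while (n >= (WIDE)) {                                           \
        wide_step(dst, src);                                        \
        dst += (WIDE); src += (WIDE); n -= (WIDE);                  \
    }                                                               \
    while (n-- > 0)                                                 \
        step_1(dst++, src++);                                       \
}

DEFINE_SRC_8888_0565(src_8888_0565_generic, 1, step_1)
DEFINE_SRC_8888_0565(src_8888_0565_sse2, 4, step_4_sse2)
```

An SSE4.1 variant would then only need its own step function, though
it would still have to be compiled with the matching compiler flags.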
>
> I think ideally we would generate this code at runtime. It's just not
> feasible to generate code for all combinations of instruction sets at
> build time and libpixman.so is already rather large. Generating the code
> at runtime has the additional advantages that it is not limited to a
> fixed set of fast paths and that it can make use of more details of the
> operation such as the precise alignment for palignr generation.
>
> There are various ways to go about this, ranging from simple-minded
> stitching-together of pre-written snippets to a full shader compiler. A
> full shader compiler is obviously a big project, but maybe a simple
> stitch-together kind of thing wouldn't actually be that hard using
> something like this:
>
> http://cgit.freedesktop.org/~sandmann/genrender/tree/codex86.h?h=graph
> http://cgit.freedesktop.org/~sandmann/genrender/tree/codex86.c?h=graph
>
> That said, runtime code-generation is still a big project, and it does
> make sense to make use of some of the newer instruction sets.
>
> We do have support for fallbacks, so as Makoto-san says, just adding new
> pixman-ssse3.c and pixman-sse41.c files with duplicated code for the
> particular operations that benefit from ssse3 and sse4.1 might be the
> simplest way to proceed.
Indeed, runtime generation would be great. Something like LLVM or orc
would be interesting options. I'm not sure I'm up to that kind of
project yet/now though.
I think adding pixman-sse*.c files is a reasonable measure for now.
Do you think it's okay to split the static inline support functions from
pixman-sse2.c out into a header to be shared with the other
pixman-sse*.c files?
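As a toy illustration of why the extra files only need the operations
that actually benefit: with a fallback chain, an SSE4.1 table can list
just src_8888_0565 and let everything else fall through to SSE2. The
types and names below (struct impl, struct fast_path, lookup) are
invented for this sketch; pixman's real implementation structures are
different.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fast path: returns the name of the ISA that handled it. */
typedef const char *(*fast_path_fn)(void);

struct fast_path { const char *op; fast_path_fn fn; };

struct impl {
    const struct fast_path *paths;  /* terminated by an entry with op == NULL */
    const struct impl *fallback;    /* next implementation to try, or NULL */
};

static const char *sse2_src_8888_0565(void)  { return "sse2"; }
static const char *sse2_over_8888_8888(void) { return "sse2"; }
static const char *sse41_src_8888_0565(void) { return "sse4.1"; }

static const struct fast_path sse2_paths[] = {
    { "src_8888_0565",  sse2_src_8888_0565 },
    { "over_8888_8888", sse2_over_8888_8888 },
    { NULL, NULL }
};

/* The SSE4.1 table lists only the operation that benefits. */
static const struct fast_path sse41_paths[] = {
    { "src_8888_0565", sse41_src_8888_0565 },
    { NULL, NULL }
};

static const struct impl sse2_impl  = { sse2_paths,  NULL };
static const struct impl sse41_impl = { sse41_paths, &sse2_impl };

/* Walk the chain from the most specialized implementation downwards
 * and return the first fast path registered for the operation. */
static fast_path_fn lookup(const struct impl *imp, const char *op)
{
    for (; imp != NULL; imp = imp->fallback)
        for (const struct fast_path *p = imp->paths; p->op != NULL; p++)
            if (strcmp(p->op, op) == 0)
                return p->fn;
    return NULL;
}
```

Looking up over_8888_8888 through sse41_impl lands on the SSE2 version,
so nothing needs to be duplicated for it.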
Also, are we planning to change the bilinear scaling algorithm for
0.28 so that we can use pmaddubsw?
Thanks,
Matt