[Pixman] [RFC] AVX codepaths

Sat May 21 15:22:12 PDT 2011

On Mon, May 16, 2011 at 11:00 AM, Soeren Sandmann <sandmann at cs.au.dk> wrote:
> Matt Turner <mattst88 at gmail.com> writes:
>
>> I took the SSE2 code paths and modified them slightly to generate AVX
>> code. AVX doesn't have integer operations like SSE2, so the range of
>> 256-bit operations is mostly limited to copying pixels.
>>
>> Following this email are three patches:
>>   1) AVX fetcher for x8r8g8b8 (a8 and r5g6b5 require integer unpack operations)
>>   2) AVX blt, composite_copy_area and associated fast paths
>>   3) AVX fill
>>
>> `make check` still passes all 18 tests.
>>
>> The only other bit of low-hanging fruit I see is
>> composite_src_x888_8888, but perhaps there are more fast paths
>> possible with AVX.
>
> Of the low-hanging fruit, this is probably the main one. One other might
> be to compile pixman-sse2.c with -mavx in order to take advantage of the
> VEX prefix for the integer SSE2 instructions. I'm not sure it would be
> worth the build system complications though.
>
> Of the higher-hanging fruit that AVX could be used for, gradients is the
> obvious candidate. Radial gradients in particular could become really
> seriously faster. I think Andrea has some ideas about this. The
> moonlight project has a copy of pixman with some gradient performance
> improvements using SSE. See this git repository:
>
>    https://github.com/mono/moon
>
> Maybe some of that code could be reused, but I think pixman master has
> diverged a bit from when they forked.
>
> It would also be useful to redo the wide path used for 10bpc formats to
> make it use single precision floating point instead. There is a branch
> here:
>
>    http://cgit.freedesktop.org/~ranma42/pixman/log/?h=wip/floatpipe4
>
> that has a start on this. Clearly, if this branch or something similar
> gets cleaned up and landed, AVX would be useful to accelerate it.
>
> Also, more generally, it would be interesting to move the test suite
> from the current bit-exact standard towards a more close-enough
> standard, so that floating point SIMD or GPUs or narrower integer
> arithmetic could be used in cases where we currently rely on
> precisely-8-bit integer arithmetic.

Sounds like a plan. It'll be fun to optimize this stuff for AVX.
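
To make the "close-enough" idea concrete, here is a minimal sketch of a
per-channel tolerance check for a8r8g8b8 pixels. The helper name and the
choice of a per-channel absolute difference are my own assumptions for
illustration; nothing like this exists in the pixman test suite as far
as I know.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not part of pixman: returns 1 if every 8-bit
 * channel of two a8r8g8b8 pixels differs by at most `tolerance`. */
static int
pixels_close_enough (uint32_t a, uint32_t b, int tolerance)
{
    int shift;

    /* Walk the four 8-bit channels: B, G, R, A. */
    for (shift = 0; shift < 32; shift += 8)
    {
        int ca = (a >> shift) & 0xff;
        int cb = (b >> shift) & 0xff;

        if (abs (ca - cb) > tolerance)
            return 0;
    }

    return 1;
}
```

With a tolerance of 0 this degenerates to the current bit-exact
comparison, so a test could presumably tighten or loosen the standard
per operation.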

>> I played around with cairo-perf-trace on my Sandy Bridge, but wasn't
>> able to see any improvements, which may be for any number of reasons,
>> including but not limited to a) I don't know how to use
>> cairo-perf-trace, b) AVX doesn't yield any noticeable improvements, c)
>> the code needs to be better optimized.
>
> It's pretty likely that these operations are already close to the memory
> bandwidth limit. There is a program in test/lowlevel-blt-bench that
> attempts to measure performance of various operations relative to memory
> bandwidth. Maybe it can produce some interesting results.

I'll try to do more testing and figure out what's going on. Perhaps
it's that since -march=native already turns on -mavx, I'm already
seeing some kind of improvement in the SSE2 code which is masking the
difference between it and the AVX paths.
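
One quick way to check that suspicion is to ask the compiler whether AVX
code generation is enabled for a given translation unit. GCC defines the
__AVX__ macro when -mavx is in effect, whether it was passed directly or
implied by -march. The function name below is only an illustration.

```c
/* Hypothetical helper: reports whether this translation unit was
 * compiled with AVX code generation enabled (e.g. -mavx, or a
 * -march=native that implies it). GCC defines __AVX__ in that case. */
static int
built_with_avx (void)
{
#ifdef __AVX__
    return 1;
#else
    return 0;
#endif
}
```

The same information is available without compiling anything by dumping
the predefined macros, e.g. `gcc -march=native -dM -E - < /dev/null |
grep __AVX__`.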

>> Still TODO are AVX cpuid detection and to figure out about SunCC
>> support.
>
> Regarding SunCC, feel free to just support GCC initially. The various
> types of compiler support really have to be maintained by people who use
> those compilers regularly.

Works for me.
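
For the cpuid part, a GCC-only sketch of what the detection might look
like follows. It is not taken from my patches: the function and macro
names are made up, and it assumes GCC's <cpuid.h>. The logic is the
usual one: CPUID leaf 1 must report both AVX and OSXSAVE in ECX, and
XCR0 must show that the OS saves both SSE and AVX register state.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#endif

#define CPUID_ECX_OSXSAVE (1u << 27) /* OS has enabled XSAVE/XGETBV */
#define CPUID_ECX_AVX     (1u << 28) /* CPU implements AVX */
#define XCR0_SSE_STATE    (1u << 1)  /* OS saves XMM state */
#define XCR0_AVX_STATE    (1u << 2)  /* OS saves upper YMM state */

/* Hypothetical helper: pure bit test, kept separate so it can be
 * checked without AVX hardware. */
static int
avx_usable_from_bits (uint32_t cpuid1_ecx, uint32_t xcr0)
{
    uint32_t need_ecx = CPUID_ECX_OSXSAVE | CPUID_ECX_AVX;
    uint32_t need_xcr0 = XCR0_SSE_STATE | XCR0_AVX_STATE;

    return (cpuid1_ecx & need_ecx) == need_ecx &&
           (xcr0 & need_xcr0) == need_xcr0;
}

/* Hypothetical runtime check; returns 0 on non-x86 or non-GCC builds. */
static int
have_avx (void)
{
#ifdef HAVE_X86_CPUID
    unsigned int eax, ebx, ecx, edx;
    uint32_t xcr0_lo, xcr0_hi;

    if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx))
        return 0;

    /* XGETBV faults unless the OS has set OSXSAVE, so test that first. */
    if (!(ecx & CPUID_ECX_OSXSAVE))
        return 0;

    /* XGETBV with ECX = 0 reads XCR0. The instruction is emitted as raw
     * bytes so that assemblers without the mnemonic still accept it. */
    __asm__ volatile (".byte 0x0f, 0x01, 0xd0"
                      : "=a" (xcr0_lo), "=d" (xcr0_hi)
                      : "c" (0));
    (void) xcr0_hi;

    return avx_usable_from_bits (ecx, xcr0_lo);
#else
    return 0;
#endif
}
```

Checking the CPUID AVX bit alone is not enough: on a kernel that does
not save the YMM registers, AVX instructions would fault even though the
CPU supports them.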

> I'll try to take a look at the patches later this week, but I don't have
> any Sandy Bridge hardware atm., so I can't actually run it.

In the meantime, I've pushed my patches to the avx-optimizations
branch of git://anongit.freedesktop.org/~mattst88/pixman. I also added
an AVX path for composite_src_x888_8888.
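
To show why x888 -> 8888 is within reach even without 256-bit integer
operations, here is a simplified sketch of one scanline. It is not the
code on the branch: the function name is made up, and it ignores
alignment and the rest of the pixman fast-path plumbing. The only work
per pixel is forcing the alpha byte to 0xff, which can be done with the
floating-point bitwise OR that AVX does provide.

```c
#include <stdint.h>

#ifdef __AVX__
#include <immintrin.h>
#endif

/* Hypothetical scanline helper: copies `width` x8r8g8b8 pixels from
 * src to dst, setting the alpha byte of each to 0xff. */
static void
src_x888_8888_line (uint32_t *dst, const uint32_t *src, int width)
{
    int i = 0;

#ifdef __AVX__
    /* 0xff000000 replicated into all eight 32-bit lanes. */
    const __m256 alpha =
        _mm256_castsi256_ps (_mm256_set1_epi32 ((int) 0xff000000u));

    /* Eight pixels per iteration, using unaligned loads and stores. */
    for (; i + 8 <= width; i += 8)
    {
        __m256 px = _mm256_castsi256_ps (
            _mm256_loadu_si256 ((const __m256i *) (src + i)));

        /* Bitwise OR in the float domain; the bits are not interpreted. */
        px = _mm256_or_ps (px, alpha);

        _mm256_storeu_si256 ((__m256i *) (dst + i),
                             _mm256_castps_si256 (px));
    }
#endif

    /* Scalar tail, and the whole line when built without AVX. */
    for (; i < width; i++)
        dst[i] = src[i] | 0xff000000u;
}
```

Built without -mavx, this falls back to the scalar loop, so the result
is the same either way and can be compared bit for bit.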

(There is also a branch with Alpha/MVI optimizations, but I'm not seeing
any improvements, and write-hint prefetching seems to be a massive
loss in terms of the number of cache misses, although the number of
cycles and instructions remains identical. No idea what's going on
yet.)

Thanks,
Matt