[Pixman] [RFC] AVX codepaths
sandmann at cs.au.dk
Mon May 16 08:00:44 PDT 2011
Matt Turner <mattst88 at gmail.com> writes:
> I took the SSE2 code paths and modified them slightly to generate AVX
> code. AVX doesn't have integer operations like SSE2, so the range of
> 256-bit operations is mostly limited to copying pixels.
> Following this email are three patches:
> 1) AVX fetcher for x8r8g8b8 (a8 and r5g6b5 require integer unpack operations)
> 2) AVX blt, composite_copy_area and associated fast paths
> 3) AVX fill
> `make check` still passes all 18 tests.
> The only other bit of low-hanging fruit I see is
> composite_src_x888_8888, but perhaps there are more fast paths
> possible with AVX.
Of the low-hanging fruit, this is probably the main one. One other might
be to compile pixman-sse2.c with -mavx in order to take advantage of the
VEX prefix for the integer SSE2 instructions. I'm not sure it would be
worth the build system complications though.
Of the higher-hanging fruit that AVX could be used for, gradients are the
obvious candidate. Radial gradients in particular could become
substantially faster. I think Andrea has some ideas about this. The
moonlight project has a copy of pixman with some gradient performance
improvements using SSE. See this git repository:
Maybe some of that code could be reused, but I think pixman master has
diverged a bit from when they forked.
It would also be useful to redo the wide path used for 10bpc formats to
make it use single precision floating point instead. There is a branch
that has a start on this. Clearly, if this branch or something similar
gets cleaned up and landed, AVX would be useful to accelerate it.
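To give a rough idea of what a single-precision wide path means for one
pixel, here is a minimal sketch that unpacks an a2r10g10b10 pixel into
four floats in [0, 1] and packs it back. The names argb_float_t,
unpack_a2r10g10b10, float_to_unorm and pack_a2r10g10b10 are made up for
this example; none of this is taken from the branch or from pixman
itself.

```c
#include <stdint.h>

/* Hypothetical wide pixel: one float per channel, each in [0, 1]. */
typedef struct { float a, r, g, b; } argb_float_t;

/* Unpack a2r10g10b10: 2 bits of alpha on top, then 10 bits each of r, g, b. */
static argb_float_t
unpack_a2r10g10b10 (uint32_t p)
{
    argb_float_t f;

    f.a = (float) ((p >> 30) & 0x3)   / 3.0f;
    f.r = (float) ((p >> 20) & 0x3ff) / 1023.0f;
    f.g = (float) ((p >> 10) & 0x3ff) / 1023.0f;
    f.b = (float) ( p        & 0x3ff) / 1023.0f;

    return f;
}

/* Clamp to [0, 1], scale to [0, max] and round to nearest. */
static uint32_t
float_to_unorm (float v, uint32_t max)
{
    if (v < 0.0f)
        v = 0.0f;
    if (v > 1.0f)
        v = 1.0f;

    return (uint32_t) (v * (float) max + 0.5f);
}

static uint32_t
pack_a2r10g10b10 (argb_float_t f)
{
    return (float_to_unorm (f.a, 3)    << 30) |
           (float_to_unorm (f.r, 1023) << 20) |
           (float_to_unorm (f.g, 1023) << 10) |
            float_to_unorm (f.b, 1023);
}
```

With eight floats per 256-bit register, an AVX version of such a path
could in principle handle two of these pixels per register.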
Also, more generally, it would be interesting to move the test suite
from the current bit-exact standard towards a more close-enough
standard, so that floating point SIMD or GPUs or narrower integer
arithmetic could be used in cases where we currently rely on
precisely-8-bit integer arithmetic.
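As a sketch of what a close-enough comparison could look like for
a8r8g8b8 pixels, the helpers below accept a result when every 8-bit
channel is within a given tolerance of the reference. The names
pixels_close_enough and images_close_enough are hypothetical and the
tolerance is left to the caller; this is not how the current test suite
works.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if each 8-bit channel of the two
 * a8r8g8b8 pixels differs by at most `tolerance`, 0 otherwise.
 */
static int
pixels_close_enough (uint32_t a, uint32_t b, int tolerance)
{
    int shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        int ca = (int) ((a >> shift) & 0xff);
        int cb = (int) ((b >> shift) & 0xff);

        if (abs (ca - cb) > tolerance)
            return 0;
    }

    return 1;
}

/* Compare two buffers of n_pixels pixels with the same per-channel rule. */
static int
images_close_enough (const uint32_t *expected,
                     const uint32_t *actual,
                     size_t          n_pixels,
                     int             tolerance)
{
    size_t i;

    for (i = 0; i < n_pixels; i++)
    {
        if (!pixels_close_enough (expected[i], actual[i], tolerance))
            return 0;
    }

    return 1;
}
```

A tolerance of 0 degenerates to the current bit-exact check, so a test
could be moved over one operation at a time.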
> I played around with cairo-perf-trace on my Sandy Bridge, but wasn't
> able to see any improvements, which may be for any number of reasons,
> including but not limited to a) I don't know how to use
> cairo-perf-trace, b) AVX doesn't yield any noticeable improvements, c)
> the code needs to be better optimized.
It's pretty likely that these operations are already close to the memory
bandwidth limit. There is a program in test/lowlevel-blt-bench that
attempts to measure performance of various operations relative to memory
bandwidth. Maybe it can produce some interesting results.
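For a very rough baseline to compare a copy path against, something like
the sketch below times plain memcpy over a large buffer. It is not a
substitute for lowlevel-blt-bench and does not reflect how that program
measures things; memcpy_bandwidth_mib_per_s is a hypothetical name, and
clock() measures CPU time with coarse resolution, so the buffer should be
much larger than the caches and the numbers taken with a grain of salt.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copies `bytes` bytes `repeats` times with memcpy
 * and returns the throughput in MiB/s. Returns -1.0 if bytes is 0, if
 * allocation fails, or if the clock is too coarse to measure the run.
 */
static double
memcpy_bandwidth_mib_per_s (size_t bytes, int repeats)
{
    uint8_t *src = malloc (bytes);
    uint8_t *dst = malloc (bytes);
    volatile uint8_t sink = 0;
    double result = -1.0;

    if (bytes > 0 && src && dst)
    {
        clock_t start, end;
        double seconds;
        int i;

        /* Touch both buffers first so page faults are not timed. */
        memset (src, 0x5a, bytes);
        memset (dst, 0, bytes);

        start = clock ();
        for (i = 0; i < repeats; i++)
        {
            memcpy (dst, src, bytes);
            /* Read back a byte so the copy cannot be optimized away. */
            sink ^= dst[bytes - 1];
        }
        end = clock ();

        seconds = (double) (end - start) / CLOCKS_PER_SEC;
        if (seconds > 0.0)
            result = (double) bytes * repeats / (1024.0 * 1024.0) / seconds;
    }

    free (src);
    free (dst);

    return result;
}
```

If an AVX blt and the SSE2 blt both land near the figure this reports for
the same buffer size, the operation is most likely bandwidth bound and
wider registers will not help much.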
> Still TODO are AVX cpuid detection and to figure out about SunCC
Regarding SunCC, feel free to just support GCC initially. The various
types of compiler support really have to be maintained by people who use
those compilers regularly.
I'll try to take a look at the patches later this week, but I don't have
any Sandy Bridge hardware atm., so I can't actually run it.