[Pixman] ARM iwmmxt patches

Sat Jul 23 18:26:24 PDT 2011

On Wed, Jul 20, 2011 at 10:43 PM, Matt Turner <mattst88 at gmail.com> wrote:
> Hi,
>
> This 3-patch series adds support for compiling pixman's pixman-mmx.c
> for ARM/iwmmxt for some performance improvements on iwmmxt-enabled ARM
> CPUs. This is done by taking advantage of the fact that gcc provides
> MMX-compatible _mm_*-style intrinsics for iwmmxt on ARM.

Interesting. This intrinsics compatibility with MMX probably explains
the otherwise rather questionable "Marvell argues that Wireless MMX2
penetration is higher than NEON" statement from
    http://www.anandtech.com/show/2860

> On my OLPC XO 1.75 (with a Marvell CPU), they pass the pixman test
> suite (verified that test suite passes on x86/MMX as well) and improve
> performance of most cairo-traces 7% or more. (See attached)

> For lowlevel-blit-bench, iwmmxt paths are not always faster, at times
> losing to ARMv6 or generic paths (but even ARMv6 is sometimes slower
> than generic...) but providing some massive speed-ups at times:

The lowlevel-blit-bench is not perfect. It only benchmarks the
'translucent' case (alpha is neither 0x00 nor 0xFF) and may be
misleading when used for performance tuning of anything other than
NEON optimizations (branches to handle the 0x00 and 0xFF alpha special
cases are extremely expensive on Cortex-A8, so it is a big challenge
to make any use of them there). The primary purpose of this benchmark
program was to compare CPU performance in compositing operations
against the available memory bandwidth, and to tweak software prefetch
if the target hardware does not have a properly working automatic
hardware prefetcher. You may experiment a bit with __builtin_prefetch()
in the iwmmxt code and probably improve performance of some of the
iwmmxt functions. But with software prefetch, skipping over
transparent pixels may no longer provide any memory bandwidth savings,
as explained in:
      http://lists.freedesktop.org/archives/pixman/2010-December/000790.html
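
To make the special cases and the prefetch trade-off above concrete,
here is a minimal C sketch of an OVER scanline for a8r8g8b8. It is not
pixman's actual code: the function names and PREFETCH_AHEAD (and its
value) are made up for illustration, and the pixels are assumed to be
valid premultiplied values so that the packed add cannot overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_AHEAD 16 /* in pixels; hypothetical value, needs tuning */

/* Scale every 8-bit channel of a packed pixel by a/255, rounded. */
static uint32_t mul_un8x4(uint32_t p, uint8_t a)
{
    uint32_t rb = (p & 0x00ff00ffu) * a + 0x00800080u;
    uint32_t ag = ((p >> 8) & 0x00ff00ffu) * a + 0x00800080u;

    rb = ((rb + ((rb >> 8) & 0x00ff00ffu)) >> 8) & 0x00ff00ffu;
    ag = (ag + ((ag >> 8) & 0x00ff00ffu)) & 0xff00ff00u;
    return rb | ag;
}

/* General ('translucent') case: dst = src + dst * (255 - alpha(src)) / 255. */
static uint32_t over_pixel(uint32_t s, uint32_t d)
{
    return s + mul_un8x4(d, (uint8_t)(255 - (s >> 24)));
}

static void over_8888_8888_line(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* Issued per pixel for brevity; once per cache line would do. */
        if (i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&src[i + PREFETCH_AHEAD], 0);
            /* This pulls dst into the cache even if that pixel later
             * turns out to be transparent and is skipped. */
            __builtin_prefetch(&dst[i + PREFETCH_AHEAD], 1);
        }
#endif
        uint32_t s = src[i];

        if ((s >> 24) == 0xff)
            dst[i] = s;                      /* opaque: dst is never read */
        else if (s != 0)
            dst[i] = over_pixel(s, dst[i]);  /* translucent: full math */
        /* s == 0: transparent, dst is neither read nor written */
    }
}
```

The two shortcuts give the same result as over_pixel() would, they
only avoid the arithmetic and the destination access. Whether the
branches pay off depends on the CPU and on the image data, which is
exactly what lowlevel-blit-bench does not measure.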

As for the possible reasons for ARMv6 slowness (mainly in cairo-traces,
which reflect real performance), it could be:
1. Delegate overhead. If an implementation provides only a few
optimized functions, fallbacks to the generic implementation happen
quite often, and this is not so cheap, especially for the combiners. I
observed this effect when running some MIPS benchmarks earlier.
2. It looks like the handling of special cases for 0x00 and 0xFF alpha
(controlled by the 'inner_branch' macro) is disabled now:
      http://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-simd.c?id=pixman-0.23.2#n119
3. The code is not perfect even for its original ARM11 target. A few
cycles can be saved here and there, as mentioned in the 'TODO' comment,
but nothing has been done yet:
      http://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-simd-asm.S?id=pixman-0.23.2#n51
4. The Marvell CPU may not like some particular instructions. For
example, the UXTAB/UXTAB16 instructions are very slow on Cortex-A9 and
should be avoided there. The peak throughput is only one UXTAB every
~2 cycles on Cortex-A9, while Cortex-A8 can execute two such
instructions per cycle and ARM11 can execute one per cycle.
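
For reference on item 4, here is a portable C model of what UXTAB16
computes: it extracts bytes 0 and 2 of the (optionally rotated) second
operand, zero-extends them to 16 bits and adds them to the two
halfwords of the first operand. This is only a sketch of the
semantics, not code from pixman; the helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right by n bits (n is taken modulo 32). */
static uint32_t ror32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x >> n) | (x << (32 - n)) : x;
}

/* Model of "UXTAB16 Rd, Rn, Rm, ROR #rot" (rot is 0, 8, 16 or 24). */
static uint32_t uxtab16(uint32_t rn, uint32_t rm, unsigned rot)
{
    uint32_t r  = ror32(rm, rot);
    uint32_t lo = (rn + (r & 0xffu)) & 0xffffu;                  /* byte 0 */
    uint32_t hi = ((rn >> 16) + ((r >> 16) & 0xffu)) & 0xffffu;  /* byte 2 */

    return (hi << 16) | lo;   /* each halfword wraps independently */
}
```

With a zero accumulator this splits a pixel p into 16-bit lanes:
uxtab16(0, p, 0) holds the red and blue channels and uxtab16(0, p, 8)
holds alpha and green. On a CPU where the instruction is slow, the same
split can be done with shifts and an AND against 0x00ff00ff, at the
cost of a constant register.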

The ARMv6 pixman optimizations are orphaned and have not really been
maintained for a long time. But if there is some interest in improving
performance on ARMv6 hardware, at least a few old patches could be
rebased, benchmarked and merged if they are still useful:
      http://cgit.freedesktop.org/~siamashka/pixman/log/?h=arm-optimizations-from-xomap

> gcc's current support for iwmmxt code generation is atrocious (See gcc
> bugs 35294, 36798, 36966), so I have patched gcc to add missing shift
> and logical iwmmxt instructions. I have seen patches posted improving
> gcc's iwmmxt support, so I hope that gcc-4.7 will be able to use
> pixman's iwmmxt code without trouble. (Reminds me as I write this that
> I need to modify the configure.ac test to use instructions that cause
> current gcc to crash.)

If OLPC XO 1.75 gets popular enough, it could be a good motivating
factor to fix iwmmxt support in gcc :) Still, using intrinsics relies
on the compiler to do register allocation and instruction scheduling,
and modern compilers are still not so great at that.
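
As a small illustration of the intrinsics approach, here is a sketch
of a saturating add of two pixel pairs written with an _mm_*
intrinsic. The same source is meant to build for x86 MMX and, with a
working gcc, for iwmmxt; which registers are used and how the
instructions are ordered is left entirely to the compiler. The
function name, the HAVE_MM_INTRINSICS macro and the scalar fallback
are additions for this example, not pixman code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__MMX__) || defined(__IWMMXT__)
#include <mmintrin.h>
#define HAVE_MM_INTRINSICS 1
#endif

/* Per-byte saturating add of two pairs of a8r8g8b8 pixels. */
static uint64_t add_sat_2px(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
#ifdef HAVE_MM_INTRINSICS
    __m64 va, vb, vr;

    memcpy(&va, &a, sizeof va);
    memcpy(&vb, &b, sizeof vb);
    vr = _mm_adds_pu8(va, vb);   /* the compiler picks registers here */
    memcpy(&r, &vr, sizeof r);
#if defined(__MMX__)
    _mm_empty();                 /* x86 only: release the x87/MMX state */
#endif
#else
    /* Scalar fallback so the example builds on any target. */
    for (int i = 0; i < 8; i++) {
        unsigned s = (unsigned)((a >> (8 * i)) & 0xff) +
                     (unsigned)((b >> (8 * i)) & 0xff);
        if (s > 0xff)
            s = 0xff;
        r |= (uint64_t)s << (8 * i);
    }
#endif
    return r;
}
```

In hand-written assembly the register assignment and the interleaving
of loads with arithmetic would be explicit; here they can only be
checked by reading the compiler output.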

-- 
Best regards,
Siarhei Siamashka