[cairo] [Pixman] Floating point API in Pixman

Jonathan Morton jonathan.morton at movial.com
Wed Aug 25 03:18:25 PDT 2010


> What kind of hardware did you test by the way? And how did you calculate memory
> bandwidth percentage (it may be a bit tricky because this operation is kind of
> asymmetric and reads 5 bytes per pixel, while only writing 2)?

Those particular numbers are from a 3 GHz Core 2 Duo.  Unfortunately
x86 is still the platform with the best compiler support, so the
numbers I can presently get for ARM and PPC are not as favourable
(bearing in mind that the GCC I have for PPC32 is rather ancient - I
will try to upgrade it).  Even on x86 the compiler is clearly
overwhelmed, judging by some of the output code and by the fact that
performance paradoxically goes *down* when given a solid source
and/or mask (seriously).  So much for "inline is as fast as a macro"
and "the compiler makes faster code than a human can".

Memory bandwidth is estimated by running memcpy() and counting the
total bytes loaded *and* stored per second.  It's just a guide for
estimating efficiency, and the figure does sometimes exceed 100% with
SIMD paths, since memcpy() is not necessarily the fastest possible
way to move data.  That's fine.
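
For concreteness, here is a minimal sketch of that measurement.  The
buffer size, iteration count and timer are my own arbitrary choices,
not the actual benchmark harness:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const size_t size  = 64 * 1024 * 1024;  /* well beyond any cache */
    const int    iters = 16;
    char *src = malloc(size);
    char *dst = malloc(size);
    if (!src || !dst)
        return 1;

    memset(src, 0x55, size);   /* fault the pages in before timing */
    memset(dst, 0xaa, size);

    clock_t t0 = clock();
    for (int i = 0; i < iters; i++)
        memcpy(dst, src, size);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* Each pass loads 'size' bytes and stores 'size' bytes, so the
     * total traffic counted is 2 * size * iters. */
    printf("approx. memory bandwidth: %.0f MB/s\n",
           2.0 * size * iters / secs / 1e6);

    free(src);
    free(dst);
    return 0;
}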

> But in any case, looks like you are setting the bar way too low and comparing
> very bad performance with even worse one here :)

And the comparison of "bad" with "worse" is valid, in that it is very
easy to accidentally trigger the "worse" cases with application code -
anything that is not yet implemented using a CPU-specific SIMD path,
and certainly anything that does not even have a pixman-fast-paths.c
entry.  I fully expect that SIMD paths written using full knowledge of
the CPU's capabilities will always outperform generic code - my aim is
merely to improve the generic code to reduce the need for CPU-specific
code.

> I don't see any way for this operation (btw, why did you select this one?) to be faster with a floating point implementation on ARM Cortex-A8 for example.

Comparing scalar fixed-point with the VFP unit on A8 will definitely
favour fixed-point, because the VFP instructions are non-pipelined for
no apparent reason (an oversight reportedly fixed in A9 and probably
A5 too).  Comparing NEON fixed-point with NEON floating-point will
*usually* favour fixed-point for simpler operations, because it can
simply do more of them per cycle and the setup overhead for
floating-point is high.  But if you use NEON instructions in a scalar
manner, as early versions of LLVM can, I think the balance is tipped
in favour of floating-point just as much as for full desktop CPUs.
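
To make the trade-off concrete, compare a single channel multiply in
both representations.  This is my own illustration, not pixman's
combiner code: the fixed-point version needs the usual
divide-by-255 dance, while the float version is one multiply plus the
u8 <-> float conversions that make up the setup overhead mentioned
above.

#include <stdint.h>

/* Exact (a * b) / 255 with rounding and no division - the standard
 * fixed-point trick for 8-bit channels. */
static inline uint8_t mul_fixed(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* The multiply itself is one instruction; the u8 <-> float
 * conversions on either side of it are the setup overhead. */
static inline uint8_t mul_float(uint8_t a, uint8_t b)
{
    float fa = a * (1.0f / 255.0f);
    float fb = b * (1.0f / 255.0f);
    return (uint8_t)(fa * fb * 255.0f + 0.5f);
}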

As for why I selected that operation specifically, it was a relatively
simple one which appeared in my benchmark run and illustrated my
point.  It happens to be simple enough that the compiler didn't choke
too much on all the inlined dead code.

Switching viewpoint from complete fast paths to combiners: the more
complex combiners (especially the PDF series, which probably get used
by Cairo - any numbers on that?) are definitely faster in
floating-point than in fixed-point, even when the penalty of the
additional memory bandwidth is taken into account.  Part of this is
due to a rewrite which improves algorithmic efficiency in a few cases
and uses single precision instead of double internally for some of
the nastier PDF cases.  That result holds up, to a limited extent,
even on a Cortex-A8 using the VFP, which is pretty much the worst
case available for this conversion.
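
As an illustration of why the nastier PDF operators lean toward
floats, consider a per-channel colour-dodge, which needs a true
division.  The sketch below is mine, not pixman's combiner
implementation, and it ignores alpha handling for brevity:

#include <stdint.h>

/* PDF colour-dodge per channel: min(1, d / (1 - s)), saturating
 * when s == 1.  In normalized floats the division is a single
 * instruction. */
static inline float dodge_float(float s, float d)
{
    if (s >= 1.0f)
        return 1.0f;
    float r = d / (1.0f - s);
    return r < 1.0f ? r : 1.0f;
}

/* The fixed-point version must widen, perform a real integer
 * division and clamp - considerably more work per channel. */
static inline uint8_t dodge_fixed(uint8_t s, uint8_t d)
{
    if (s == 255)
        return 255;
    uint32_t r = ((uint32_t)d * 255) / (255 - s);
    return r < 255 ? (uint8_t)r : 255;
}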

 - Jonathan Morton

