[cairo] pixman: New ARM NEON optimizations

Tue Oct 27 07:13:15 PDT 2009

> > Right, instruction scheduling is the main disadvantage to using the
> > assembler as opposed to intrinsics. Cortex A9 is out-of-order I
> > believe, so it will have different scheduling requirements than the
> > A8, which in turn likely has different scheduling requirements than
> > earlier in-order CPUs. Though, a reasonable approach might be to
> > assume that the A9 will not be very sensitive to scheduling at all
> > (due to it being out-of-order), and then simply optimize for the A8.
> 
> A9 has a big downside though: the NEON block was cut-down to make place
> for a full VFP block instead of the VFPlite block in the A8 cores, so
> using NEON will be slower compared to A8. Hopefully the out-of-order
> will make up for it, but pure NEON code will take a hit.

Where did you hear this? A8 does indeed have VFPlite (as opposed to A9's full VFP
implementation), but A9 certainly doesn't have a cut-down NEON.

Something to be aware of, however, is that A9 can't issue a VFP instruction until the previous
NEON instruction has cleared the pipeline (and vice versa), so interleaving NEON and VFP
really hurts on A9, whereas it made little difference on A8.
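
To make the ordering point concrete, here is a toy cost model in Python. It is only a sketch, not a cycle-accurate simulation: the drain penalty is a made-up number rather than a measured A9 figure, and the names (estimated_cycles, DRAIN_PENALTY) are hypothetical. It charges one cycle per instruction plus a fixed penalty each time the stream switches between NEON and VFP.

```python
# Toy model only. DRAIN_PENALTY is a hypothetical value chosen for
# illustration; it is not a documented or measured Cortex-A9 number.
DRAIN_PENALTY = 10


def estimated_cycles(stream, drain_penalty=DRAIN_PENALTY):
    """Estimate cycles for a sequence of "neon" / "vfp" instructions.

    Each instruction costs one cycle. Every switch from one unit to the
    other additionally costs drain_penalty cycles, standing in for the
    wait until the previous unit's instruction has left the pipeline.
    """
    cycles = 0
    previous = None
    for unit in stream:
        if unit not in ("neon", "vfp"):
            raise ValueError("unknown unit: %r" % (unit,))
        if previous is not None and unit != previous:
            cycles += drain_penalty  # switching units: wait for the drain
        cycles += 1
        previous = unit
    return cycles


# Same eight instructions, two different orderings.
interleaved = ["neon", "vfp"] * 4          # seven unit switches
grouped = ["neon"] * 4 + ["vfp"] * 4       # one unit switch

if __name__ == "__main__":
    print("interleaved:", estimated_cycles(interleaved))
    print("grouped:    ", estimated_cycles(grouped))
```

Under these assumptions the grouped ordering pays the penalty once while the interleaved one pays it on nearly every instruction, which is the reason to keep NEON and VFP work in separate batches when targeting A9. Passing drain_penalty=0 gives a rough stand-in for a core where the switch is cheap.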