[Pixman] [PATCH] ARM: NEON: optimization for bilinear scaled 'over 8888 8888'

Tue Apr 5 04:31:50 PDT 2011

On Mon, Apr 4, 2011 at 7:12 PM, Taekyun Kim <podain77 at gmail.com> wrote:
> I've done various experiments on PLD instruction.
> I removed cache preload in neon fast path functions and then benchmarked,
> there was no difference at performance.
> I tested some other neon functions (like memcpy) in similar way, but no
> difference at all.

I can try to explain the results from
http://lists.freedesktop.org/archives/pixman/2011-April/001156.html :)

First of all, the NEON unit in ARM Cortex-A8 was supposed to have direct
access to the L2 cache, as described in
http://www.arm.com/files/pdf/A8_Paper.pdf
You can check section "5.3 Non-blocking NEON loads" for more details.
But unfortunately early revisions of Cortex-A8 (r1pX), such as the one
used in the Nokia N900, had a hardware bug which required this direct
access to the L2 cache to be disabled by setting the L1NEON bit in the
Auxiliary Control Register:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/Bgbffjhh.html

That surely causes major differences when prefetch is not used. But
how come simple memcpy is even faster without any explicit prefetch?
My understanding is that this happens because NEON instructions are
executed with a significant delay relative to the ARM pipeline. So
basically, the flow of NEON instructions passes through the ARM
pipeline, where all the memory addresses for load/store instructions
are resolved, and then the NEON instructions are put into a separate
long queue to be actually executed much later. So for a simple copy,
the processor can easily see lots of VLD1 instructions in the queue,
understand that we are reading really far ahead and keep the memory
controller busy. In this case PLD instructions would just interfere
unnecessarily. But for more computationally intensive functions such
as bilinear scaling, the NEON queue is not flooded with that many VLD1
instructions far ahead (simply because we need to fill the queue with
arithmetic instructions too), so explicit prefetch is still needed.
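
To make the "arithmetic between loads" case a bit more concrete, here
is a rough C sketch of a vertical interpolation between two rows (one
step of bilinear scaling) with an explicit prefetch hint. This is not
the pixman code: the function name, the 64-byte cache line size and the
256-byte prefetch distance are assumptions picked for illustration, and
__builtin_prefetch is the GCC/Clang builtin, which is expected to turn
into PLD on ARM targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration only; real code would tune these. */
#define CACHE_LINE_BYTES     64   /* assumed cache line size */
#define PREFETCH_AHEAD_BYTES 256  /* hypothetical prefetch distance */

/*
 * Hypothetical helper: blend two source rows into dst.
 * wb is the weight of the bottom row in the range 0..256,
 * the top row gets the remaining weight (256 - wb).
 */
void interpolate_rows(uint8_t *dst, const uint8_t *top,
                      const uint8_t *bottom, size_t n, unsigned wb)
{
    const unsigned wt = 256 - wb;
    size_t i;

    assert(wb <= 256);

    for (i = 0; i < n; i++) {
        /*
         * Once per (assumed) cache line, hint that data further ahead
         * in both rows will be read soon. The bounds check keeps the
         * prefetch address inside the arrays.
         */
        if ((i % CACHE_LINE_BYTES) == 0 && i + PREFETCH_AHEAD_BYTES < n) {
            __builtin_prefetch(top + i + PREFETCH_AHEAD_BYTES, 0, 0);
            __builtin_prefetch(bottom + i + PREFETCH_AHEAD_BYTES, 0, 0);
        }
        /* The arithmetic that sits between consecutive loads. */
        dst[i] = (uint8_t)((top[i] * wt + bottom[i] * wb) >> 8);
    }
}
```

The prefetch calls do not change the result, they are only hints, so
the same loop can be benchmarked with and without them to see whether a
particular processor benefits.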

Anyway, there are many Nokia N900 users around, as well as users of
similar devices. Disabling prefetch for cases like simple copy,
where more modern Cortex-A8 processors do not strictly need it, would
cause a serious performance regression on older hardware.

> As I know coretex-a8 have preload engine (maybe not according to different
> SoC integration??)

ARM Cortex-A8 does not have automatic hardware prefetch. My
understanding is that prefetch has to be done explicitly, either with
PLD instructions or by programming the PLE (preload engine), which is
not normally accessible from userspace, so we can forget about it.

> but PLD is just an hint to the HW.
> So it is implementation dependent, right?

PLD should work on all Cortex-A8 systems unless it is disabled for
whatever reason (for example, to work around some bug).

-- 
Best regards,
Siarhei Siamashka