[pulseaudio-discuss] [PATCH 1/6 v3] core: Initialize ARM NEON code if available

Wed Oct 17 07:33:58 PDT 2012

On Wed, 17 Oct 2012 16:10:49 +0200 (CEST), Peter Meerwald
<pmeerw at pmeerw.net> wrote:
> Hello,
> 
>> Surprise! I'm reviewing this now. :p
> 
> indeed :)
> 
>> 1. v3 drops intrinsics in favour of inline asm -- is that for
>> performance reasons?
> 
> I noticed performance issues with certain compiler versions; inline asm
> offers more control/defined output; further, alignment annotations are
> not available with intrinsics -- currently they are not used because I'm
> not sure about the alignment guarantees of certain PA buffers; intrinsics
> could probably be added later if there is enough interest

First, you should prefetch data (PLD). Then you should schedule and
unroll the code. Only after that, bother optimizing aligned loads/stores.
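
To make the first two steps concrete, here is a rough sketch in plain C
rather than NEON asm. The function name, the prefetch distance and the
unroll factor are all made up for illustration and would need tuning
against the actual instruction timings; __builtin_prefetch is a GCC/Clang
builtin, which those compilers normally turn into PLD on ARM.

```c
#include <stddef.h>

/* Assumed prefetch distance in samples; a guess, not a measured value. */
#define PREFETCH_AHEAD 64

/* Hypothetical helper (not PulseAudio code): copy each mono sample
 * into the left and right slot of an interleaved stereo buffer. */
static void mono_to_stereo_float(float *dst, const float *src, size_t n)
{
    size_t i = 0;

    /* Main loop, unrolled by four samples per iteration. */
    for (; i + 4 <= n; i += 4) {
        /* Step 1: ask for data we will need soon. The bounds check
         * keeps the address expression inside the source buffer. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&src[i + PREFETCH_AHEAD], 0, 0);

        /* Step 2: issue all loads first, then all stores, so the
         * loads do not wait on the stores. */
        float a = src[i];
        float b = src[i + 1];
        float c = src[i + 2];
        float d = src[i + 3];

        dst[2 * i + 0] = a; dst[2 * i + 1] = a;
        dst[2 * i + 2] = b; dst[2 * i + 3] = b;
        dst[2 * i + 4] = c; dst[2 * i + 5] = c;
        dst[2 * i + 6] = d; dst[2 * i + 7] = d;
    }

    /* Tail: remaining 0..3 samples. */
    for (; i < n; i++) {
        dst[2 * i + 0] = src[i];
        dst[2 * i + 1] = src[i];
    }
}
```

The same ordering applies to the asm version: get PLD and the unrolled
load/store schedule right first, and look at alignment hints afterwards.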

>> 2. In the mono->stereo float case, the Cortex A9 code is actually
>> slower. I recall that in a previous thread, we had this sort of
>> situation on one of Panda/Beagleboard. Do we need some way to pick and
>> choose implementations?
> 
> I only have beagleboard-xm and pandaboard available as test platforms
> (Cortex A8 and A9, resp.)
> PATCH 2/6 now tests for A8 vs A9/A15/Axxx and chooses code accordingly
> 
> another issue is benchmarking: relative performance is different
> depending on the length of the buffers processed, whether they are cached

I'd read the Cortex-A8 instruction timing first to see how much unrolling
is needed. Otherwise, the implementation will be so far off from the
optimum that benchmarking is a bit silly.

> my target task involves stereo recording, resampling, int/float
> conversion, stereo-to-mono and mono-to-stereo mapping and I am seeing
> good speedups on both beagle- and pandaboard
>  
> I need to check the downmix to mono behaviour after 
> ff4af902cf4ac07c5f1da3b6dacbb3195c7c222d
>     resampler: Fix volume on downmix to mono
> 
>> 3. How shall we go about enabling this code? Have a configure time
>> check for some instructions that are needed, build it in if available,
>> and then run-time detection should pick the right code path?
> 
> I'd suggest to model after bluetooth/sbc: compile the *_neon.c files 
> always but only activate the NEON code if defined(__ARM_NEON__)
> 
> disadvantage is that we cannot have a common executable for
> NEON/non-NEON ARM CPUs -- I don't think this is a big constraint
> 
> Remi Denis-Courmont suggests to use .s assembler files to overcome this 
> issue; this would necessitate some configure options as well

With .s files, you can override the target FPU inline (.fpu neon),
overcoming the __ARM_NEON__ problem. As an alternative, you could compile
all the intrinsics and inline assembler code in a separate static
convenience library (noinst_LTLIBRARIES). Then you can pass different
CFLAGS than for the rest of the project.

> interestingly, on x86/AMD64 gcc can emit MMX/SSE code in inline asm even
> when the compiler itself is not enabled to generate such instructions --
> hence no .s files in PA so far

Yes and no. You can emit MMX/SSE irrespective of compiler flags, but you
cannot specify MM and XMM registers in the clobber list unless MMX and SSE
1 are enabled respectively. Without the clobbers, the compiler does not
know that those registers are overwritten, so you end up with potentially
invalid machine code. This is especially likely to break when you compile
for x86-64, where GCC uses SSE for regular floating point operations.

However, on x86, you can enable MMX or SSE on a per-function basis:
__attribute__((target("mmx")))
__attribute__((target("sse")))

Alas, this is not supported on ARM so far.
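
For illustration, a per-function SSE routine on x86 could look roughly like
the sketch below. All function names are invented, and it assumes a GCC or
Clang recent enough that <xmmintrin.h> can be included without -msse and
that __builtin_cpu_supports() exists; older compilers will reject it. On
other architectures only the plain C path is built.

```c
#include <stddef.h>

/* Plain C fallback, always available. Hypothetical helper name. */
static void scale_float_c(float *buf, size_t n, float gain)
{
    for (size_t i = 0; i < n; i++)
        buf[i] *= gain;
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <xmmintrin.h>
#define HAVE_SCALE_SSE 1

/* SSE is enabled for this one function only, whatever the CFLAGS of
 * the rest of the file say. */
__attribute__((target("sse")))
static void scale_float_sse(float *buf, size_t n, float gain)
{
    __m128 g = _mm_set1_ps(gain);
    size_t i = 0;

    /* Four floats per iteration, unaligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(buf + i);
        _mm_storeu_ps(buf + i, _mm_mul_ps(v, g));
    }

    /* Scalar tail. */
    for (; i < n; i++)
        buf[i] *= gain;
}
#endif

/* Run-time dispatch: only call the SSE variant if the CPU has SSE. */
static void scale_float(float *buf, size_t n, float gain)
{
#ifdef HAVE_SCALE_SSE
    if (__builtin_cpu_supports("sse")) {
        scale_float_sse(buf, n, gain);
        return;
    }
#endif
    scale_float_c(buf, n, gain);
}
```

Since the compiler knows SSE is enabled inside scale_float_sse(), it tracks
the XMM registers itself, which avoids the clobber list problem above.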

> at runtime there already is an env. var PULSE_NO_SIMD to disable
> optimized code path; further the output of /proc/cpuinfo is parsed to see
> if NEON is available (kind of pointless since it is a compile-time
> decision)

Yes, it is pointless as there is no guarantee that the code will run on
non-NEON processors.
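
A run-time check only becomes meaningful once the NEON code is isolated
(.s files or a separate library) so that the rest of the binary stays
NEON-free. In that case the gating could be sketched as below. The function
names are hypothetical and this is not the existing PA code; only
PULSE_NO_SIMD, /proc/cpuinfo and __ARM_NEON__ come from this thread, and
the "Features" line format is assumed to be the usual ARM Linux one.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return 1 if the text of /proc/cpuinfo has a
 * "Features" line that lists "neon" as a whole word, else 0. */
static int cpuinfo_has_neon(const char *text)
{
    const char *line = text;

    while (line && *line) {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        if (len >= 8 && strncmp(line, "Features", 8) == 0) {
            const char *end = line + len;
            const char *p = memchr(line, ':', len);

            if (p) {
                p++; /* skip the colon */
                while (p < end) {
                    /* Skip separators, then measure one token. */
                    while (p < end && (*p == ' ' || *p == '\t'))
                        p++;
                    const char *tok = p;
                    while (p < end && *p != ' ' && *p != '\t')
                        p++;
                    if (p - tok == 4 && strncmp(tok, "neon", 4) == 0)
                        return 1;
                }
            }
        }
        line = eol ? eol + 1 : NULL;
    }
    return 0;
}

/* Hypothetical gate: NEON is used only if it was built in, the user
 * has not set PULSE_NO_SIMD, and the CPU reports the feature. */
static int neon_usable(const char *cpuinfo_text)
{
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
    const int compiled_in = 1;
#else
    const int compiled_in = 0;
#endif
    if (getenv("PULSE_NO_SIMD"))
        return 0;
    return compiled_in && cpuinfo_has_neon(cpuinfo_text);
}
```

Note that testing __ARM_NEON__ here means the whole file is built with
NEON enabled, which is exactly the situation where the cpuinfo test buys
nothing; with a separately built NEON object, that macro check would be
replaced by a configure-time define.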

-- 
Rémi Denis-Courmont
Sent from my collocated server