[pulseaudio-discuss] [PATCH] core: Fix a litte-endian bug in ARM svolume code
Peter Meerwald
pmeerw at pmeerw.net
Tue Oct 23 05:28:45 PDT 2012
Hello Arun,
> > > > checking NEON volume_float32ne
> > > > NEON: 10223 usec.
> > > > ref: 46480 usec.
> > > > checking NEON volume_s16ne
> > > > NEON: 8484 usec.
> > > > ARM: 339272 usec.
> > > > ref: 20203 usec.
I was testing with SAMPLES 1019, while you are likely testing with
SAMPLES 1022:
Checking ARMv6 svolume (with 1019 samples)
func: 33923743 usec (min = 338868, max = 341919, stddev = 365.753).
orig: 2430664 usec (min = 24261, max = 24445, stddev = 42.2141).
Checking ARMv6 svolume (with 1022 samples)
func: 915036 usec (min = 9094, max = 9338, stddev = 50.2385).
orig: 2437988 usec (min = 24322, max = 24536, stddev = 48.1282).
> > > That's odd indeed. I have this on a Freescale i.mx53 (also Cortex A8)
> > > Checking ARM svolume
> > > func: 905150 usec (min = 9006, max = 9562, stddev = 76.1938).
> > > orig: 2278824 usec (min = 22760, max = 23252, stddev = 65.5575).
I get similar numbers with SAMPLES 1022 on a beagle-xm; I think you'll
see catastrophic runtime with SAMPLES 1019
comparing ARM vs. NEON code, the svolume s16 NEON code uses two MULs,
while ARM can do with one -- the ARM instructions (smulwb, ssat) look
ideal for the svolume_s16 code
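
for reference, the per-sample operation that smulwb plus ssat implement can
be sketched in plain C roughly as follows; this is a minimal sketch, not the
PulseAudio source: scale_s16 is a hypothetical name, and it assumes the
volume is a 16.16 fixed-point value (0x10000 meaning unity gain)

```c
#include <stdint.h>

/* Hypothetical helper for illustration; not taken from the PulseAudio tree.
 * Assumes volume is 16.16 fixed point, i.e. 0x10000 is unity gain. */
static int16_t scale_s16(int16_t sample, int32_t volume)
{
    /* Like smulwb: multiply a 32-bit word by a 16-bit halfword and keep the
     * top 32 bits of the 48-bit product. This relies on >> of a negative
     * value being an arithmetic shift, which is implementation-defined in C
     * but holds on the usual compilers. */
    int32_t t = (int32_t)(((int64_t)volume * sample) >> 16);

    /* Like ssat with a 16-bit width: saturate to the signed 16-bit range. */
    if (t > INT16_MAX)
        t = INT16_MAX;
    if (t < INT16_MIN)
        t = INT16_MIN;
    return (int16_t)t;
}
```

with sample 0x43e9 and volume 0x0000d716 this gives 0x390e, the reference
value shown in the cpu-test failure line quoted further down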
three observations:
(1) when the number of samples is odd, the ARM code processes the first
sample before switching to the unrolled 4-samples-at-a-time loop; this
causes the samples pointer to become misaligned (2-byte instead of 4-byte
aligned), assuming it was 4-byte aligned initially
I am not sure what guarantees PulseAudio gives on buffer alignment
(2) the NEON code generally fails when input data length < 4; can be
easily fixed
(3) neither ARM nor NEON code cares about alignment; just the strategy is
different
ARM handles cases where length % 4 != 0 first (before entering the
unrolled loop); which is bad when the sample buffer is aligned, since the
unrolled loop can then start at a misaligned address
NEON takes care of length % 4 != 0 for the last samples; which is good
when the sample buffer is aligned
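
the effect of the two strategies on alignment can be illustrated with a
little arithmetic; a minimal sketch, assuming 2-byte s16 samples, a buffer
that starts 4-byte aligned, and a 4-samples-at-a-time loop whose leftover
is n % 4 (all function names are hypothetical)

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offset at which the unrolled loop starts when the leftover samples
 * are processed first (the ARM strategy described above). */
static size_t unrolled_start_head_first(size_t n_samples)
{
    return (n_samples % 4) * sizeof(int16_t);
}

/* Byte offset at which the unrolled loop starts when the leftover samples
 * are processed last (the NEON strategy described above). */
static size_t unrolled_start_tail_last(size_t n_samples)
{
    (void)n_samples; /* the loop always starts at the buffer start */
    return 0;
}

/* True if an offset from a 4-byte aligned base is itself 4-byte aligned. */
static int is_4byte_aligned(size_t byte_offset)
{
    return byte_offset % 4 == 0;
}
```

with 1019 samples the head is 3 samples (6 bytes), so the unrolled loop
starts 2-byte aligned; with 1022 samples the head is 2 samples (4 bytes) and
the loop stays 4-byte aligned, which fits the runtimes above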
> > # ./cpu-test
> > Running suite(s): CPU
> > CPU flags: V6 V7 VFP EDSP NEON VFPV3 Cortex-A8
> > Initialising ARM optimized volume functions.
> > Checking ARM svolume
> > 0: 1ac8 != 390e (43e9 * 0000d716)
> > Orc not supported. Skipping
> > 50%: Checks: 2, Failures: 1, Errors: 0
> > tests/cpu-test.c:52:F:svolume:svolume_arm_test:0: Failed
> Does this include the little-endianness fix?
my fault; took the latest source, but failed to make sure that the proper
.so was linked -> it works with little-endian fixes actually deployed
> My current testing shows NEON svolume code with int16 samples
> consistently slower than the ARM code (tried on the Pandaboard, i.mx51,
> i.mx53, i.mx6) by ~10% in most cases.
I agree; I think the ARM code is pretty good for s16 handling plus I think
enabling NEON is expensive power-wise -- so svolume_s16_neon should be
dropped and svolume_s16_arm should be improved to handle odd buffers
nicely
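
one possible shape for this, as a plain C sketch rather than ARM assembly:
run the 4-samples-at-a-time loop from the start of the buffer and handle the
leftover samples afterwards; the names are hypothetical, the volume is
assumed to be 16.16 fixed point, and this is not a proposed patch

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scale one sample by a 16.16 fixed-point volume and
 * saturate to 16 bits. Relies on >> being an arithmetic shift for negative
 * values (implementation-defined in C, true on the usual compilers). */
static int16_t scale_clamp_s16(int16_t sample, int32_t volume)
{
    int32_t t = (int32_t)(((int64_t)volume * sample) >> 16);

    if (t > INT16_MAX)
        t = INT16_MAX;
    if (t < INT16_MIN)
        t = INT16_MIN;
    return (int16_t)t;
}

/* Hypothetical tail-last variant: the unrolled loop starts at the buffer
 * start, so an aligned buffer stays aligned for the whole loop. */
static void svolume_s16_tail_last(int16_t *samples, size_t n, int32_t volume)
{
    size_t i = 0;

    /* main loop, four samples per iteration */
    for (; i + 4 <= n; i += 4) {
        samples[i]     = scale_clamp_s16(samples[i], volume);
        samples[i + 1] = scale_clamp_s16(samples[i + 1], volume);
        samples[i + 2] = scale_clamp_s16(samples[i + 2], volume);
        samples[i + 3] = scale_clamp_s16(samples[i + 3], volume);
    }

    /* leftover 0..3 samples, processed last */
    for (; i < n; i++)
        samples[i] = scale_clamp_s16(samples[i], volume);
}
```

this only helps when the buffer itself starts 4-byte aligned; a buffer that
starts misaligned would still need separate handling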
I just ignored the ARM code before since I was always testing with SAMPLES
1019, thus hitting the worst case runtime-wise
regards, p.
--
Peter Meerwald
+43-664-2444418 (mobile)