[pulseaudio-discuss] [PATCH 11/11] remap: Add stereo to mono and 4-channel special case remapping
Alexander E. Patrakov
patrakov at gmail.com
Sat Apr 26 05:19:57 PDT 2014
24.04.2014 22:09, Peter Meerwald wrote:
> From: Peter Meerwald <p.meerwald at bct-electronic.com>
>
> The generic matrix remapping is rather inefficient; special-case code
> improves performance by 3x easily.
I have looked at this and the 10th patch. For 10/11, I have no
objections. 11/11 definitely works and improves things, but...
> +static void remap_stereo_to_mono_s16ne_c(pa_remap_t *m, int16_t *dst, const int16_t *src, unsigned n) {
> + unsigned i;
> +
> + for (i = n >> 2; i > 0; i--) {
> + dst[0] = (src[0] + src[1])/2;
> + dst[1] = (src[2] + src[3])/2;
> + dst[2] = (src[4] + src[5])/2;
> + dst[3] = (src[6] + src[7])/2;
> + src += 8;
> + dst += 4;
> + }
> + for (i = n & 3; i; i--) {
> + dst[0] = (src[0] + src[1])/2;
> + src += 2;
> + dst += 1;
> + }
> +}
Why are we doing the compiler's job here? Yes, I understand that there
are precedents of manually unrolling the loop here, but this actually
slows things down with -O3 on gcc-4.8.2! Here are my results regarding
stereo to mono s16ne conversions with different CFLAGS on an amd64
machine (Intel(R) Core(TM) i7-4770S forced to 3.9 GHz by Intel Turbo Boost).
The tests below are with the cpu-test rework patches applied (but not
reviewed).
With -O2 -pipe, and your code, I get:
Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 62098 usec (avg: 620.98, min = 612, max = 764, stddev = 20.9442).
orig: 125770 usec (avg: 1257.7, min = 1247, max = 1392, stddev = 24.9169).
With -O3 -pipe, and your code, I get:
Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 120105 usec (avg: 1201.05, min = 1157, max = 1472, stddev = 50.5987).
orig: 127543 usec (avg: 1275.43, min = 1234, max = 1682, stddev = 56.4764).
Now let's test this:
static void remap_stereo_to_mono_s16ne_c(pa_remap_t *m, int16_t *dst,
const int16_t *src, unsigned n) {
while (n--) {
dst[0] = (src[0] + src[1])/2;
src += 2;
dst += 1;
}
}
With -O2 -pipe:
Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 82468 usec (avg: 824.68, min = 814, max = 984, stddev = 23.8113).
orig: 126014 usec (avg: 1260.14, min = 1248, max = 1429, stddev = 27.8855).
With -O3 -pipe:
Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 57797 usec (avg: 577.97, min = 567, max = 687, stddev = 18.9386).
orig: 123601 usec (avg: 1236.01, min = 1219, max = 1377, stddev = 30.3412).
I.e. -O3 with the simplest possible implementation slightly beats your
hand-optimized loop here. probably because the compiler was smart enough
to insert some SSE2 stuff automatically.
The above should not be counted as an objection to your patch. We can
always clean up this and the existing hand-rolled code later.
Now waiting while clang-3.4 compiles...
--
Alexander E. Patrakov
More information about the pulseaudio-discuss
mailing list