[pulseaudio-discuss] [PATCH 11/11] remap: Add stereo to mono and 4-channel special case remapping

Sat Apr 26 05:46:28 PDT 2014

26.04.2014 18:19, Alexander E. Patrakov wrote:
> 24.04.2014 22:09, Peter Meerwald wrote:
>> From: Peter Meerwald <p.meerwald at bct-electronic.com>
>>
>> The generic matrix remapping is rather inefficient; special-case code
>> improves performance by 3x easily.
>
> I have looked at this and the 10th patch. For 10/11, I have no
> objections. 11/11 definitely works and improves things, but...
>
>> +static void remap_stereo_to_mono_s16ne_c(pa_remap_t *m, int16_t *dst,
>> const int16_t *src, unsigned n) {
>> +    unsigned i;
>> +
>> +    for (i = n >> 2; i > 0; i--) {
>> +        dst[0] = (src[0] + src[1])/2;
>> +        dst[1] = (src[2] + src[3])/2;
>> +        dst[2] = (src[4] + src[5])/2;
>> +        dst[3] = (src[6] + src[7])/2;
>> +        src += 8;
>> +        dst += 4;
>> +    }
>> +    for (i = n & 3; i; i--) {
>> +        dst[0] = (src[0] + src[1])/2;
>> +        src += 2;
>> +        dst += 1;
>> +    }
>> +}
>
> Why are we doing the compiler's job here? Yes, I understand that there
> are precedents of manually unrolling the loop here, but this actually
> slows things down with -O3 on gcc-4.8.2! Here are my results regarding
> stereo to mono s16ne conversions with different CFLAGS on an amd64
> machine (Intel(R) Core(TM) i7-4770S forced to 3.9 GHz by Intel Turbo
> Boost).
>
> The tests below are with the cpu-test rework patches applied (but not
> reviewed).
>
> With -O2 -pipe, and your code, I get:
>
> Checking special remap (s16, stereo->mono)
> Forced to use generic matrix remapping
> Using stereo to mono remapping
> Testing remap performance with 3 sample alignment
> func: 62098 usec (avg: 620.98, min = 612, max = 764, stddev = 20.9442).
> orig: 125770 usec (avg: 1257.7, min = 1247, max = 1392, stddev = 24.9169).
>
> With -O3 -pipe, and your code, I get:
>
> Checking special remap (s16, stereo->mono)
> Forced to use generic matrix remapping
> Using stereo to mono remapping
> Testing remap performance with 3 sample alignment
> func: 120105 usec (avg: 1201.05, min = 1157, max = 1472, stddev = 50.5987).
> orig: 127543 usec (avg: 1275.43, min = 1234, max = 1682, stddev = 56.4764).
>
> Now let's test this:
>
> static void remap_stereo_to_mono_s16ne_c(pa_remap_t *m, int16_t *dst,
> const int16_t *src, unsigned n) {
>      while (n--) {
>          dst[0] = (src[0] + src[1])/2;
>          src += 2;
>          dst += 1;
>      }
> }
>
> With -O2 -pipe:
>
> Checking special remap (s16, stereo->mono)
> Forced to use generic matrix remapping
> Using stereo to mono remapping
> Testing remap performance with 3 sample alignment
> func: 82468 usec (avg: 824.68, min = 814, max = 984, stddev = 23.8113).
> orig: 126014 usec (avg: 1260.14, min = 1248, max = 1429, stddev = 27.8855).
>
> With -O3 -pipe:
>
> Checking special remap (s16, stereo->mono)
> Forced to use generic matrix remapping
> Using stereo to mono remapping
> Testing remap performance with 3 sample alignment
> func: 57797 usec (avg: 577.97, min = 567, max = 687, stddev = 18.9386).
> orig: 123601 usec (avg: 1236.01, min = 1219, max = 1377, stddev = 30.3412).
>
> I.e. -O3 with the simplest possible implementation slightly beats your
> hand-optimized loop here, probably because the compiler was smart enough
> to insert some SSE2 stuff automatically.
>
> The above should not be counted as an objection to your patch. We can
> always clean up this and the existing hand-rolled code later.
>
> Now waiting while clang-3.4 compiles...
>
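For reference, the gcc averages quoted above can be turned into
orig-vs-func ratios with a few lines of C. This is only a sketch: the
helper names speedup() and print_gcc_speedups() are made up for
illustration and are not part of PulseAudio or cpu-test. The numbers are
the "avg" values from the four gcc-4.8.2 runs.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: how many times faster "func" is than "orig",
 * given the two average run times in microseconds. */
static double speedup(double func_avg_usec, double orig_avg_usec) {
    assert(func_avg_usec > 0.0);
    return orig_avg_usec / func_avg_usec;
}

/* Prints the ratios for the four gcc-4.8.2 runs quoted above. */
static void print_gcc_speedups(void) {
    printf("unrolled loop, -O2: %.2fx\n", speedup(620.98, 1257.70));
    printf("unrolled loop, -O3: %.2fx\n", speedup(1201.05, 1275.43));
    printf("simple loop,   -O2: %.2fx\n", speedup(824.68, 1260.14));
    printf("simple loop,   -O3: %.2fx\n", speedup(577.97, 1236.01));
}
```

Calling print_gcc_speedups() from main() is enough to see the four
ratios side by side.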

Simple code, clang, -O2 -pipe:

Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 82375 usec (avg: 823.75, min = 794, max = 1387, stddev = 90.7334).
orig: 134835 usec (avg: 1348.35, min = 1263, max = 2151, stddev = 110.471).

Simple code, clang, -O3 -pipe:

Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 80987 usec (avg: 809.87, min = 794, max = 1016, stddev = 30.6149).
orig: 130819 usec (avg: 1308.19, min = 1287, max = 1507, stddev = 38.0144).

Your code, clang, -O2 -pipe:

Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 63764 usec (avg: 637.64, min = 615, max = 946, stddev = 39.2402).
orig: 132069 usec (avg: 1320.69, min = 1302, max = 1658, stddev = 45.6867).

Your code, clang, -O3 -pipe:

Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 61143 usec (avg: 611.43, min = 598, max = 801, stddev = 32.9057).
orig: 130071 usec (avg: 1300.71, min = 1286, max = 1641, stddev = 43.0877).

OK, so on clang your code has its benefits. Keep it.
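
Since the choice between the two loops is purely about speed, a quick
sanity check that they produce identical output may be useful. Below is
a minimal sketch, assuming the pa_remap_t argument can be dropped
because neither loop uses it. The names stereo_to_mono_unrolled(),
stereo_to_mono_simple(), variants_agree() and MAX_FRAMES are
hypothetical and do not exist in the PulseAudio tree.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_FRAMES 64 /* hypothetical upper bound for this check only */

/* Hand-unrolled variant from the patch, without the unused pa_remap_t. */
static void stereo_to_mono_unrolled(int16_t *dst, const int16_t *src, unsigned n) {
    unsigned i;

    /* Four frames per iteration. */
    for (i = n >> 2; i > 0; i--) {
        dst[0] = (src[0] + src[1])/2;
        dst[1] = (src[2] + src[3])/2;
        dst[2] = (src[4] + src[5])/2;
        dst[3] = (src[6] + src[7])/2;
        src += 8;
        dst += 4;
    }
    /* Remaining 0..3 frames. */
    for (i = n & 3; i; i--) {
        dst[0] = (src[0] + src[1])/2;
        src += 2;
        dst += 1;
    }
}

/* Simplest possible variant, one frame per iteration. */
static void stereo_to_mono_simple(int16_t *dst, const int16_t *src, unsigned n) {
    while (n--) {
        dst[0] = (src[0] + src[1])/2;
        src += 2;
        dst += 1;
    }
}

/* Returns 1 if both variants give the same output for n frames. */
static int variants_agree(unsigned n) {
    int16_t src[2 * MAX_FRAMES], a[MAX_FRAMES], b[MAX_FRAMES];
    unsigned i;

    assert(n <= MAX_FRAMES);

    /* Deterministic pseudo-samples covering the full int16_t range. */
    for (i = 0; i < 2 * n; i++)
        src[i] = (int16_t) ((int) (i * 7919u % 65536u) - 32768);

    stereo_to_mono_unrolled(a, src, n);
    stereo_to_mono_simple(b, src, n);

    /* Only the first n output frames were written, so compare just those. */
    return memcmp(a, b, n * sizeof(int16_t)) == 0;
}
```

Frame counts that are not multiples of 4 (e.g. 1, 3, 7) exercise the
remainder loop of the unrolled variant, so those are the interesting
values to pass to variants_agree().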

-- 
Alexander E. Patrakov