ORC: no way to accumulate 64 bit (8 bytes)?

Wed Jul 6 14:18:53 UTC 2016

Hi Sebastian.

Thanks for answering.

On Wed, Jul 6, 2016 at 8:13 AM, Sebastian Dröge <sebastian at centricular.com>
wrote:

> On Di, 2016-07-05 at 15:43 +0200, Peter Maersk-Moller wrote:
> > But to no avail. So I can declare an 8 byte accumulator, I just can't
> > accumulate in it? Is that the case?
> There's no 64 bit accumulator opcode, correct:
> https://gstreamer.freedesktop.org/data/doc/orc/orc-opcodes.html
>
> accw, accl and accsadubl are the only ones currently. Adding new ones
> shouldn't be that much effort though, as long as it can be implemented
> at least for SSE and NEON.
>

It ought to be trivial; however, it might not provide any speedup. The devil
is in the details.

That said, it is possible to emulate the 8 byte accumulator. Here is an
example. The original C code (simplified, no checks) is shown below, where
*buf->rms[i]* are unsigned 64 bit integers and the result is the RMS squared
(i.e. the mean of the squares; you still need to take the square root):

void MakeRMS(audio_buffer_t* buf) {
        u_int32_t samples_per_channel = buf->len /
                  (sizeof(int32_t) * buf->channels);
        for (u_int32_t i=0 ; i < buf->channels; i++) {
                buf->rms[i] = 0;
                int32_t* sample = ((int32_t*)buf->data) + i;
                for (u_int32_t j=0; j < samples_per_channel ; j++) {
                        /* widen first so the square cannot overflow int32_t */
                        buf->rms[i] += (int64_t)(*sample) * (*sample);
                        sample += buf->channels;
                }
                buf->rms[i] /= samples_per_channel;
        }
}

In Orc, where buf->channels == 1, the inner loop can be replaced with this
Orc function (on little-endian hardware), taking into account that each
sample is a 16 bit signed integer value stored in a signed 32 bit integer:

.function audio_rms_orc_one_channel
.source 4 src int32_t
.accumulator 4 lowres
.accumulator 4 highres
.temp 4 squared
.temp 2 low2
.temp 2 high2
.temp 4 low4
.temp 4 high4
mulll     squared src src
select0lw low2 squared
select1lw high2 squared
convuwl   low4 low2
convuwl   high4 high2
accl      lowres low4
accl      highres high4

Then the squared RMS value can be calculated from the two accumulator
results (low_rms from lowres, high_rms from highres) as

buf->rms[0] = (low_rms +(((u_int64_t)high_rms)<<16))/samples_per_channel;
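
As a rough cross-check of the split-accumulator idea, here is a minimal
plain C sketch that mirrors the Orc program above step by step and compares
it with a direct 64 bit sum. The function names are hypothetical, and the
sketch uses unsigned 32 bit accumulators as an assumption (to avoid signed
overflow in C). It assumes 16 bit sample values and at most 65536 samples
per call, so that neither 32 bit accumulator can wrap.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference: accumulate the squares directly in a 64 bit integer. */
uint64_t sum_squares_direct(const int32_t *src, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (uint64_t)((int64_t)src[i] * src[i]);
    return acc;
}

/* Emulation: two 32 bit accumulators, one per 16 bit half of the square.
 * Assumes |src[i]| <= 32768 and n <= 65536 (hypothetical limits chosen so
 * that lowres and highres cannot overflow). */
uint64_t sum_squares_split(const int32_t *src, size_t n)
{
    uint32_t lowres = 0, highres = 0;
    for (size_t i = 0; i < n; i++) {
        /* mulll: low 32 bits of the product */
        uint32_t squared = (uint32_t)src[i] * (uint32_t)src[i];
        lowres  += squared & 0xFFFFu;  /* select0lw, convuwl, accl */
        highres += squared >> 16;      /* select1lw, convuwl, accl */
    }
    /* Recombine exactly as in the formula above. */
    return (uint64_t)lowres + ((uint64_t)highres << 16);
}
```

Dividing the returned sum by samples_per_channel gives the squared RMS, as
in the C function earlier in this mail.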

However, this Orc code is slower than the plain C version 9 out of 10 times
when run on arrays of 2048 samples and measured with gettimeofday() (not the
optimal way, I know, but it gives you a hint, with certain limitations) on
an older dual core laptop. So developing an RMS function for multiple
interleaved channels seems rather pointless. Of course, if most of these
opcodes could be replaced by a single SSE/NEON instruction adding a 4 byte
integer to an 8 byte accumulator, timing might improve ... maybe ...
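
For anyone wanting to repeat this kind of measurement, here is a minimal
sketch of a gettimeofday() based timing helper. The names elapsed_usec,
bench_fn and time_best_of are hypothetical, and taking the best of several
runs is only one assumed way to reduce scheduling noise; it is not the
harness used for the numbers above.

```c
#include <stddef.h>
#include <sys/time.h>

/* Microseconds elapsed between two gettimeofday() readings. */
double elapsed_usec(const struct timeval *start, const struct timeval *end)
{
    return (end->tv_sec - start->tv_sec) * 1e6
         + (end->tv_usec - start->tv_usec);
}

/* Hypothetical callback type: the code under test, with a context pointer. */
typedef void (*bench_fn)(void *ctx);

/* Run fn(ctx) `repeats` times and return the shortest wall-clock time in
 * microseconds, or -1.0 if repeats <= 0. gettimeofday() is not monotonic,
 * so a clock adjustment during a run can distort a reading. */
double time_best_of(bench_fn fn, void *ctx, int repeats)
{
    double best = -1.0;
    for (int r = 0; r < repeats; r++) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        fn(ctx);
        gettimeofday(&t1, NULL);
        double us = elapsed_usec(&t0, &t1);
        if (best < 0.0 || us < best)
            best = us;
    }
    return best;
}
```

Calling time_best_of() once with the C loop and once with the Orc function
on the same 2048 sample buffer gives two numbers that can be compared
directly.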

Anyway, does GStreamer use Orc code for audio manipulation and, if yes,
have you measured that it is actually worth it? I tried to see if
GStreamer has an RMS module, but it appears that it does not (or I just
haven't looked closely enough).

Best regards
Peter