[gst-devel] pulsesink optimizations

Thu Oct 15 12:58:41 CEST 2009

On Thu, 2009-10-15 at 13:01 +0300, Felipe Contreras wrote:
> On Thu, Oct 15, 2009 at 3:18 AM, René Stadler <mail at renestadler.de> wrote:
> > pl bossart wrote:
> >> That's not necessarily an option. There are 3rd party decoders out
> >> there whose code is not necessarily public. And fixing the decoders is
> >> somewhat odd when the real problem is the sink...
> >
> > The sink is not perfect, but the decoder situation also needs work. Current
> > decoders choose the output buffer sizes themselves, and this is wrong. Yes
> > you could change the sink and stitch these buffers together using pad_alloc,
> > but the fact remains that the decoder picks the size and therefore decides
> > on the overhead up to the sink (and all processing elements between decoder
> > and sink).
> >
> > This became apparent to me when Felipe profiled OggVorbis playback with a
> > highly optimized decoder (ffmpeg). Basically the CPU spends an insane amount
> > of time pushing GStreamer buffers around compared to the actual audio decoding.
> > And this on the N900, which shows exactly that the current situation is
> > complete nonsense for a battery-powered device.
> 
> Indeed. I profiled the audio pipeline and 30% of the time was spent in
> the decoder; the rest was spent pushing buffers around. When I
> increased the size of the buffers pushed by the decoder (128k), efficiency
> improved: the decoder's share is now 45%, but still, 55% of CPU time spent
> pushing buffers around is unacceptable.
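
To put rough numbers on the buffer-size effect, here is a back-of-the-envelope
sketch. It assumes 44.1 kHz, 16-bit stereo output, reads "128k" as 128 KiB, and
uses 4 KiB as a stand-in for a small decoder buffer. All of these values and
the function name are assumptions for illustration, not figures from the
profile quoted above.

```python
def buffers_per_second(buffer_bytes, rate=44100, channels=2, bytes_per_sample=2):
    """Number of buffers a decoder must push per second of decoded audio."""
    bytes_per_second = rate * channels * bytes_per_sample
    return bytes_per_second / buffer_bytes


# Hypothetical buffer sizes: 4 KiB (small) versus 128 KiB (large).
small = buffers_per_second(4 * 1024)
large = buffers_per_second(128 * 1024)

print(f"4 KiB buffers:   {small:.1f} pushes/s")
print(f"128 KiB buffers: {large:.2f} pushes/s")
print(f"reduction:       {small / large:.0f}x fewer pushes")
```

If the per-push cost is roughly constant, the number of pushes per second is
what the pipeline overhead scales with, which is why larger decoder buffers
help.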

It's now also possible to push multiple buffers at the same time by
using buffer lists. I don't know if that solves anything here; I
guess it depends on the number of encoded frames that a decoder
receives.

We could for example change the ogg demuxer to push all packets in a
page in a bufferlist and make the vorbisdecoder decode the complete list
before pushing the list of samples to the sink. This would reduce the
number of objects that get pushed around between elements.
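
As a toy model of that idea (plain Python, not the GStreamer buffer list API),
the sketch below counts how many times a downstream element gets called when
packets are pushed one by one versus once per page. The class, the function
names and the page layout of 5 pages with 20 packets each are all hypothetical.

```python
class CountingSink:
    """Stand-in for a downstream element; counts calls and items received."""

    def __init__(self):
        self.calls = 0
        self.items = 0

    def push(self, buffer):
        # One call per buffer, as with a plain per-buffer push.
        self.calls += 1
        self.items += 1

    def push_list(self, buffers):
        # One call for a whole batch, as with a buffer list.
        self.calls += 1
        self.items += len(buffers)


def push_per_packet(pages, sink):
    for page in pages:
        for packet in page:
            sink.push(packet)


def push_per_page(pages, sink):
    for page in pages:
        if page:  # skip empty pages rather than pushing an empty list
            sink.push_list(list(page))


# Hypothetical stream: 5 pages of 20 packets each.
pages = [[b"pkt"] * 20 for _ in range(5)]

one_by_one, batched = CountingSink(), CountingSink()
push_per_packet(pages, one_by_one)
push_per_page(pages, batched)

print("per-packet calls:", one_by_one.calls)
print("per-page calls:  ", batched.calls)
```

Both sinks receive the same number of packets; only the number of calls
between elements changes, which is the overhead a buffer list would cut.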

Also, pushing buffers should not be as expensive as 55%. I don't know
what's happening there. Maybe the gobject allocation is what's slowing
it down (patches exist for glib)? Maybe the locking/refcounting is slow
(these should be simple atomic operations on the N900)? Maybe the
typechecking (patches for that also exist for glib)? It would be nice to
know what's causing this overhead in more detail.

With newer pulse we can't really use buffer_alloc to write directly into
the pulse shared memory from the decoder because there is no API to
allocate such a chunk. There is pa_stream_begin_write(), but that can
only be called once AFAIK.

Wim

> 
> Profiling what happens on pulseaudio side is an exercise I haven't
> done yet, but as René said, my guess is that there's some ideal buffer
> size that pulseaudio would like to receive from the decoder that's big
> enough for GStreamer not to choke on it.
> 
> Removing the queue from the decoder to the sink would also help to
> avoid the unnecessary overhead of mutex contention, especially on small
> buffers.
> 
> Ideally I guess the sink should be able to receive small buffers
> without performance penalty, but currently that doesn't seem to be
> the case.
> 
> Cheers.
>