[gst-devel] pulsesink optimizations

Thu Oct 15 20:40:01 CEST 2009

On Thu, Oct 15, 2009 at 1:58 PM, Wim Taymans <wim.taymans at gmail.com> wrote:
> On Thu, 2009-10-15 at 13:01 +0300, Felipe Contreras wrote:
>> On Thu, Oct 15, 2009 at 3:18 AM, René Stadler <mail at renestadler.de> wrote:
>> > pl bossart wrote:
>> >> That's not necessarily an option. There are 3rd party decoders out
>> >> there whose code is not necessarily public. And fixing the decoders is
>> >> somewhat odd when the real problem is the sink...
>> >
>> > The sink is not perfect, but the decoder situation also needs work. Current
>> > decoders choose the output buffer sizes themselves, and this is wrong. Yes
>> > you could change the sink and stitch these buffers together using pad_alloc,
>> > but the fact remains that the decoder picks the size and therefore decides
>> > on the overhead up to the sink (and all processing elements between decoder
>> > and sink).
>> >
>> > This became apparent to me when Felipe profiled OggVorbis playback with a
>> > highly optimized decoder (ffmpeg). Basically the CPU spends an insane amount
>> > of time pushing GStreamer buffers around compared to the actual audio decoding.
>> > And this on the N900, which shows exactly that the current situation is
>> > complete nonsense for a battery-powered device.
>>
>> Indeed. I profiled the audio pipeline and 30% of the time was spent on
>> the decoder, the rest was spent pushing buffers around. When I
>> increased the size of the buffers pushed by the decoder (128k) efficiency
>> improved: the decoder's share is now 45%, but still, 55% of CPU time spent
>> pushing buffers around is unacceptable.
>
> It's now also possible to push multiple buffers at the same time by
> using the buffer lists. I don't know if that solves anything here, I
> guess it depends on the amount of encoded frames that a decoder
> receives.
>
> We could for example change the ogg demuxer to push all packets in a
> page in a bufferlist and make the vorbisdecoder decode the complete list
> before pushing the list of samples to the sink. This would reduce the
> amount of objects that get pushed around between elements.

I don't think that would help. What we need is to push big buffers to
pulsesink, regardless of what we receive on the input. It's not a big
problem for the decoder to fill some temporary buffer.
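
To make the "temporary buffer" idea concrete, here is a minimal sketch in
plain C. It is not GStreamer or PulseAudio code: the accumulator type, the
acc_* functions and the flush callback are all hypothetical names made up
for illustration. Small decoded chunks are copied into one big block, and
the sink callback only runs when the block is full (or at end of stream).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback standing in for "write one big buffer to the sink". */
typedef void (*flush_fn)(const unsigned char *data, size_t len, void *user);

/* Hypothetical accumulator; not part of any GStreamer or PulseAudio API. */
typedef struct {
    unsigned char *data;
    size_t used;
    size_t capacity;
    flush_fn flush;
    void *user;
} accumulator;

static int acc_init(accumulator *a, size_t capacity, flush_fn flush, void *user)
{
    a->data = malloc(capacity);
    if (a->data == NULL)
        return -1;
    a->used = 0;
    a->capacity = capacity;
    a->flush = flush;
    a->user = user;
    return 0;
}

/* Hand whatever is stored to the sink callback (also used at end of stream). */
static void acc_drain(accumulator *a)
{
    if (a->used > 0) {
        a->flush(a->data, a->used, a->user);
        a->used = 0;
    }
}

/* Copy one small decoded chunk in; flush every time the block fills up. */
static void acc_push(accumulator *a, const unsigned char *chunk, size_t len)
{
    while (len > 0) {
        size_t room = a->capacity - a->used;
        size_t n = len < room ? len : room;

        memcpy(a->data + a->used, chunk, n);
        a->used += n;
        chunk += n;
        len -= n;
        assert(a->used <= a->capacity);
        if (a->used == a->capacity)
            acc_drain(a);
    }
}

static void acc_free(accumulator *a)
{
    free(a->data);
    a->data = NULL;
    a->used = a->capacity = 0;
}

/* Demo sink: only counts how many writes it got and how many bytes. */
typedef struct {
    size_t writes;
    size_t bytes;
} sink_stats;

static void count_write(const unsigned char *data, size_t len, void *user)
{
    sink_stats *s = user;
    (void)data;
    s->writes++;
    s->bytes += len;
}

/* Feed `chunks` chunks of `chunk_len` bytes through a `capacity`-byte block. */
static sink_stats run_demo(size_t chunks, size_t chunk_len, size_t capacity)
{
    sink_stats s = {0, 0};
    accumulator a;
    unsigned char *chunk = calloc(1, chunk_len);
    size_t i;

    if (chunk == NULL || acc_init(&a, capacity, count_write, &s) != 0) {
        free(chunk);
        return s;
    }
    for (i = 0; i < chunks; i++)
        acc_push(&a, chunk, chunk_len);
    acc_drain(&a); /* end of stream: push out the partial block */
    acc_free(&a);
    free(chunk);
    return s;
}
```

As an illustration (the numbers are assumptions, not from my profile): with
run_demo(100, 4608, 131072), i.e. 100 chunks of 4608 bytes (one MP3 frame
of 16-bit stereo) and a 128k block, the sink callback runs 4 times instead
of 100. The price is one extra memcpy per chunk.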

> Also, pushing buffers should not be as expensive as 55%. I don't know
> what's happening there. Maybe the gobject allocation is what's slowing
> it down (patches exist for glib)? Maybe the locking/refcounting is slow
> (should be simple atomic operations on N900)? Maybe the typechecking
> (also patches exist for glib)? It would be nice to know what's causing
> this overhead in more detail.

It seems to me that allocating and unrefing buffers is taking much
more time than expected:
http://people.freedesktop.org/~felipec/profile/mp3-1.png

> With newer pulse we can't really use buffer_alloc to write directly into
> the pulse shared memory from the decoder because there is no api to
> allocate such a chunk, there is pa_stream_begin_write() but that can
> only be called once AFAIK.

I think memcpy is the least grave of the problems right now.

-- 
Felipe Contreras