[gst-devel] pulsesink optimizations

Fri Oct 16 03:39:39 CEST 2009

On Wed, 14.10.09 14:44, pl bossart (bossart.nospam at gmail.com) wrote:

> Hi folks,
> I noticed performance issues due to the rewrite of pulsesink since the
> 0.10.15 release. The degradation is in the 30% range on my Atom board
> when playing MP3/AAC. There have been a couple of modifications in git
> related to buffer attributes and latency settings, but overall the
> overhead remains, and the pulsesink code could be further optimized
> for low-power playback apps that don't care about latency.
> 
> I finally took the time to look at the code and check what was going
> on. It seems that the overhead is mainly due to the granularity of
> transfers between pulsesink and PulseAudio. What happens is that the
> sink waits for space available in the PulseAudio buffer. When PA
> requests data in a callback, the mainloop unblocks and the sink writes
> its PCM to PulseAudio. The problem is that the sink will not try to
> fill the whole buffer before handing off the data to PulseAudio. For
> example, say PulseAudio requests 100k (as defined by minreq) and you
> are doing MP3 decode, you are going to send one frame (4608 bytes) at
> a time to PulseAudio until the 100k have been filled. That's a lot of
> overhead. It would be a lot more efficient power-wise to decode and
> store as many frames as possible into the PA buffer before calling
> pa_stream_write().

This is mostly correct. But actually finding the right buffer sizes to
send to PA is a science of its own.

If you have to fill a 2s buffer and you calculate audio for that all
in one step and send it in one packet to PA, then you might have to do
some CPU-intensive work for quite some time (e.g. decoding AC3), during
which PA might run out of data to play (an underrun). So the general
rule is to send packets as big as possible, but not to block for too
long while producing them. This is of course a very imprecise
definition.
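
To make that rule a bit more concrete, here is a minimal sketch in
plain C with no libpulse dependency. Everything in it is an assumption
for illustration: the helper name bounded_chunk(), the idea of
measuring decoder throughput in output bytes per second, and the
"spend at most half of the queued playback time" margin are
hypothetical and not part of any PulseAudio or GStreamer API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many bytes to decode before the next write.
 *   wanted               bytes the server asked for
 *   queued_usec          playback time still buffered on the server
 *   decode_bytes_per_sec measured decoder throughput (output bytes/s)
 *   frame_bytes          size of one decoded frame, e.g. 4608 for MP3
 * The result is a whole number of frames and never less than one. */
size_t bounded_chunk(size_t wanted, uint64_t queued_usec,
                     uint64_t decode_bytes_per_sec, size_t frame_bytes)
{
    assert(frame_bytes > 0);

    /* Assumed safety margin: decode for at most half of the time
     * the server can keep playing from what it already has. */
    uint64_t budget_usec = queued_usec / 2;
    uint64_t limit = decode_bytes_per_sec * budget_usec / 1000000u;

    size_t chunk = wanted;
    if (limit < chunk)
        chunk = (size_t) limit;

    /* Round down to whole frames, but always make some progress. */
    chunk -= chunk % frame_bytes;
    if (chunk < frame_bytes)
        chunk = frame_bytes;

    return chunk;
}
```

With 2s queued and a fast decoder the full request is decoded in one
go (rounded down to whole frames); with only 20ms queued the chunk
shrinks to a couple of frames, so the write happens sooner.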

Also, for optimizing the data transfer via SHM you shouldn't use memory
blocks larger than 64k right now (actually a little less), which is
the SHM tile size. I probably should export that value in libpulse in
some way, so that the clients can optimize for it, and pass blocks of
size MIN(pa_stream_writable_size(), pa_context_get_tile_size()) or
so.

I'll add that in the next release. And I think that block size would
be a good value to optimize the writes for. Unless one starts counting
CPU cycles, finding the perfect block size is not possible anyway.
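
As a rough sketch of the block-size choice described above, again in
plain C without libpulse: the tile size is passed in as an ordinary
parameter because it is not exported yet, and the writable size would
come from pa_stream_writable_size() in a real client. The helper names
tile_block_size() and writes_needed() are hypothetical, and rounding
down to whole decoded frames is an assumption of this example, not
something PA requires.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: MIN(writable, tile_size), rounded down to a
 * whole number of decoded frames. Returns 0 if not even one frame
 * fits, in which case the caller should wait for more space. */
size_t tile_block_size(size_t writable, size_t tile_size,
                       size_t frame_bytes)
{
    assert(frame_bytes > 0);

    size_t n = writable < tile_size ? writable : tile_size;
    return n - n % frame_bytes;
}

/* Number of writes needed to satisfy a request of `request` bytes
 * when each write carries at most `block` bytes. */
size_t writes_needed(size_t request, size_t block)
{
    assert(block > 0);
    return (request + block - 1) / block;
}
```

For the numbers from the original mail (a 100k request and 4608-byte
MP3 frames), writing one frame at a time takes 22 writes, while blocks
capped at an assumed 64k tile carry 14 frames (64512 bytes) each and
need only 2 writes.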

> I have snippets of code as a proof of concept. I don't mind releasing
> the code, but I must admit this is a hack and does not cover all the
> cases pulsesink addresses. An additional optimization could consist in
> passing the PulseAudio buffer upstream to avoid memory copies. The new
> PA release provides support for this with pa_stream_begin_write(). In
> short, I would badly need a review from more experienced developers...
> If anyone is interested let me know.

In fact I added the _begin_write() stuff specifically for use in
GStreamer, after a talk the Gst folks and I had at last FOSDEM.

Lennart

-- 
Lennart Poettering                        Red Hat, Inc.
lennart [at] poettering [dot] net
http://0pointer.net/lennart/           GnuPG 0x1A015CC4