[gst-devel] How to decrease CPU consumation for audio recording?

Gruenke, Matt mgruenke at Tycoint.com
Thu Oct 7 19:56:33 CEST 2010


> > My claim was that GStreamer was bad for small buffers; the smaller,
> > the worse. That IMO is a fact.

 

Is it fair to say that this discussion has nothing to do with the
actual size of the buffers, but is really a matter of per-buffer
overhead?

 

 

> Felipe's diagrams are clearly showing that the degradation is O(e^n)

 

Actually, that's not clear to me, as his plot used a log-scaled x axis.
That's why I asked about plotting throughput vs. the number of elements
or queues.  Even using a linear x axis would be more enlightening.
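
For example, something along these lines (an untested sketch; the
buffer count and maximum queue count are arbitrary) would collect
throughput for an otherwise identical fakesrc ! fakesink pipeline with
0..N queues in between, which could then be plotted on a linear axis:

/* Untested sketch: time fakesrc ! [queue !]*n fakesink for n = 0..MAX_QUEUES
 * and print buffers per second for each queue count. */
#include <gst/gst.h>
#include <stdio.h>

#define NUM_BUFFERS 1000000
#define MAX_QUEUES  8

int main (int argc, char *argv[])
{
  gint n;

  gst_init (&argc, &argv);

  for (n = 0; n <= MAX_QUEUES; n++) {
    GString *desc = g_string_new (NULL);
    GstElement *pipeline;
    GstBus *bus;
    GstMessage *msg;
    GTimer *timer;
    gint i;

    /* Build "fakesrc ... ! queue ! ... ! fakesink" with n queues. */
    g_string_append_printf (desc, "fakesrc num-buffers=%d silent=true",
        NUM_BUFFERS);
    for (i = 0; i < n; i++)
      g_string_append (desc, " ! queue");
    g_string_append (desc, " ! fakesink silent=true");

    pipeline = gst_parse_launch (desc->str, NULL);
    bus = gst_element_get_bus (pipeline);

    timer = g_timer_new ();
    gst_element_set_state (pipeline, GST_STATE_PLAYING);

    /* Block until the pipeline posts EOS (or an error). */
    msg = gst_bus_timed_pop_filtered (bus, GST_CLOCK_TIME_NONE,
        GST_MESSAGE_EOS | GST_MESSAGE_ERROR);

    printf ("%d queue(s): %.0f buffers/s\n", n,
        NUM_BUFFERS / g_timer_elapsed (timer, NULL));

    gst_message_unref (msg);
    gst_element_set_state (pipeline, GST_STATE_NULL);
    gst_object_unref (bus);
    gst_object_unref (pipeline);
    g_timer_free (timer);
    g_string_free (desc, TRUE);
  }
  return 0;
}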

 

 

I also agree with Wim that the effects of the queue are exaggerated in a
trivial pipeline on an idle system.  In higher-load situations, you
would tend to have fewer context switches, which are probably the
largest cost.

 

I think a lockless queue wouldn't help with this scenario, since you'd
still want to wake up a consumer that's waiting on an empty queue (which
requires a lock + condition variable).  Where lockless helps is to scale
throughput in higher load scenarios.

 

If you could afford some latency, then perhaps batching could be
implemented by having the consumer block until the queue either reaches
some watermark or a timeout expires.  When either of these conditions is
met, the consumer empties out the queue and goes back to waiting.
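
For what it's worth, here's a minimal sketch of that idea with a plain
pthread mutex and condition variable (not GStreamer code; the queue
type, watermark, and timeout values are made up for illustration).  The
producer still takes the lock per push, but only signals the condition
at the watermark, and the consumer detaches the whole list in one go:

#include <pthread.h>
#include <time.h>

#define WATERMARK        8    /* drain once this many buffers are queued */
#define BATCH_TIMEOUT_MS 20   /* ...or after at most this much latency   */

typedef struct item { struct item *next; void *data; } item;

typedef struct {
  pthread_mutex_t lock;
  pthread_cond_t  cond;
  item           *head, *tail;
  unsigned        count;
} batch_queue;

/* Producer: one lock per buffer, but only one wakeup per batch. */
static void bq_push (batch_queue *q, item *it)
{
  pthread_mutex_lock (&q->lock);
  it->next = NULL;
  if (q->tail) q->tail->next = it; else q->head = it;
  q->tail = it;
  if (++q->count == WATERMARK)
    pthread_cond_signal (&q->cond);   /* only wake at the watermark */
  pthread_mutex_unlock (&q->lock);
}

/* Consumer: sleep until the watermark is reached or the timeout fires,
 * then take the whole list and process it outside the lock. */
static item *bq_pop_batch (batch_queue *q)
{
  struct timespec deadline;
  item *batch;

  clock_gettime (CLOCK_REALTIME, &deadline);
  deadline.tv_nsec += (long) BATCH_TIMEOUT_MS * 1000000L;
  deadline.tv_sec  += deadline.tv_nsec / 1000000000L;
  deadline.tv_nsec %= 1000000000L;

  pthread_mutex_lock (&q->lock);
  while (q->count < WATERMARK) {
    if (pthread_cond_timedwait (&q->cond, &q->lock, &deadline) != 0)
      break;                          /* timeout: drain whatever is there */
  }
  batch = q->head;                    /* detach the whole list */
  q->head = q->tail = NULL;
  q->count = 0;
  pthread_mutex_unlock (&q->lock);
  return batch;                       /* may be NULL if nothing arrived */
}

The timeout bounds the added latency, and the per-buffer cost drops to
roughly one lock/unlock on the producer side plus one wakeup per batch.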

 

 

Matt

 

 

________________________________

From: Marco Ballesio [mailto:gibrovacco at gmail.com] 
Sent: Thursday, October 07, 2010 13:21
To: Discussion of the development of GStreamer
Subject: Re: [gst-devel] How to decrease CPU consumation for audio recording?

 

Hi,

On Thu, Oct 7, 2010 at 6:56 PM, Felipe Contreras
<felipe.contreras at gmail.com> wrote:

..snip..

My claim was that GStreamer was bad for small buffers; the smaller, the
worse. That IMO is a fact. Now, how small, and how bad GStreamer is,
depends on your system; my guess was that ARM was especially bad
compared to x86. I think the numbers show that.


In the countless times I've profiled VoIP (and video) calls on ARM I
found a perfect match with Felipe's findings: the smaller the buffers,
the higher the overhead on the system. In the pipelines of
telepathy-stream-engine, where IMHO there are plenty of unneeded
elements (for instance, we don't need to resample/convert the audio
buffers, but there are always at least two audio converters and one
resampler), the change in CPU load between 60 ms and 20 ms
packetisation is about 20% (try it with Skype to believe it), mostly
located in the kernel, but also inside udpsink/udpsrc and rtpbin. Maybe
I could add a few diagrams to Felipe's once I retrieve my data, but I
have some interesting considerations in the meanwhile..

Now, in a perfect world the overhead generated by GStreamer when
handling audio data should be O(n) wrt the amount of data, and O(1) wrt
its packetisation. Since we know that (de)payloading is an expensive
operation, I could still understand an algorithm that degrades as O(n)
in the number of buffers, but Felipe's diagrams are clearly showing
that the degradation is O(e^n), which grows faster than any polynomial
function and, as they teach at university, is bad (and Felipe's
diagrams have neither payloaders nor rtp elements).

	
	My "overly dramatic" graphs show the raw data for the most minimal
	example I could find, so it doesn't matter what you do, you'll get
	_at least_ that performance hit. On real use-cases (in the graph
	after 2^7), IMO the performance loss is already bad, but you have to
	multiply that by the amount of different elements and thread
	contexts that are used.


Just to confirm this, I'd like to publish a typical stream-engine audio
pipeline and the CPU growth with different packetisations. Again, I
hope to be able to take a few pictures from the laptop @ work.

As it appears that most of the CPU growth is in the kernel (which
doesn't seem to happen on x86), I believe something weird is going on
with fast futexes on ARM. That is: the fewer the mutexes, the less
exponential the CPU growth.
 

	However, the empirical experience is already there, ask anyone at
	Nokia, I just wanted to show raw numbers.

	 


:)
 

	> It sounds like when you mean size, you really mean duration and
	> thus the amount of buffers per second.
	>
	> GStreamer is not designed to pass around 1 sample per buffer (that
	> would be typically 48000 buffers per second), you can do it but it
	> will incur a higher overhead that increases with the amount of
	> elements in the pipeline.


See my comments above: do you really think O(e^n) is a reasonable
growth rate?
 

	>
	> GStreamer is however designed for more realistic buffer durations
	> of 10ms (that's 100 buffers per second). The overhead that
	> GStreamer causes in these types of pipelines depends on a lot of
	> things, but in well designed pipelines you typically see overhead
	> values of around 1% or less (callgrind and kcachegrind are good
	> tools to measure this).


The growth Felipe is showing happens with stream-engine pipelines as
well, and a similar one has been measured with much simpler pipelines,
like the examples at:

http://www.gstreamer.net/data/doc/gstreamer/head/gst-plugins-good-plugins/html/gst-plugins-good-plugins-gstrtpbin.html

modified for audio-only and e.g. using G.711 a-law. You can even test
it with G.729 on any architecture now ;)
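
For reference, a stripped-down sender along those lines could look like
the sketch below (untested; it uses audiotestsrc instead of a real
capture source and leaves out gstrtpbin/RTCP from the linked example,
so take the element choices and properties as assumptions rather than
the actual stream-engine pipeline):

/* Rough sketch of an audio-only G.711 a-law RTP sender via gst_parse_launch().
 * samplesperbuffer controls the packetisation: 160 samples at 8 kHz is 20 ms
 * per buffer, 480 samples is 60 ms. */
#include <gst/gst.h>

int main (int argc, char *argv[])
{
  GstElement *pipeline;
  GError *error = NULL;

  gst_init (&argc, &argv);

  pipeline = gst_parse_launch (
      "audiotestsrc is-live=true samplesperbuffer=160 ! "
      "audio/x-raw-int,rate=8000,channels=1 ! "
      "alawenc ! rtppcmapay ! udpsink host=127.0.0.1 port=5002",
      &error);
  if (pipeline == NULL) {
    g_printerr ("Failed to build pipeline: %s\n", error->message);
    return 1;
  }

  gst_element_set_state (pipeline, GST_STATE_PLAYING);

  /* Run until interrupted; compare CPU load for 20 ms vs 60 ms buffers. */
  g_main_loop_run (g_main_loop_new (NULL, FALSE));
  return 0;
}

Running it under top or a profiler with samplesperbuffer=160 and then
480 should show (or refute) the 20 ms vs 60 ms difference I mentioned
above.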

..snip..

	>
	> As a datapoint: On my desktop I can push around 700000 buffers per
	> second, and that's then using 100% CPU (and also 100% gstreamer
	> overhead). (gst-launch fakesrc num-buffers=7000000 silent=1 !
	> fakesink silent=1 takes about 10 seconds).


It appears ARM is not as well optimised as x86 wrt fast futexes (no
references here :\, I have to dig more..), meaning that GStreamer is
not well optimised for that architecture. It would be interesting to
propose an alternative to bare mutexes for handling read/write
contention.
 

	On my laptop:
	% gst-launch fakesrc num-buffers=7000000 silent=1 ! fakesink silent=1
	22s

	% gst-launch fakesrc num-buffers=7000000 silent=1 ! queue ! fakesink silent=1
	45s

	On my N900:
	% gst-launch-0.10 fakesrc num-buffers=7000000 silent=1 ! fakesink silent=1
	4m 26s

	% gst-launch-0.10 fakesrc num-buffers=7000000 silent=1 ! queue ! fakesink silent=1
	16m 11s


This is more or less an experimental confirmation of my statements above
on ARM vs x86.

Regards
 

	
	Cheers.
	
	--
	Felipe Contreras

	
	

 
