[pulseaudio-discuss] Re : Effects of Clock Resolution on Pulseaudio
rextanka at comcast.net
Wed Jul 30 15:00:30 PDT 2008
On Jul 30, 2008, at 12:08 PM, Lennart Poettering wrote:
> On Tue, 22.07.08 17:13, Nick Thompson (rextanka at comcast.net) wrote:
>>>> So I am not sure what part of Pulseaudio is causing high CPU
>>> Hmm, could you do some profiling then? Just the most basic, i.e.
>>> which functions take up most CPU.
>> I'll add more detail in a bit when we get to 0.9.10, h
>> seems to be a big hitter on arm systems:
> Hmm, that function is not optimized in any way, but if I look at its
> source it doesn't appear that slow to me either. For each sample we do
> one multiplication and one shift, we apply saturation, and then we
> increase/decrease pointers with wrap around. That shouldn't be that
> bad. Also, this code goes once linearly through all samples, which
> minimizes the influence of the cache.
Yeah, the problem seems to be that ARM has a limited number of
registers and gcc does not deal with monolithic code that well, whereas
x86 has no issues dealing with a large case statement with a number of
loops in it. A look at the gcc output showed a number of load
instructions in the loop, which are very expensive on ARM (3 cycles).
A co-worker has been working on some ARM assembly and factoring the
loops out into separate functions, and the net result is that we see
4-6% total, which is much better. We need to look at the mix and rate
convert cases too; the mix is more complicated, but we should see
something similar there.
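
To illustrate the "factor the loops out" idea, here is a minimal sketch
of a dedicated S16NE volume loop with the per-sample steps described
above (multiply, shift, saturate, wrap the channel index). It is not
the PulseAudio source or our patch: the function names and the 16.16
fixed-point volume format are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: saturate a 32-bit value to the int16_t range. */
static inline int16_t clamp_s16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t) v;
}

/*
 * Hypothetical stand-alone loop for native-endian signed 16-bit samples.
 * linear[] holds one volume factor per channel in 16.16 fixed point
 * (0x10000 == 1.0); this format is an assumption for the sketch.
 * Keeping one sample format per function gives the compiler a small
 * loop body with few live values, instead of one big switch.
 */
void volume_s16ne(int16_t *samples, size_t n,
                  const int32_t *linear, unsigned channels) {
    unsigned ch = 0;

    for (size_t i = 0; i < n; i++) {
        /* One multiply and one shift (arithmetic shift assumed, as with gcc). */
        int64_t t = ((int64_t) samples[i] * linear[ch]) >> 16;

        /* Clamp in 64 bits first so very large factors cannot overflow. */
        if (t > INT16_MAX) t = INT16_MAX;
        if (t < INT16_MIN) t = INT16_MIN;
        samples[i] = clamp_s16((int32_t) t);

        /* Advance the channel index with wrap around. */
        if (++ch >= channels)
            ch = 0;
    }
}
```

A caller would then dispatch once per buffer on the sample format and
call the matching loop, rather than switching inside the loop itself.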
Also I want to look at Kevin's emails and see if we can build on that,
it would be good to get that working on a couple systems.
Patches will be forthcoming for this; however, we are still on 0.9.8. I
am hoping we'll have made the move to 0.9.10 this week, so I hope we
can send you stuff in a couple of weeks once we are happy they are well
tested. The patches will be for 0.9.10, so help in getting that merged
with the latest would be handy, though I'd suspect these routines are
not so much in flux at the moment?
> I assume the data processed is S16NE and the CPU is LE?
Yup. That's the path we started with the optimization.
> Hmm, can you figure out in which context this is called that often? (I
> mean, pa's audio memory management should be mostly zero-copy, so
> having such a big hit on memcpy here is surprising to me.)
>>> 277 3.5951 libm-2.5.so __adddf3
> This is interesting, could you figure out the context?
Yup, I need to patch our kernel again to get call trace with oprofile
on my device, so hopefully I can find some time to get context for this.
With regards to the vectorization stuff, that can be used, although it
would make the ARM code very specific to a certain subset of ARM
implementations. It raises a philosophical question: I'd suspect a
generic ARM implementation is a better open source solution. Having
optimized cases for Cortex-A8/NEON processors would be useful, but it
would add to the build complexity and could be frustrating for someone
with a different ARM processor. I'm not sure I understand open source
well enough to decide this. That being said, we'll probably optimize
for our case, and any patches will likely be somewhat system
dependent. Worst case, it gives someone else something to build on, I
guess.
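
One way the generic-versus-NEON trade-off could be handled is to keep a
portable loop and compile in a NEON variant only when the toolchain
targets it. The sketch below is a rough illustration under several
assumptions, not the planned patch: the function names are made up,
the gain is a non-negative Q15 value (attenuation only), and the
`__ARM_NEON` macro is the one current compilers define (older gcc
versions use `__ARM_NEON__`).

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Portable fallback. gain_q15 is assumed to be in 0..32767 (Q15), so
 * the result always fits in int16_t and no saturation is needed. */
static void scale_s16_generic(int16_t *s, size_t n, int16_t gain_q15) {
    for (size_t i = 0; i < n; i++)
        s[i] = (int16_t) (((int32_t) s[i] * gain_q15) >> 15);
}

#if defined(__ARM_NEON)
/* NEON variant: four samples per iteration, same arithmetic. */
static void scale_s16_neon(int16_t *s, size_t n, int16_t gain_q15) {
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        int16x4_t v = vld1_s16(s + i);          /* load 4 samples        */
        int32x4_t p = vmull_n_s16(v, gain_q15); /* widening multiply     */
        vst1_s16(s + i, vshrn_n_s32(p, 15));    /* shift, narrow, store  */
    }

    /* Leftover samples go through the portable loop. */
    scale_s16_generic(s + i, n - i, gain_q15);
}
#endif

/* Single entry point; the choice is made at build time. */
void scale_s16(int16_t *s, size_t n, int16_t gain_q15) {
#if defined(__ARM_NEON)
    scale_s16_neon(s, n, gain_q15);
#else
    scale_s16_generic(s, n, gain_q15);
#endif
}
```

The cost is the one described above: an extra code path and build
configuration to maintain, while other ARM processors only ever get
the generic loop.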