[gst-devel] JACK and GStreamer, from the horse's mouth

Lennart Poettering mztfg at 0pointer.de
Tue Nov 28 22:14:49 CET 2006


On Mon, 27.11.06 09:51, Paul Davis (paul at linuxaudiosystems.com) wrote:

Hi!

Just my 2¢ on this discussion, as the maintainer of PulseAudio:

(I am not a member of this ML, so please keep me in CC)

> > If I were
> > just writing my little app, I'd not even know what that means and would
> > have given up by then.
> 
> if you were writing your little app using an API *like* JACK (no format
> negotiation, no rate negotation, no API negotiation), you wouldn't have
> to know what it meant, and you would already be done :)

While a JACK-like API makes sense for the applications JACK was
designed for (i.e. "pro" audio and audio processing), it definitely
does not make sense as a general-purpose API for all use cases.

Please remember that not all machines running Linux and GStreamer have
an FPU. And even if they do, floating point might be quite a bit slower
than integer processing on the CPU. Just think of the Nokia 770.
Requiring conversion from and to FP for every sample played is not an
option on these machines. The same is true for sample rate handling.
Since sample rate conversions are quite computationally intensive, this
might hurt even more. Media files come in different formats, with
different sample rates. At some point a conversion has to happen.
Pushing that conversion job unconditionally into the client seems a
little too simplistic in my eyes.

> > > moreover, these open/read/write/close apps cannot be ported to any API
> > > that uses a callback model: i.e. every API under the sun except the
> > > linux hacks based on an open/read/write/close model.
> > >
> > > if open/read/write/close is so wonderful (and i have to admit to being
> > > quite a defender of it in the past), why isn't it used for video?
> > >
> > I don't care a single bit if it's pulling, pushing, sneezing or otherwise
> > doing stuff, as long as it's dead simple to use. Let me repeat that: It
> > doesn't matter how perfect your application is, unless getting started is
> > really simple.
> 
> my point is that this claim isn't true for any video or graphics API, at
> least no more so than ALSA, and arguably much less so. there is no
> open/read/write/close API for interacting with h/w video adapters
> because they fundamentally don't work - the device has various
> characteristics that most programmers want abstracted away.
> Consequently, people use APIs like Qt, GTK, Cocoa and so forth. When
> they want to draw, they use APIs like Cairo or GDI or OpenGL. None of
> these APIs are remotely simple to use but this doesn't seem to have
> stopped the development of application software that needs to interact
> with h/w video adapters.

The pull model for audio playback creates as many problems as it
solves. It makes some things easier, but others more
complicated. Since it more or less requires threading, it also requires
some sort of synchronisation. Handling that synchronisation so that it
doesn't hurt in low-latency situations, and so that all its effects are
easy to understand, is difficult.

Video4Linux is an open/ioctl/read/write/close-API for video, btw.

> it appears to me that because sound cards spent a long time stuck in an
> amazingly simple world - 1 data format (16 bit interleaved stereo at
> 44.1 or 48 kHz) and one h/w API (the SB16) - lots of people think it
> that playing back audio should always be simple. by contrast, the APIs
> that make interacting with the video adapter simple have been evolving
> over a *long* period of time and are still evolving (e.g. Cairo), and
> nobody expects that there is some 3 line method to blit a PNG to a
> specific part of the screen. but people continue to expect that even if
> the user has 512 channels of ethernet-based I/O as their audio
> interface, the same 3 lines that work for some consumer-level onboard
> chipset should somehow still make a "system beep".

Forcing everyone into one sample format has many more drawbacks,
btw. Just think of AC3 pass-through/SPDIF. For these outputs, movie
audio tracks should not be touched when passed through the audio
layer. Compressing/uncompressing/resampling them would increase
CPU/FPU load immensely and have a negative effect on quality. In
addition, this would not exactly be the "light-weight" solution that
makes people feel warm.

> > >         b) forces all participants to use a single data format which
> > > happens to be the format used by 99% of all audio programmers in the
> > > world (32 bit floating point). the 1% that don't are generally writing
> > > consumer/desktop multimedia on linux. the rest - all plugin developers,
> > > all DAW developers and the rest - have realized that the benefits of
> > > using 32 bit float outweighs any inconvenience. and by removing format
> > > negotiation from graph entry, applications are actually much easier to
> > > write.
> > >
> > Surprisingly, neither my CD player, my MP3 decoder, nor my sound cards
> > seem to have had contact with audio programmers. Or maybe you live in a
> > different world than I do.
> 
> neither your CD players nor your sound cards run general purpose audio
> software. Go take a look at Windows and MacOS and OS X, where 90+% of
> the world live (for better or for worse), and where almost all the cool
> (media) app development has been taking place for the last 10-15 years.
> The overwhelming majority of that stuff is done using APIs where
> callbacks and a single data format are the norm, not the exception.
> DirectX, WDM, ASIO, are just a few of the windows acronyms for this. To
> be fair, DirectX does include format negotiation in a style somewhat
> similar to GStreamer, and while its appreciated by desktop developers on
> windows for the same reason that gstreamer looks cool, the pro audio
> developers prefer to avoid it.

It strikes me as rather gutsy to claim that the Windows API is an
example of good API design.

On Unix, programs have traditionally been single-threaded and organised
around a single poll()/select() event loop. In contrast, on Windows I/O
multiplexing is usually done with threads. Hence I would argue that
those threaded, pull-based APIs are more a remnant of where they came
from than a feature of good API design.

Please understand that I don't think that pull-based/threaded APIs are
necessarily a bad idea. Quite the contrary: for latency reasons you
need to push audio processing into a separate thread instead of running
it in the normal main loop. However, I would leave control of this to
the programmer rather than the API designer.

Please remember that the traditional Unix
open/ioctl/read/write/select/close way of doing things is more powerful
than the pull model. Why? Because you can easily implement the pull
model on top of the Unix model, just by running the following trivial
function as a thread:

pull_thread() {
    for (;;) {
        /* ask the application's callback for the next chunk of audio ... */
        audio_data = call_pull_function(&length);
        /* ... and push it to the device like any other data */
        write(audio_dsp, audio_data, length);
    }
}
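
Spawning it is then a single call - a sketch, assuming pull_thread is
given the usual void *(*)(void *) pthread signature:

pthread_t tid;
pthread_create(&tid, NULL, pull_thread, NULL);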

I've been thinking about audio APIs quite a lot over the last few
months. I've looked at quite a lot of different APIs, from different
sound servers, hardware interfaces and operating systems. I didn't find
a single one that would cover all the use cases I had on my list - and
still be easy to use.

In PulseAudio we currently have an API which tries to find a middle
ground between the pull model and the open/read/write/close model, one
that blurs the lines between both models. While this is a very, very
powerful API, it is also a very complicated one. And in my eyes it is
too complicated and bulky. That's why we currently do not recommend
that anyone make use of our API. I am a little uncertain how to pursue
the quest for a stable API for PulseAudio. Right now my position is
this: there is no perfect API, just a bunch of accepted APIs we can be
compatible with. Right now those are ALSA, OSS and maybe libao.

OSS is a dead simple API, and it is the only one that is both widely
accepted *and* cross-platform. It is here to stay. It's the least
common denominator of Unix audio APIs. However, it has a few drawbacks.
Firstly, it's very difficult to emulate, because it is a kernel API.
(Monty from Red Hat is now working on FUSD, a way to emulate character
devices in userspace, which should finally clean this up.) Secondly, it
doesn't support all the new features that pro audio cards offer.
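
Just to illustrate what I mean by "dead simple", this is roughly all it
takes to play something via OSS (a sketch, error handling omitted):

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/soundcard.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/dsp", O_WRONLY);
    int fmt = AFMT_S16_LE, channels = 2, rate = 44100;
    short buf[1024];

    /* three ioctls and the device is set up */
    ioctl(fd, SNDCTL_DSP_SETFMT, &fmt);
    ioctl(fd, SNDCTL_DSP_CHANNELS, &channels);
    ioctl(fd, SNDCTL_DSP_SPEED, &rate);

    /* write one buffer of silence; a real app would loop here */
    memset(buf, 0, sizeof(buf));
    write(fd, buf, sizeof(buf));

    close(fd);
    return 0;
}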

The ALSA API is in a way awful because it is complicated and a little
bit too verbose. On the other hand, it is very powerful and supports
almost everything you could think of when doing audio.
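
To give an idea of the verbosity: this is roughly the hw_params dance
you go through before you can write a single sample (a sketch, error
handling omitted):

#include <alsa/asoundlib.h>

int main(void) {
    snd_pcm_t *pcm;
    snd_pcm_hw_params_t *hw;
    unsigned int rate = 44100;

    snd_pcm_open(&pcm, "default", SND_PCM_STREAM_PLAYBACK, 0);
    snd_pcm_hw_params_malloc(&hw);
    snd_pcm_hw_params_any(pcm, hw);
    snd_pcm_hw_params_set_access(pcm, hw, SND_PCM_ACCESS_RW_INTERLEAVED);
    snd_pcm_hw_params_set_format(pcm, hw, SND_PCM_FORMAT_S16_LE);
    snd_pcm_hw_params_set_channels(pcm, hw, 2);
    snd_pcm_hw_params_set_rate_near(pcm, hw, &rate, 0);
    snd_pcm_hw_params(pcm, hw);
    snd_pcm_hw_params_free(hw);
    snd_pcm_prepare(pcm);

    /* only now can snd_pcm_writei(pcm, buf, frames) be called in a loop,
       followed by snd_pcm_drain(pcm) and snd_pcm_close(pcm) */
    return 0;
}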

> > > so presumably you think that CoreAudio, whose API is twice the size of
> > > JACK's and involves massively more complexity, also sucks?
> > >
> > Considering I wasn't able to find a 10-line code file to output an audio
> > stream with CoreAudio in my 5 minutes of googling, I'd say that it does
> > indeed suck.
> > Btw, for ALSA the first link was this:
> > http://equalarea.com/paul/alsa-audio.html#playex
> > which, apart from the huge lot of setup commands, is reasonably short to
> > get started.
> > 
> > 
> > I'm not trying to tell you that there's any technical reason why JACK is
> > bad.
> > The only thing that sucks about JACK is that getting started is extremely
> > complicated compared to lots of other sound outputs. 
> 
> the canonical example app for JACK was, against my wishes, "corrupted"
> to include transport control, which is very regrettable. the older
> version, and its even simpler twin, jack_monitor, which copies its
> inputs to its outputs, is half the size of the ALSA example above. 

Just for the record: *I* do like the JACK API. It's simple and easy to
use. It's very nice for audio production. However, it's not a good
general-purpose audio API. Far from it.
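
For those who haven't seen it, a complete (if useless) JACK client is
roughly this - a sketch; error handling and connecting the port to a
destination are omitted, and the client name is arbitrary:

#include <string.h>
#include <unistd.h>
#include <jack/jack.h>

static jack_port_t *out_port;

/* the pull callback: JACK asks us for exactly nframes samples */
static int process(jack_nframes_t nframes, void *arg) {
    jack_default_audio_sample_t *out = jack_port_get_buffer(out_port, nframes);
    memset(out, 0, nframes * sizeof(*out));  /* silence */
    return 0;
}

int main(void) {
    jack_client_t *client = jack_client_new("example");
    out_port = jack_port_register(client, "out", JACK_DEFAULT_AUDIO_TYPE,
                                  JackPortIsOutput, 0);
    jack_set_process_callback(client, process, NULL);
    jack_activate(client);
    for (;;)
        sleep(1);  /* all the audio work happens in the callback thread */
    return 0;
}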

> > And as long as that
> > is the case, noone will take JACK as the first choice for audio output
> > unless they write their app for JACK.
> 
> i've said over and over that i don't want desktop media to write for
> JACK. i've been at DAM meetings telling people that gstreamer and/or
> pulseaudio are much better ideas. but as i've also said, i want people
> to use a callback ("pull") style API because it is the *only* model that
> can unify the pro and desktop media worlds. and yes, that involves a
> little bit more background - you have to understand what a callback is.
> this doesn't seem to have impeded dozens or even hundreds of mostly very
> inexperienced OS X freeware developers from cooking up neat little apps,
> and i don't see why it should be a burden on linux.

I am sorry, but I don't follow you on this. As I showed above, the
pull model can easily be implemented on top of the traditional Unix way
of doing things. Claiming that the pull model is the "only" model that
can marry the pro and desktop media worlds is simply nonsense.

Instead, I would argue that the pull model actually has negative
effects. It is very difficult to marry a system like GStreamer (which
already does its own threading) with the way threading is handled by a
pull API. Why? Because you need to pass the audio data between the
GStreamer-handled thread and the API-handled thread. That passing
requires buffering, and thread-safe buffering adds latency and
complexity. (And remember that systems like JACK stress the fact that
their buffers are "lock-free", a feature which is not worth much if you
need to add more buffering and synchronisation because of the pull
model.) Both are easy to avoid if the API is Unix-style with
open/read/write/select/close.
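
To make that concrete, here is a sketch of the glue you typically end
up writing. The helper names are hypothetical; jack_ringbuffer_* is
JACK's own lock-free FIFO:

#include <string.h>
#include <jack/ringbuffer.h>

/* the extra buffering between the two threads, created at setup
   time with jack_ringbuffer_create() */
static jack_ringbuffer_t *rb;

/* GStreamer-side thread: push decoded audio into the ring buffer */
void push_from_gst(const void *data, size_t bytes) {
    jack_ringbuffer_write(rb, data, bytes);
}

/* API-side pull callback: drain the ring buffer; if the other thread
   hasn't produced enough data yet, pad with silence (i.e. a dropout) */
void pull_callback(void *out, size_t bytes) {
    size_t n = jack_ringbuffer_read(rb, out, bytes);
    if (n < bytes)
        memset((char *) out + n, 0, bytes - n);
}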

It's not clear to me what the OSDL is trying to achieve with the DAM
multimedia meetings. Is the plan to define a new abstracted sound API?
That would be a very difficult thing - and a very questionable one as
well - since there are so many APIs around already. I wonder who the
intended target group for this new API is supposed to be. Pro-audio
people? Game developers? Networked-audio people? "Desktop" application
programmers? All of them?

Anyway, I can't afford to attend the DAM conference, and I haven't
been invited. (Which actually surprises me, since PulseAudio is slowly
becoming the standard audio system on Linux, given that both Ubuntu and
Red Hat seem to support it nowadays.) I'd have a lot to say about audio
APIs and standardizing on them, though. But maybe gstreamer-devel is
not the proper place for this.

> > PS: The pull model used by non-blocking file descriptors is perfectly
> > fine. ESD and OSS support that model.
> 
> but they do not require it, and because they support
> open/read/write/close and because this perceived to be easier by so many
> *nix developers, we end up with two worlds - apps written around
> pull-requiring APIs and apps based on a push model. 

I wonder what the problem with that is in your eyes. Some apps do it
this way, others do it differently. So?

Lennart

-- 
Lennart Poettering; 0poetter [at] informatik [dot] uni-hamburg [dot] de
ICQ# 11060553; GPG 0x83E23BF0; http://www.stud.uni-hamburg.de/users/lennart/



