[gst-devel] How co-routines affect things (long)

Thu Mar 23 10:39:52 CET 2000

This message is intended to get everyone up to speed on the co-routines as
they will be put into CVS in the near future.  There are a lot of benefits
and things that are now possible due to the addition of these things, but
I've written up a lot of the reasoning behind them as well.  This should
probably end up in the developer documentation at some point.  It's long,
but I think it's required reading for anyone hacking the core code
(gstreamer/gst/*.[ch]).

-----
First, a quick explanation of The Way Things Were(tm):

To actually run the pipeline, one would call an _iterate() function,
either provided by the bin or thread.  More specifically, the thread would
call its own _iterate() in a loop once set RUNNING && PLAYING.  During
plan generation it would have discovered all elements that satisfy one of
the following criteria:

a) GstSrc-derived
b) has a sink pad that's wired to an element outside the bin (i.e. queue)
   (in this case the element recorded is the one outside, i.e. the queue)

These are called 'entries' into the Bin, and are the scheduling drivers.
Of course, behaviour is strictly undefined if there is more than one of
these, and that's why things have changed...

The _iterate() function would then go through the list of entries and call
their respective push() or equivalent function, meaning either
gst_src_push() or gst_connection_push() (remember, the outside element is
the entry in that case), in either case causing the push() function of the
element to fire.  In said push() function some work is done (read from
disk, pull from queue, whatever) and a gst_pad_push() is called.

At this point the chain() function for the peered pad is called, which has
the effect of calling the functional code for the next element in the
pipeline, which just happens to create the 'pipeline' effect, hence the
point of this whole description ;-)
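
A minimal sketch of that old call chain, for anyone who hasn't traced it.
The Element struct, pad_push(), bin_iterate() and run_old_model() are
made-up stand-ins, not the real GStreamer structures; the point is only
that the "push" is a plain function call into the peer's chain(), so one
_iterate() runs the whole pipeline on a single stack.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an element; not the real GStreamer type. */
typedef struct Element Element;
struct Element {
    Element *peer;                          /* next element downstream */
    void (*chain)(Element *self, int buf);  /* runs when a buffer arrives */
    void (*push)(Element *self);            /* only set on entry elements */
    int state;                              /* per-element scratch data */
};

/* Stand-in for gst_pad_push(): a direct call into the peer's chain(). */
static void pad_push(Element *from, int buf)
{
    assert(from->peer != NULL && from->peer->chain != NULL);
    from->peer->chain(from->peer, buf);
}

/* Entry element: produces one buffer (a counter value) per push(). */
static void counter_src_push(Element *self)
{
    pad_push(self, self->state++);
}

/* Filter element: doubles the buffer and hands it on. */
static void doubler_chain(Element *self, int buf)
{
    pad_push(self, buf * 2);
}

/* Sink element: remembers the last buffer it saw. */
static void sink_chain(Element *self, int buf)
{
    self->state = buf;
}

/* Stand-in for the old _iterate(): fire every entry once. */
static void bin_iterate(Element **entries, size_t n_entries)
{
    for (size_t i = 0; i < n_entries; i++)
        entries[i]->push(entries[i]);
}

/* Builds src -> doubler -> sink, iterates, returns the sink's last buffer
 * (-1 if nothing ever arrived). */
int run_old_model(int iterations)
{
    Element sink = { NULL, sink_chain, NULL, -1 };
    Element dbl = { &sink, doubler_chain, NULL, 0 };
    Element src = { &dbl, NULL, counter_src_push, 0 };
    Element *entries[] = { &src };

    for (int i = 0; i < iterations; i++)
        bin_iterate(entries, 1);
    return sink.state;
}
```

Every element here has exactly one input and is happy to be called with
whatever buffer shows up, which is the assumption that breaks down in the
cases below.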

Now, consider the multitude of cases where that whole model just doesn't
cut it:

1) elements that can't respond to chain() calls but must pull their data
   (like a real bitstream-based element)
2) elements with more than one input or output

It turns out that these are more common than you might want to think.  The
whole OGI pipeline structure is built around the loop model, where the
life of an element is while(1) {pull;process;push}.  The various
system-stream parsers have the problem that a single push can set off
large amounts of work serially if stuff is hooked to them directly.

Obviously, you can solve this by putting thread boundaries between such
misbehaved elements, at least in the mux/demux case.  But there are
problems with even that, as we've found at OGI.  Co-routines were put in
the OGI pipeline to deal with the problem that several of the elements
were barely touching the CPU, and thus causing the huge OS-level overhead
of switching these things constantly.  Merging them into a single
schedulable entity solves the problem, because typically there's a larger
element close by to group it with, so scheduling overhead reduces
drastically.

So, the solution is co-routines (also called cothreads, same thing).
First, co-routines are a simple user-space method for switching between
subtasks.  They're based on setjmp()/longjmp() in their current form,
though I've heard that there are other (more machine-specific) methods
that are faster.  Basically, setjmp() saves the current execution context
(stack pointer, PC, and so on) to a structure, and longjmp() switches back
to one of these saved contexts.  That means that you save the state of
your current context and promptly switch to some other context.  Like
fork(), setjmp() returns twice: you get returned to just after the
setjmp() when someone switches back to you, hence the fork()-style check
of its return value before longjmp()'ing again, unless you like infinite
loops ;-)
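
To make the fork()-style check concrete, here is a small sketch using
nothing but standard setjmp()/longjmp().  It only jumps back within one
function; a real cothread also needs its own stack per context, which
this does not attempt to show.  The function name setjmp_round_trip() is
made up for the example.

```c
#include <assert.h>
#include <setjmp.h>

/* Saves a context, jumps back to it once, and returns how many times
 * execution passed the setjmp() call site. */
int setjmp_round_trip(void)
{
    jmp_buf ctx;
    /* volatile: modified between setjmp() and longjmp(), read afterwards */
    volatile int passes = 0;

    /* setjmp() returns 0 when it saves the context and the nonzero value
     * given to longjmp() when control comes back.  Checking that value is
     * the fork()-style check. */
    if (setjmp(ctx) == 0) {
        passes++;           /* first pass: context was just saved */
        longjmp(ctx, 1);    /* jump back to the setjmp() above */
    } else {
        passes++;           /* second pass: we arrived via longjmp() */
    }

    assert(passes == 2);
    return passes;
}
```

Drop the check on the return value and the longjmp() fires on every pass,
which is the infinite loop mentioned above.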

As they are implemented in GStreamer, the whole of the work is done in the
Bin.  GstThread pretty much goes along for the ride by not overriding
things, which is the way things are supposed to be.  The Pads help a lot,
but are unaware of the actual mechanism.

The changes at the Pad level consist of a buffer pen and a function
pointer (two currently, but that'll be fixed).  Basically, there's a push
and a pull function (this also conflicts with backwards buffer passing,
but I'll figure something out) that's used for all transfer operations.

In the _push() case, the buffer is placed in the peer's pen and the push() 
function pointer is called.  This is assumed to do something that
transfers the buffer to the peer, one way or another.  The _pull() case is
reversed and conditional.  If there's a buffer in the pen, grab it.  If
not, call the pull() function pointer, then grab it.
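
Roughly, in code.  This is a sketch, not what is going into CVS: the Pad
struct, its field names and the two demo handlers are invented for
illustration, and it assumes the push handler called is the one on the
pushing pad.  In the real thing the handlers would switch cothreads
rather than just bump a counter or fill the pen directly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pad: a one-slot buffer pen plus two handler pointers. */
typedef struct Pad Pad;
struct Pad {
    Pad *peer;
    void *pen;                   /* holds at most one buffer */
    void (*pushfunc)(Pad *pad);  /* called after the peer's pen is filled */
    void (*pullfunc)(Pad *pad);  /* called when our own pen is empty */
};

/* _push(): put the buffer in the peer's pen, then call the handler. */
void pad_push(Pad *pad, void *buf)
{
    assert(pad->peer != NULL);
    pad->peer->pen = buf;
    if (pad->pushfunc != NULL)
        pad->pushfunc(pad);
}

/* _pull(): reversed and conditional.  Only call the handler if the pen is
 * empty, then take whatever is in the pen. */
void *pad_pull(Pad *pad)
{
    if (pad->pen == NULL && pad->pullfunc != NULL)
        pad->pullfunc(pad);

    void *buf = pad->pen;
    pad->pen = NULL;
    return buf;
}

/* Demo handlers (stand-ins for the cothread-switching ones). */
int push_calls = 0;
int demo_value = 7;

void demo_push_handler(Pad *pad)
{
    (void)pad;
    push_calls++;
}

void demo_pull_handler(Pad *pad)
{
    pad->pen = &demo_value;  /* pretend the peer ran and produced a buffer */
}
```

Note that pad_pull() never touches the handler when a buffer is already
waiting, which is what lets the same pads work whether the producer or
the consumer happens to run first.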

When we move up a level into the Bin, we first come to the plan
generation.  The first thing done is to create the global cothread
context, and then state for each of the elements.  All the functionality
is provided by a generalized library of sorts in cothread.[ch], which is
where all the work of making it portable must be done (different chip
architectures, changes in pthread guts, etc, will all render cothreads
unusable).  The push() and pull() handler functions (soon to be
merged into one, maybe switch()?) are also set for all the pads.

Then in the iterate function (which has been abstracted out so each Bin
subclass can provide its own), things actually get really simple.  All it
does is cothread_switch() to some arbitrary element's cothread state.
Currently it chooses the first on the list, but this can be modified later
to provide context-driver functionality (where _iterate() actually
terminates at some point, say when that driver starts running a second
time; there are reasons for this, albeit complicated).

So, the execution trace gets interesting, but it boils down to the simple
fact that any time a gst_pad_push() is done, the holding pen is filled and
the appropriate switch handler is called (push() currently), which in the
current implementation does nothing but do a cothread_switch() to the peer
element.  A gst_pad_pull() is similar except it switches whenever the
holding pen is empty and it wants to get a buffer.

The key that makes it work in both chain- and loop-function based
environments is the wrapper that actually runs the element.  When you
create a cothread state, you have to provide a function pointer as the
first bit of code to run when the cothread is actually created.  This
function is always provided by the Bin, and handles both cases.  In the
loop-function based case, it just calls the element's loop function.  A
neat trick is that it does so in a while(1) loop, so the element's 'loop
function' doesn't actually have to loop.  Not sure this is useful, but
it's 100% free in the usual case, so why not?

The chain-function based approach consists of a loop (while(1)) that runs
gst_pad_pull() and calls the chain() function for that pad.  Simple, eh?
Where it breaks is basically the same problem you find with pure
chain-function setups, and that's in elements with multiple inputs.  In
this scheme they're currently round-robined.  If they don't happen to want
to take inputs on a 1-for-1 basis while producing output regularly, one
leg or another is going to get lopped off, causing all kinds of scheduling
nightmares.  The solution in that case is: Don't Do It!  Use loop-function
based elements in any situation that comes even close to that.
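
Here is a runnable sketch of the two wrappers and the switch-on-push /
switch-on-empty-pen behaviour, for a two-element pipeline.  Big caveat:
it uses the POSIX ucontext calls (getcontext/makecontext/swapcontext)
instead of the setjmp()/longjmp()-based cothread.[ch], purely because
that is the shortest way to get separate stacks in a self-contained
example.  All the names (Cothread, run_pipeline(), and so on) are
invented, and the "switch back to the caller when the source runs dry"
rule is my assumption to make the example terminate.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600
#endif
#include <assert.h>
#include <ucontext.h>

enum { STACK_SIZE = 64 * 1024, N_BUFFERS = 3 };

/* Hypothetical cothread state: a saved context plus its own stack. */
typedef struct {
    ucontext_t ctx;
    char stack[STACK_SIZE];
} Cothread;

static Cothread main_thread, src_thread, sink_thread;
static Cothread *current = &main_thread;

static int pen, pen_full;   /* the sink pad's one-slot holding pen */
static int next_value;      /* source state */
int received[N_BUFFERS];    /* what the sink's chain function saw */
int n_received;

static void cothread_switch(Cothread *to)
{
    Cothread *from = current;
    current = to;
    swapcontext(&from->ctx, &to->ctx);
}

/* Push: fill the pen, then switch to the peer element. */
static void pad_push(int buf)
{
    pen = buf;
    pen_full = 1;
    cothread_switch(&sink_thread);
}

/* Pull: switch to the peer only while the pen is empty. */
static int pad_pull(void)
{
    while (!pen_full)
        cothread_switch(&src_thread);
    pen_full = 0;
    return pen;
}

/* Loop-function element: one pass of work, no loop of its own. */
static void src_loop(void)
{
    pad_push(10 * next_value++);
}

/* Chain-function element: handed one buffer per call. */
static void sink_chain(int buf)
{
    assert(n_received < N_BUFFERS);
    received[n_received++] = buf;
}

/* Wrapper for loop-based elements: calls the loop function forever. */
static void src_wrapper(void)
{
    for (;;) {
        if (next_value == N_BUFFERS)
            cothread_switch(&main_thread);  /* out of data: back to caller */
        else
            src_loop();
    }
}

/* Wrapper for chain-based elements: pull a buffer, call chain(), repeat. */
static void sink_wrapper(void)
{
    for (;;)
        sink_chain(pad_pull());
}

static void cothread_create(Cothread *t, void (*entry)(void))
{
    getcontext(&t->ctx);
    t->ctx.uc_stack.ss_sp = t->stack;
    t->ctx.uc_stack.ss_size = sizeof t->stack;
    t->ctx.uc_link = &main_thread.ctx;
    makecontext(&t->ctx, entry, 0);
}

/* "Plan generation" plus one iterate: switch to the first element. */
int run_pipeline(void)
{
    pen_full = 0;
    next_value = 0;
    n_received = 0;
    current = &main_thread;
    cothread_create(&src_thread, src_wrapper);
    cothread_create(&sink_thread, sink_wrapper);
    cothread_switch(&src_thread);
    return n_received;
}
```

The sink never knows it is being run from a wrapper, and the source never
knows its push() turned into a context switch.  With a second sink pad
the wrapper would have to pick which pad to pull next, which is exactly
where the round-robin trouble described above starts.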

This changes things in the sense that it's now not necessary to build
things like the mp3parse and ac3parse elements, since mpg123 and ac3dec
are bitstream-based elements and a good bitstream/getbits library should
be capable of pulling data on demand.  This way you just provide a
function that does a gst_pad_pull() and you're done.
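
As an illustration of what "pulling data on demand" could look like from
the getbits side, here is a sketch of a bit reader that calls a refill
callback whenever it runs dry.  The BitReader and ChunkSource types and
the callback signature are hypothetical, not an existing library; in an
element, the callback is where the gst_pad_pull() would go.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit reader with an on-demand refill hook. */
typedef struct {
    const uint8_t *data;  /* current buffer */
    size_t size;          /* bytes in the current buffer */
    size_t pos;           /* next unread byte */
    uint32_t cache;       /* low cache_bits bits are not yet consumed */
    int cache_bits;
    /* Returns nonzero and sets *data and *size if another buffer exists. */
    int (*pull)(void *user, const uint8_t **data, size_t *size);
    void *user;
} BitReader;

/* Reads n bits (1..24), MSB first.  Returns -1 at end of stream. */
long get_bits(BitReader *br, int n)
{
    assert(n >= 1 && n <= 24);
    while (br->cache_bits < n) {
        if (br->pos == br->size) {
            /* Out of bytes: ask upstream for the next buffer. */
            if (br->pull == NULL ||
                !br->pull(br->user, &br->data, &br->size))
                return -1;
            br->pos = 0;
            continue;  /* re-check, the new buffer could be empty */
        }
        br->cache = (br->cache << 8) | br->data[br->pos++];
        br->cache_bits += 8;
    }
    br->cache_bits -= n;
    return (long)((br->cache >> br->cache_bits) & ((1u << n) - 1u));
}

/* Demo upstream: hands out a fixed list of chunks, one per pull. */
typedef struct {
    const uint8_t *chunks[2];
    size_t sizes[2];
    int count;
    int next;
} ChunkSource;

int chunk_pull(void *user, const uint8_t **data, size_t *size)
{
    ChunkSource *s = user;
    if (s->next >= s->count)
        return 0;
    *data = s->chunks[s->next];
    *size = s->sizes[s->next];
    s->next++;
    return 1;
}
```

The decoder just calls get_bits() and never sees buffer boundaries, so
there is nothing left for a separate parse element to do.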

What is not dealt with yet is state transitions in said elements.
Consider the case of a MPEG video decoder that's got state to worry about.
If you were to switch states and cause the plan to be regenerated (I'll
try to do that write-up tomorrow sometime), you could end up switching
back into the decoder element in the middle of a decode, while providing
it data from a brand-new stream.  Thus, some reset mechanism should be
provided for elements that need such functionality.  Entirely optional, of
course.
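
One possible shape for such a mechanism, strictly as a sketch: an
optional per-element reset callback that the Bin runs when it regenerates
the plan.  Nothing like this exists yet; the Element struct, the reset
field and bin_reset_elements() are all hypothetical.  It also only clears
element data; the element's cothread would presumably need a fresh start
too, which is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element with an optional reset hook. */
typedef struct Element Element;
struct Element {
    void (*reset)(Element *self);  /* NULL means nothing to reset */
    int bytes_into_frame;          /* example of mid-decode state */
};

/* A decoder would throw away its partial-frame state here. */
void decoder_reset(Element *self)
{
    self->bytes_into_frame = 0;
}

/* Hypothetical hook for plan regeneration: resets every element that
 * opted in and returns how many were reset. */
int bin_reset_elements(Element **elements, size_t n)
{
    int n_reset = 0;

    for (size_t i = 0; i < n; i++) {
        if (elements[i]->reset != NULL) {
            elements[i]->reset(elements[i]);
            n_reset++;
        }
    }
    return n_reset;
}
```

Elements without state simply leave the pointer NULL, which keeps it
entirely optional.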

Also, some of our discussions at work today pointed out the fact that we
need a pretty complete implementation of select(2) for inter-element
connections (I'm skirting around some competing terminology here... <g>),
which in the pure co-routine case means just 'randomly' switching to some
sourcing element.  Whichever one comes back first 'wins the select'.  In
the thread-boundary case things get more interesting, since you'll have a
queue attached.  In fact, I believe that the Connection (queue) case isn't
handled yet; I'll have to think through how that works and try it out.

A more generally useful capability is that of non-blocking pull()'s.
Strictly speaking, this is a bit of an oxymoron in most cases, thus it's
back to the queue case to figure out how that will work here.  That's the
only time you should really be blocking (on an empty queue), though that
would mean that someone isn't doing their job anyway....

Also, note that all of the cothread code is partitioned into the Bin
class.  This means that you can relatively trivially create your own
subclass (or even override just that part of an existing class for a given
instance [maybe?]) with custom scheduling routines, whether cothread-based
or not.  Though doing it without cothreads is going to get rather hairy in
a lot of cases.  That's where some of the Pipeline intelligence comes into
play, and the virtual FROZEN state.  Yet another thing to write up...

Sigh, I'm going to go sleep now.

TTYAL,
     Omega

         Erik Walthinsen <omega at cse.ogi.edu> - Staff Programmer @ OGI
        Quasar project - http://www.cse.ogi.edu/DISC/projects/quasar/
   Video4Linux Two drivers and stuff - http://www.cse.ogi.edu/~omega/v4l2/
        __
       /  \             SEUL: Simple End-User Linux - http://www.seul.org/
      |    | M E G A           Helping Linux become THE choice
      _\  /_                          for the home or office user