[gst-devel] getbits/bitstream and parsing (long)

Sun Feb 13 10:28:34 CET 2000

I've been looking at mp1videoparse, and I think I know roughly what it
does, but I'm wondering why.  Is it necessary to split out every chunk in
the video stream?  Can the mpeg decoder be set up to do it internally?

Before answering that, several other questions need answers first.  The
easiest is what we're going to do about getbits/bitstream code.  Aaron
Holtzman's bitstream code as found in mpeg2dec (see LiViD CVS for latest)
is looking pretty good, though he reverted to globals to store things. 
I'm thinking that my own getbits code might not be the best thing, since
it forcibly causes an indirect function call in all cases.  Aaron's
latest code uses inline functions for the simple case and calls a 'bottom
half' in the complex case (out of bits or bytes).
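
To make the inline/bottom-half split concrete, here is a minimal sketch.
It is not Aaron's code: the bitreader struct and every function name in it
are made up for illustration, and state lives in a struct instead of
globals.  The common case stays inline; only running out of cached bits
calls the out-of-line refill.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reader state (names are illustrative, not from mpeg2dec). */
typedef struct {
    const uint8_t *data;  /* input bytes */
    size_t size;          /* number of input bytes */
    size_t pos;           /* next byte to load into the cache */
    uint64_t cache;       /* unread bits, left-aligned at bit 63 */
    unsigned bits;        /* number of valid bits in cache */
} bitreader;

static void bitreader_init(bitreader *br, const uint8_t *data, size_t size)
{
    br->data = data;
    br->size = size;
    br->pos = 0;
    br->cache = 0;
    br->bits = 0;
}

/* "Bottom half": the slow path, called only when the cache runs low.
 * Past the end of input it pads with zero bits; a real element would
 * ask upstream for more bytes here instead. */
static void bitreader_refill(bitreader *br)
{
    while (br->bits <= 56) {
        uint64_t byte = (br->pos < br->size) ? br->data[br->pos++] : 0;
        br->cache |= byte << (56 - br->bits);
        br->bits += 8;
    }
}

/* Peek at the next n bits (1 <= n <= 32) without consuming them. */
static inline uint32_t bitreader_show(bitreader *br, unsigned n)
{
    if (br->bits < n)
        bitreader_refill(br);
    return (uint32_t)(br->cache >> (64 - n));
}

/* Read and consume the next n bits (1 <= n <= 32). */
static inline uint32_t bitreader_get(bitreader *br, unsigned n)
{
    uint32_t value = bitreader_show(br, n);
    br->cache <<= n;
    br->bits -= n;
    return value;
}
```

Reading a sequence header with this would be bitreader_get(&br, 32) for
the start code followed by two bitreader_get(&br, 12) calls for the width
and height.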

I'd like to come up with some generic code (which includes the show() and
other functions in Aaron's latest bitstream.c) that can be inlined,
switched, and optimized.  The first thing is to determine whether or not
an indirect function call is expensive enough to negate the advantages of
an MMX getbits.  If it is, then we have to think about how to use
compile-time options to pick what's going to be used.

The idea would be to have a header file that would supply integer, MMX,
and switchout versions based on #ifdefs.  An element would then be
responsible for picking which to use.  For best results, any element that
is able to would do its own switchout.  The chain function could be set based
on the capabilities probe, and you'd have a couple versions: one does
scalar, the other MMX.  Extreme cases would have more, if someone asm'd
them for other architectures.  In fact, I already do this to some extent
in one of the elements (volume), as a test.
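
A rough sketch of that kind of switchout, done once at setup time so the
per-buffer cost is a single indirect call.  This is not the code from the
volume element: CPU_FLAG_MMX, volume_pick() and the rest are hypothetical
names, and the "wide" variant is plain C standing in for where an MMX
routine would go.

```c
#include <stddef.h>
#include <stdint.h>

#define CPU_FLAG_MMX 0x1u  /* hypothetical capability bit from a probe */

/* Signature shared by every variant; gain is 8.8 fixed point (256 = 1.0). */
typedef void (*volume_func)(int16_t *samples, size_t n, int gain_q8);

static int16_t scale_clamp(int16_t s, int gain_q8)
{
    int32_t v = (int32_t)s * gain_q8 / 256;
    if (v > INT16_MAX) v = INT16_MAX;
    if (v < INT16_MIN) v = INT16_MIN;
    return (int16_t)v;
}

/* Scalar version: one sample per iteration. */
static void volume_scalar(int16_t *samples, size_t n, int gain_q8)
{
    for (size_t i = 0; i < n; i++)
        samples[i] = scale_clamp(samples[i], gain_q8);
}

/* Stand-in for an MMX version: two samples per iteration, then the tail.
 * A real one would use SIMD instructions here. */
static void volume_wide(int16_t *samples, size_t n, int gain_q8)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        samples[i] = scale_clamp(samples[i], gain_q8);
        samples[i + 1] = scale_clamp(samples[i + 1], gain_q8);
    }
    for (; i < n; i++)
        samples[i] = scale_clamp(samples[i], gain_q8);
}

/* Pick the processing function once, based on the capabilities probe. */
static volume_func volume_pick(unsigned cpu_flags)
{
    return (cpu_flags & CPU_FLAG_MMX) ? volume_wide : volume_scalar;
}
```

The element would store the result of volume_pick() at init and call it
from its chain function, so the choice is never re-evaluated per buffer.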

Once we have a good getbits routine in place that can ask for more bytes
(which won't work until I get the cothreads code fully functional), a seek
interface is in order.  This will be used to go to arbitrary points in the
stream.  The parser is responsible for seeking.  Chaining will be
supported, such that there will be a caching element designed to deal with
prefetching and back-seeks and such.  The parser would present a seek
interface, and then use the cache's seek interface to read bytes at the
appropriate offset.  The cache would use the real data source's seek
interface to do the real read.
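
The chaining could look something like the sketch below.  Nothing here is
an existing GStreamer interface: seekable, mem_source and block_cache are
invented names, the "real data source" is just a memory buffer, and the
cache holds a single block.  The point is only that each layer exposes the
same read-at-offset call and forwards misses to the layer beneath it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical seek interface shared by every layer. */
typedef struct seekable seekable;
struct seekable {
    /* Read up to len bytes at an absolute offset; returns bytes read. */
    size_t (*read_at)(seekable *self, uint64_t offset, uint8_t *dst, size_t len);
};

/* Bottom layer: stands in for the real data source. */
typedef struct {
    seekable base;        /* must be first so the casts below are valid */
    const uint8_t *data;
    size_t size;
    unsigned reads;       /* counts real reads, to show the cache working */
} mem_source;

static size_t mem_read_at(seekable *self, uint64_t offset, uint8_t *dst, size_t len)
{
    mem_source *s = (mem_source *)self;
    if (offset >= s->size)
        return 0;
    if (len > s->size - offset)
        len = (size_t)(s->size - offset);
    memcpy(dst, s->data + offset, len);
    s->reads++;
    return len;
}

static void mem_source_init(mem_source *s, const uint8_t *data, size_t size)
{
    s->base.read_at = mem_read_at;
    s->data = data;
    s->size = size;
    s->reads = 0;
}

/* Middle layer: one-block cache in front of any seekable. */
#define CACHE_BLOCK 16
typedef struct {
    seekable base;
    seekable *upstream;
    uint64_t block_start;
    size_t block_len;
    int valid;
    uint8_t block[CACHE_BLOCK];
} block_cache;

static size_t cache_read_at(seekable *self, uint64_t offset, uint8_t *dst, size_t len)
{
    block_cache *c = (block_cache *)self;
    size_t done = 0;
    while (done < len) {
        uint64_t pos = offset + done;
        uint64_t start = pos - pos % CACHE_BLOCK;
        if (!c->valid || c->block_start != start) {
            /* Miss: do the real read through the upstream interface. */
            c->block_len = c->upstream->read_at(c->upstream, start,
                                                c->block, CACHE_BLOCK);
            c->block_start = start;
            c->valid = 1;
        }
        size_t skip = (size_t)(pos - start);
        if (skip >= c->block_len)
            break;  /* end of stream */
        size_t n = c->block_len - skip;
        if (n > len - done)
            n = len - done;
        memcpy(dst + done, c->block + skip, n);
        done += n;
    }
    return done;
}

static void block_cache_init(block_cache *c, seekable *upstream)
{
    c->base.read_at = cache_read_at;
    c->upstream = upstream;
    c->block_start = 0;
    c->block_len = 0;
    c->valid = 0;
}
```

A parser would hold a seekable pointer to the cache and never know whether
a read was served from the cached block or from the source; short
back-seeks inside the current block cost no real read.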

Where things get interesting is when there are several parsers.  In the
case of mpeg1 video, we have the mpeg1parse element, as well as mp3parse
and mp1videoparse.  Obviously the system-stream parser will be doing the
seeking, since manipulating the audio or video independently doesn't make
sense.  However, when only audio or video is used, one of the other
parsers must be used instead, meaning the seek interface must be ignorable.

This brings up the point of how you seek in MPEG video streams.  It is not
at all obvious from the materials I have (Video Demystified, Digital
Video: Intro to MPEG-2) how one would go about seeking in time easily.  It
seems to me that the only way to do it is to estimate the byte offset at
current bitrate, seek, search for a GOP time_code or a picture
temporal_reference, determine error factor, and repeat.  The number of
iterations it takes to find the correct position is a function of the
variance in bitrate and the size of the seek.
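
Here is a sketch of that estimate/probe/correct loop, assuming a probe
callback that stands in for "seek to this byte offset and return the time
of the next GOP time_code found".  probe_func, seek_to_time() and the toy
stream in example_vbr_probe() are all invented for illustration, and
nothing guarantees convergence on a stream with a wild enough bitrate,
hence the iteration cap.

```c
#include <stddef.h>

/* Hypothetical probe: time in seconds of the first timecode at or after
 * byte_offset. */
typedef double (*probe_func)(double byte_offset, void *user);

/* Estimate an offset from the nominal byte rate, probe it, measure the
 * error in seconds, convert the error back to bytes, and repeat.
 * Returns the final offset; *corrections gets the number of adjustments. */
static double seek_to_time(probe_func probe, void *user,
                           double target_s, double bytes_per_s,
                           double tolerance_s, int max_iter,
                           int *corrections)
{
    double offset = target_s * bytes_per_s;
    int i;
    for (i = 0; i < max_iter; i++) {
        double found_s = probe(offset, user);
        double err_s = target_s - found_s;
        if (err_s < tolerance_s && err_s > -tolerance_s)
            break;
        offset += err_s * bytes_per_s;
        if (offset < 0)
            offset = 0;
    }
    if (corrections != NULL)
        *corrections = i;
    return offset;
}

/* Toy stream for trying it out: the first 10 seconds run at 100000
 * bytes/s, everything after that at 200000 bytes/s. */
static double example_vbr_probe(double byte_offset, void *user)
{
    (void)user;
    if (byte_offset <= 1000000.0)
        return byte_offset / 100000.0;
    return 10.0 + (byte_offset - 1000000.0) / 200000.0;
}
```

On the toy stream, seeking to 15 seconds with a nominal 150000 bytes/s
overshoots first and then closes in over a few corrections; on a truly
constant-rate stream the first estimate is already right.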

MPEG audio is easier, except in the VBR case, in which case it's
significantly harder due to the lack of timecodes.  At least VBR streams
are usually only found standalone, and most(?) times come with Xing VBR
headers.

The question is how we want to structure the elements.  I'd like to
eventually get rid of the separate parser elements for all elementary
stream types (ac3parse, mp3parse, mp1videoparse) and shove them into the
associated decoder.  I suppose that could be a bad thing, since there are
multiple mp3 decoders and trying to keep all of them in sync with the API
could be a pain, so maybe not...  Hmmm.

Perhaps think of it this way: the parser currently reframes the data.
Future parsers will not reframe data, relying on the decoder to do that
job (and probably do it better and with less overall overhead), but only
provide the seek control.  They would do only what is necessary to provide
seek, usually just passing buffers straight through.  In some cases they
might actually do some good, such as the fixed-rate mp3 case where the
parser could synchronize with the bitstream and set the blocksize
(currently bytes_per_read) to match the framesize, and thus set the type
of the outgoing buffers to audio/mpeg-frame instead of audio/mpeg.  A
smart decoder could use this as a hint to simplify its work.
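
For the fixed-rate mp3 case, the framesize the parser would match can be
computed from the frame header.  The sketch below uses the usual MPEG-1
Layer III formula (144 * bitrate / sample rate, plus one byte when the
padding bit is set); the function name is made up, and it assumes the
bitrate and sample rate have already been decoded from the header.

```c
/* Frame size in bytes for MPEG-1 Layer III.
 * bitrate_bps and sample_rate_hz are assumed to be already decoded from
 * the frame header; padding is the header's padding bit. */
static unsigned mp3_frame_bytes(unsigned bitrate_bps, unsigned sample_rate_hz,
                                unsigned padding)
{
    if (sample_rate_hz == 0)
        return 0;
    return 144u * bitrate_bps / sample_rate_hz + (padding ? 1u : 0u);
}
```

At 128 kbit/s and 48 kHz this gives a constant 384 bytes, so a fixed
blocksize lines up with frames exactly.  At 44.1 kHz the padding bit makes
frames alternate between 417 and 418 bytes, so even a "fixed-rate" parser
has to read the header of each frame to stay aligned.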

Obviously none of this applies to the system-level parsers, since they are
just demux elements.  However, they are arguably more complex.  Time seek
is via similar optimizing search, with the advantage of guaranteed
timecodes at a much more regular interval.

Where it gets really hairy is when we start talking about scrubbing
(defined here as repeated arbitrary seek, time-warp, and/or reverse). MPEG
is not designed to be easily scrubbed, since the GOP has to be decoded
forward and mostly complete in order to get access to random frames. 
Running at double-speed means some heavy overhead.  Consider an
IBBPBBPBBPBBPBB GOP.  You have to decode the uppercase frames: 
IbBPBbPbBPBbPbB.  That's 10 out of 15 frames when you're only displaying
7.5, or about 33% overhead (10/7.5)...  Obviously running other sequences
or rates could bring things back down to 0 overhead, but in order to
accomplish arbitrary time-warp a machine's going to have to have about a
third more power than is necessary for a straight-through play.
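
The frame counting can be checked with a few lines of C.  This is a
simplified model under the same assumptions as the paragraph above: the
GOP is written in display order, every I and P frame has to be decoded
because later frames reference it, and a B frame is decoded only if it is
actually shown.  It ignores the fact that the trailing B frames also need
the next GOP's I frame.  The function name is made up.

```c
#include <stddef.h>

/* Count frames that must be decoded when showing every step-th frame of
 * a GOP given in display order (e.g. "IBBPBBPBBPBBPBB").  step must be
 * at least 1.  If displayed is non-NULL it receives the number of frames
 * actually shown. */
static unsigned gop_frames_to_decode(const char *gop, unsigned step,
                                     unsigned *displayed)
{
    unsigned decode = 0, shown = 0;
    for (unsigned i = 0; gop[i] != '\0'; i++) {
        int is_shown = (i % step == 0);
        int is_reference = (gop[i] == 'I' || gop[i] == 'P');
        if (is_shown)
            shown++;
        if (is_shown || is_reference)
            decode++;
    }
    if (displayed != NULL)
        *displayed = shown;
    return decode;
}
```

With step 2 on the GOP above this reports 10 frames decoded for 8 shown
(a GOP starting on the other phase shows 7, hence the 7.5 average).  With
step 3 every shown frame is an I or P frame, so 5 are decoded for 5 shown,
which is one of the zero-overhead rates.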

Reverse is easier, except for the little fact that you're going to have to
buffer the entire decoded GOP in the naive case, or at minimum M+2 frames
if you're careful (M is P-frame distance).

The question is how can scrubbing be integrated into the seek interface,
and how do we implement it for both the elementary and system cases?
We'll ignore the fact that the framed output of things like audio decoders
has to be reversed on its way out for now.  The problem is that the system
stream decoder has to know the same thing the elementary stream parser
does, which indicates that they should be cooperative.  This leans back
towards having a parser for each elementary stream between the demux and
decoders, though with the parsers under the control of the demux somehow.

Bleagh, I'm getting tired, I need some sleep.

         Erik Walthinsen <omega at cse.ogi.edu> - Staff Programmer @ OGI
        Quasar project - http://www.cse.ogi.edu/DISC/projects/quasar/
   Video4Linux Two drivers and stuff - http://www.cse.ogi.edu/~omega/v4l2/
        __
       /  \             SEUL: Simple End-User Linux - http://www.seul.org/
      |    | M E G A           Helping Linux become THE choice
      _\  /_                          for the home or office user