[gst-devel] ogg and friends
Thomas Vander Stichele
thomas at apestaart.org
Wed May 19 10:43:06 CEST 2004
OK, so here's a mail trying to explain specifically what is going on and
what we need to support in GStreamer so we can do ogg streaming (bear in
mind, nobody is doing ogg/theora streaming yet).
Some things that need to be understood before trying to participate in
the discussion (I assume some of you already know all of this, but I'm
stating it clearly here so people can be sure they understand the
terminology):
- "Vorbis" and "Theora" are codec formats.
codecs take in raw data (video frames, chunks of audio samples) and
produce encoded data in packets. "Vorbis" and "Theora" are codecs made
by xiph. To be able to decode encoded data, decoders for "Vorbis" and
"Theora" need to get data in the same packets as they were encoded.
(This means a raw file dump of packets produced by "Vorbis" or "Theora"
encoders cannot be played back, since the packet boundary information is
lost.)
- "Ogg" is an encapsulation format.
It provides (for the scope of this discussion) encapsulation and
high-level seeking on the page level. "Ogg" knows about packets handed
to it, and uses pages to transport these packets on a higher level.
Packets handed to it also have "granulepos" (which is a codec-specific
mapping between "time" and "some internal format") and size info. Ogg
encapsulates these packets into "pages". A page can contain more than
one packet, and packets can be fragmented across pages. The page
structure provides packet boundaries, error detection, high-level seek
marks. (Think of pages like ATM frames or transmission protocol packets.)
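The page structure described in this bullet can be made concrete with a small sketch that parses the standard Ogg page header: the "OggS" capture pattern, the flags byte carrying the bos/eos bits, the granulepos, the serial number identifying the logical stream, and the segment table from which packet boundaries are recovered. Field offsets follow the published Ogg framing spec; the hand-built demo page is my own.

```python
import struct

def parse_ogg_page_header(data):
    """Parse the fixed 27-byte Ogg page header plus the segment table.

    Returns the fields relevant to this discussion; raises ValueError if
    the capture pattern doesn't match.
    """
    if data[:4] != b"OggS":
        raise ValueError("not an Ogg page")
    version, flags = data[4], data[5]
    granulepos, serial, seqno, crc, nsegs = struct.unpack_from("<qIIIB", data, 6)
    lacing = list(data[27:27 + nsegs])
    return {
        "continued": bool(flags & 0x01),  # first packet continues from previous page
        "bos": bool(flags & 0x02),        # beginning-of-stream page
        "eos": bool(flags & 0x04),        # end-of-stream page
        "granulepos": granulepos,
        "serial": serial,                 # identifies the logical ogg stream
        "seqno": seqno,
        "body_size": sum(lacing),         # lacing values give back packet boundaries
    }

# A minimal hand-built bos page containing one zero-length packet:
page = b"OggS" + bytes([0, 0x02]) + struct.pack("<qIIIB", 0, 1, 0, 0, 1) + bytes([0])
hdr = parse_ogg_page_header(page)
```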
- "Ogg Vorbis" and "Ogg Theora" or "Ogg A/V" are media mappings.
Media mappings describe how codecs are encapsulated in the muxer, and
can thus impose additional constraints on how the ogg muxer is to behave
to implement this media mapping.
- Pay close attention to the difference between "muxer", "codec" and
"media mapping". In effect, the muxer doesn't need any codec-specific
knowledge, and the codec doesn't need muxer-specific knowledge. It's
the media mapping implementation that needs to have knowledge of both.
The vorbis encoder produces packets based on incoming sample data.
- packet 1: small header packet, containing the Vorbis name+revision,
the audio rate, and the audio quality.
- packet 2: comment packet
- packet 3: setup packet
- all other packets: encoded audio data, with "timestamps" as the
granulepos (for vorbis, the absolute sample number)
The theora encoder produces a similar set of packets based on incoming
video frames:
- packet 1: small header packet, containing the theora name+revision and
the video caps (w/h, aspect, fps, colorspace, quality, ...)
- packet 2: comment
- packet 3: setup data (tables)
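Both codecs make these header packets self-identifying: each starts with a type byte followed by the codec name (0x01/0x03/0x05 + "vorbis" for the three vorbis packets, 0x80/0x81/0x82 + "theora" for theora). A sketch of how a media-mapping implementation could tell headers from data without decoding anything; a real demuxer would track stream state rather than sniff every packet:

```python
def classify_packet(packet):
    """Classify a Vorbis/Theora packet by its first bytes."""
    if packet[1:7] == b"vorbis":
        # Vorbis header packets: type byte 1, 3 or 5 followed by "vorbis".
        return {1: "vorbis ident", 3: "vorbis comment", 5: "vorbis setup"}.get(
            packet[0], "unknown")
    if packet[1:7] == b"theora":
        # Theora header packets: type byte 0x80, 0x81 or 0x82 + "theora".
        return {0x80: "theora ident", 0x81: "theora comment", 0x82: "theora setup"}.get(
            packet[0], "unknown")
    return "data"   # anything else: encoded audio/video data
```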
- "packets" are raw packets from an encoder, with size and granulepos
- "pages" group one or more packets; packets can span pages (for
example, when comment data is larger than the max page size, which is
close to 64K)
- "logical ogg stream" is the stream for the encoded version of one
original raw feed (ie. encoded audio stream or encoded video stream without
- "grouped physical ogg stream" or "multiplexed physical ogg
stream" is a stream where the pages from several logical ogg streams have been
interleaved. (ie, muxing is done on the page level). This is how multiple media
tracks are transmitted concurrently to produce for example a video file with
one or more audio tracks.
- Ogg has three types of pages:
- 1 "bos page" (beginning of stream), contains only one ogg packet,
which is typically very small and just enough info to identify the
stream format for that logical ogg stream
(while this is part of the current spec online, there's some confusion inside
Xiph as to whether this really is part of Ogg, or just part of the media
mappings)
- 0 or more "data pages" that contain the actual encoded data
- 1 "eos page" (end of stream)
(technically speaking, these are all the same kind of page, just with a
flag for bos or eos set in the header)
Some media mappings also specify that, in between the bos page and the data
pages, you can have:
- 0 or more "additional header pages", which can contain multiple header
packets
- In a logical ogg stream, the bos page comes before optional additional
header pages. The header pages come before any of the data pages.
- In a "grouped physical ogg stream":
- all "bos pages" for each of the logical streams come first. The
order is not specified by ogg, but by a "media mapping".
- all "additional header pages" for each of the logical streams come
- after that, "data pages" from each of the streams are sent.
- vorbis uses the audio sample number as granulepos; theora uses the
frame number
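Under the simple mapping used in this mail (vorbis granulepos counts audio samples, theora counts frames; note the final theora spec actually splits granulepos into a keyframe part and an offset), converting granulepos to time is just a division:

```python
def vorbis_granule_to_seconds(granulepos, rate):
    # Vorbis: granulepos is the absolute sample number since the start.
    return granulepos / rate

def theora_granule_to_seconds(granulepos, fps):
    # Simplified, per the text above: granulepos as a plain frame number.
    return granulepos / fps
```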
OGG VORBIS (media mapping)
The "Ogg Vorbis" media mapping imposes that:
- the first vorbis packet is used for the "bos" page
- vorbis packets 2 and 3 are sent in 1 or more "additional header pages",
but can be combined (ie. if they're short enough, packets 2 and 3 can
share one page).
- all other vorbis packets are sent in "data" pages
OGG A/V (media mapping)
The "Ogg A/V" media mapping specifies this page order:
- first theora packet for "bos" of video stream
- first vorbis packet for "bos" of audio stream
- "additional header pages" (1-x) for video stream
- "additional header pages" (1-y) for audio stream
- interleaved "data" pages for both audio and video stream
- any streaming decoder of "Ogg A/V" streams will need a way to get at
all the non-data starting pages for the stream. This means: vorbis bos,
theora bos, vorbis ahp, theora ahp. Once it has those, it can decode
the live "Ogg A/V" stream from any point on.
This is exactly how icecast works, for example.
- Here are the cases that I feel we need to get working:
- encoding and muxing an "Ogg A/V" stream and saving it to disk:
... oggmux ! filesink
- encoding and muxing an "Ogg A/V" stream and create a live TCP stream:
... oggmux ! tcpserversink
clients can connect to this server at any given point in time, and
when they do, tcpserversink should give them all the "bos pages" and
"additional header pages" before serving them encoded data.
Others have in this thread said that the following should also work:
- having identity after oggmux
... oggmux ! identity ! filesink
- having oggmux ! fakesink, and at some point replacing fakesink with
filesink
Suppose we use caps the way Ronald suggested. Here's what will happen.
- vorbisenc will create three header packets. These three packets are put
into one caps value, as a GList of buffers (the packets really need to be
kept separate, since decoders need them with their original boundaries)
- vorbisenc could, or could not, send out these buffers as GstBuffers as
well (there are reasons for doing so and reasons for not doing so; we'll
see about this later)
- theoraenc will do the same.
- oggmux will have two sink pads. Those pads have caps with a GList of
GstBuffers that form the packets. Oggmux will examine them, wrap the
packets in ogg pages, and thus have a set of pages it can output.
- if we want oggmux ! filesink to work, oggmux *has* to send these out
as buffers, since for an ogg stream they are part of the data stream.
- if we want oggmux ! tcpserversink to work, there *has* to be a way for
tcpserversink to know what bytes are "part of the header" and have to be
transmitted to each client before sending them the live encoded data.
- suppose we try doing this with a caps field served by oggmux that we
make tcpserversink aware of (which would be ok, since all streaming
elements would need such a concept). oggmux would have a "header" caps
field that contains a GList of GstBuffers where each GstBuffer is one
complete ogg page.
- now tcpserversink needs a mechanism for making sure that it doesn't
send these header pages twice; it receives them once as GstBuffers
(since we wanted to make oggmux ! filesink work) and once as "header"
caps. It could, for each incoming buffer, compare pointers with the
GList of GstBuffers on its caps, which would become expensive quickly.
- Alternatively, each element that puts GstBuffers on caps could also
mark that buffer with a flag that basically means "this buffer is also
part of stream caps somewhere". This would allow tcpserversink and
similar to optimize the checking; they only need to go through the list
when they receive a buffer with this flag.
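Inside a streaming sink like tcpserversink, that optimization could look like the sketch below (BUFFER_PART_OF_CAPS and the Buf class are hypothetical stand-ins for a GstBuffer flag and a refcounted buffer): only flagged buffers pay for the identity comparison against the caps' header list, and flagged buffers that were already delivered via caps are skipped so nothing is sent twice:

```python
BUFFER_PART_OF_CAPS = 1 << 0   # hypothetical flag from the proposal

class Buf:
    """Minimal stand-in for a GstBuffer."""
    def __init__(self, data, flags=0):
        self.data, self.flags = data, flags

def should_forward(buf, caps_headers):
    """True if the sink should transmit this buffer in the data stream.

    The pointer (identity) comparison is only done for flagged buffers,
    which is the optimization described above; plain data passes through
    with a single flag check.
    """
    if buf.flags & BUFFER_PART_OF_CAPS:
        return all(buf is not h for h in caps_headers)
    return True

hdr = Buf(b"header-page", flags=BUFFER_PART_OF_CAPS)
data = Buf(b"data-page")
caps_headers = [hdr]   # the GList of header buffers from the caps
```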
- The same "problem" exists for vorbisenc and theoraenc, although in
their case saving them to a file directly makes no sense in the first
place, so you could work around the issue. For clarity and consistency,
it would make sense however to make them behave the same way: put header
packets in the caps on a GList, and mark them with the
BUFFER_PART_OF_CAPS flag before pushing them out.
- We could have a caps function that does all of the needed bits
automatically: given a buffer,
- put it in the caps
- ref it
- mark the buffer with the flag
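That helper could look like this sketch; the "header" caps field name and the flag name follow the proposal above, and the Buf class with an explicit refcount stands in for GstBuffer refcounting (none of this is existing GStreamer API):

```python
BUFFER_PART_OF_CAPS = 1 << 0   # hypothetical flag from the proposal

class Buf:
    """Minimal stand-in for a refcounted GstBuffer."""
    def __init__(self, data):
        self.data, self.flags, self.refcount = data, 0, 1

def caps_set_header_buffer(caps, buf):
    caps.setdefault("header", []).append(buf)  # 1. put it in the caps
    buf.refcount += 1                          # 2. ref it (caps now holds a ref)
    buf.flags |= BUFFER_PART_OF_CAPS           # 3. mark it with the flag

caps = {}
buf = Buf(b"vorbis-header-packet")
caps_set_header_buffer(caps, buf)
```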
- oggmux ! identity ! fakesink, where fakesink is later replaced with
filesink, is for me personally a case of "please don't be silly". If
you want to deliberately break the framework by doing stupid stuff, that
will still be possible. Likewise, filesrc ! fakesink with fakesink later
replaced by oggdemux ! vorbisdec ! ... will break as well.
To summarize, to support all use cases:
- oggmux has to output bos and additional header pages as part of the data
stream
- oggmux could set them as part of the src caps as well, using a GList of
GstBuffers
- vorbisenc and theoraenc could for consistency do the same
- as a very useful optimization, a buffer flag could be used to mark the
buffers that were also transmitted through caps
- the concept of "media mapping" can be specified with an enum on oggmux for
now, until there is a need to separate it out. By default it would be "auto",
and it could also be set to "none" (which would just mux streams first-come
first-served, and would work for even MPEG video for example), "Ogg Vorbis"
or "Ogg A/V".
So, anyone see problems with this approach ? I feel that the extension
needed is generic enough. Still not very comfortable with putting big
buffers in caps (how about 128K caps because of jpg in vorbiscomment
headers ?) but I'll suck it up :)
Dave/Dina : future TV today ! - http://www.davedina.org/
<-*- thomas (dot) apestaart (dot) org -*->
Get a fella's motor running, let the tension
marinate a couple of days, then BAM !
You crown yourself the ice queen.
<-*- thomas (at) apestaart (dot) org -*->
URGent, best radio on the net - 24/7 ! - http://urgent.fm/