[Bug 752603] qtdemux: Unable to play streaming MP4 (H264+AAC) file from VLC

GStreamer (GNOME Bugzilla) bugzilla at gnome.org
Wed May 30 20:37:59 UTC 2018


https://bugzilla.gnome.org/show_bug.cgi?id=752603

Alicia Boya García <aboya at igalia.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |aboya at igalia.com

--- Comment #52 from Alicia Boya García <aboya at igalia.com> ---
(In reply to Thiago Sousa Santos from comment #51)
> Should we just remove the idea of having a movie segment and always rely on
> stream segments? The idea of movie segments was already there, in some
> places we sent the movie, in others the stream segment which is not ideal.

No, we should not; at least not while we continue to use GstSegment for
handling edit lists (which is the least bad solution I know of until we get
negative timestamps in GstBuffer, which will not happen in gst1).

This is a very long response, but I hope it serves to clarify the confusion
around segments and edit lists in qtdemux.

The movie segment and track segments serve different purposes in the current
design of qtdemux and should not be mixed carelessly.

The movie segment (the one in qtdemux->segment) specifies the stream time range
the user wants to play. It's modified when the user seeks. It affects all
tracks. For correct edit list handling this segment SHOULD NOT be emitted
downstream directly. Note this segment does not tell us how to map buffer time
to stream time (it can't, as the mapping is usually different for each track).
It just serves to express the wish of the user to play a certain stream time
range, at a certain playback rate.

Buffer time in qtdemux does not account for edit lists. It comes almost
straight out of the tables found in the MP4 file, converted to nanoseconds. CTS
becomes GstBuffer.pts and (media) DTS becomes GstBuffer.dts. These timestamps
are referred to collectively in the spec as "media time".

As surprising as it may sound, the frames in an MP4 track are not intended to
be decoded or played directly in the order they are declared in the frame
tables. Instead, an MP4 track is constructed with one or more edits (also
called "segments", but that word is extremely overloaded, so I'll avoid it in
this context) that are defined in an edit list (the elst box inside the edts
box) in the trak atom of the moov.

Each edit references a range of "media time" that ought to be played. Edits are
played in the order they are defined in the edit list.

Take a look at the following illustrative edit list. In practice edit lists
with several media edits like this are pretty rare -- usually edit lists are
used just to set a small offset between media time and stream time, but looking
at the complex case helps to understand their design and the mechanisms
involved.

duration | media_time | rate
-----------------------------
5 s      | 30 s       | 1.0
3 s      | 10 s       | 1.0
5 s      | 30 s       | 1.0

Assuming that every frame is 1 second long, this is the succession of frames
that should be played for this file, in media time PTS:

30 31 32 33 34 10 11 12 30 31 32 33 34
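As an illustration, that expansion can be sketched with a tiny helper (this is
made-up code for the example, not qtdemux code; `Edit` and `expand_edits` are
hypothetical names, and it assumes 1-second frames and rate 1.0):

```c
#include <assert.h>
#include <stddef.h>

/* One entry of the edit list: how long to play, starting where in media time.
 * Rate is assumed to be 1.0 throughout this sketch. */
typedef struct { int duration; int media_time; } Edit;

/* Writes the media-time PTS of every frame to play into out[], in play order;
 * returns how many frames were written. Assumes 1-second frames. */
static size_t expand_edits (const Edit *edits, size_t n_edits, int out[])
{
  size_t n = 0;
  for (size_t i = 0; i < n_edits; i++)
    for (int t = 0; t < edits[i].duration; t++)
      out[n++] = edits[i].media_time + t;
  return n;
}
```

Feeding it the three edits from the table above yields exactly the 13-frame
succession shown.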

Edit lists make it possible to skip frames that are coded in the file, play
them several times, play portions of them in an order different from the one in
which they are coded, and even play them at different speeds.

This dance of frames inside the tracks should be transparent for the user, who
should see in their player the following (stream time) timestamps in the
progress bar when playing the movie:

00 01 02 03 04 05 06 07 08 09 10 11 12

The reported duration of the track should be 13 seconds (the result from
summing the duration of the edits), regardless of the duration of the frames
that are actually coded in the mdat. (Note this duration may be bigger or
smaller than the duration of the coded frames, since edits can be repeated and
not all coded frames need to be included in an edit.)

In qtdemux edit lists are accounted for by emitting a GstSegment for every
edit. The previous file -- in absence of qtdemux bugs -- would output the
following sequence of segments and buffers when played from the beginning:

GstSegment: time=0 base=0 start=30 stop=35
Buffer: pts=30
Buffer: pts=31
Buffer: pts=32
Buffer: pts=33
Buffer: pts=34
GstSegment: time=5 base=5 start=10 stop=13
Buffer: pts=10
Buffer: pts=11
Buffer: pts=12
GstSegment: time=8 base=8 start=30 stop=35
Buffer: pts=30
Buffer: pts=31
Buffer: pts=32
Buffer: pts=33
Buffer: pts=34

For each buffer we can apply the following formulas to translate buffer PTS or
buffer DTS from the "media timeline" to stream time (the timestamps that should
be displayed in the application) and running time (timestamps that run since
playback started and are used for synchronization).

stream_time = (B.timestamp - S.start) * ABS (S.applied_rate) + S.time
running_time = (B.timestamp - (S.start + S.offset)) / ABS (S.rate) + S.base

In this case, since the file was played from the beginning without pauses
stream_time equals running_time, but that's not the case when seeking. The
demuxer should emit appropriate segments for each track in order to output the
correct frames with the correct stream_time and running_time.
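The two formulas can be sketched as plain C and checked against the playback
example above (illustrative only; `Seg` mirrors a few GstSegment field names
but is not the GStreamer API):

```c
#include <assert.h>

/* A simplified stand-in for the GstSegment fields used by the formulas. */
typedef struct {
  double time, base, start, offset, rate, applied_rate;
} Seg;

static double seg_abs (double x) { return x < 0 ? -x : x; }

/* stream_time = (B.timestamp - S.start) * ABS (S.applied_rate) + S.time */
static double to_stream_time (const Seg *s, double buf_ts)
{
  return (buf_ts - s->start) * seg_abs (s->applied_rate) + s->time;
}

/* running_time = (B.timestamp - (S.start + S.offset)) / ABS (S.rate) + S.base */
static double to_running_time (const Seg *s, double buf_ts)
{
  return (buf_ts - (s->start + s->offset)) / seg_abs (s->rate) + s->base;
}
```

With the first segment from the playback example (time=0 base=0 start=30),
buffer PTS=32 maps to stream time 2 and running time 2; with the second one
(time=5 base=5 start=10), buffer PTS=11 maps to stream time 6 and running
time 6, as expected when playing from the beginning.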

Let's look at an example of seeking in the file with the edit list shown
before.

It's important to note that the seek request is handled by the whole demuxer,
not by its individual tracks; therefore the timestamps refer to movie time
(stream time of the tracks), not buffer time of any track.

seek GstSegment: start=6 stop=11

The expected sequence of segments and buffers that qtdemux should emit in
response to the seek is this:

GstSegment: time=6 base=0 start=11 stop=13
Buffer: pts=11 (stream time PTS=6, running time PTS=0)
Buffer: pts=12 (stream time PTS=7, running time PTS=1)
GstSegment: time=8 base=2 start=30 stop=32
Buffer: pts=30 (stream time PTS=8, running time PTS=2)
Buffer: pts=31 (stream time PTS=9, running time PTS=3)
Buffer: pts=32 (stream time PTS=10, running time PTS=4)

(This assumes all frames are sync frames for simplicity... additional frames
need to be prepended to the first buffer if it's not a sync frame.)

At any point in time, qtdemux->segment should contain the current seek segment,
whilst for each track stream->segment contains the segment currently used to
map the frames in the current edit from buffer time to stream time and running
time. Therefore stream->segment should change -- and a new segment event should
be emitted downstream -- not only when the user performs a seek, but also when
an edit finishes and another one follows.

## What are edit lists used for, really?

In reality, edit lists are rarely used for editing tasks as complex as the
example above. Correct edit list handling adds a lot of complexity to demuxers
in exchange for obscure features that are rarely used.

Despite this, there is one small use case for which edit lists were deemed fit
and are used by many, if not most, MP4 files: getting correct timestamps when
using B-frames.

The problem goes like this: imagine you have a video track with three frames
like this:

.    +---+   +---+   +---+
.    | I |-->| B |<--| P |
.    +---+   +---+   +---+

PTS    0       1       2

In the example above, B depends on a frame that comes after it in presentation
order. Therefore, we need to code the frames in a different order so that the
decoder is able to decode them successfully: IPB. This reordering is
accomplished in MP4 with the DTS timestamps, but there are several restrictions
we need to take into account:

a) The DTS of the first frame is always zero. This is due to the way frames are
coded in MP4 tables (DTS is computed by summing the duration of the previous
frames in the table).

b) For every frame, PTS >= DTS. You can't show a frame before it has been
decoded, after all. Indeed, in most files PTS is coded as an unsigned offset
from DTS in the MP4 tables. Newer versions of MP4 allow a signed offset and can
therefore break this rule, but that is only intended for coding non-displayed
frames (where a negative PTS is used to signal they should never be displayed);
in other cases it must be corrected with a cslg box -- which will be explained
later.

With these restrictions, this is the best we can get -- at least without
tampering with the frame durations:

.    +---+   +---+   +---+
.    | I |-->| B |<--| P |
.    +---+   +---+   +---+

PTS    1       2       3
DTS    0       2       1

Note that we no longer have any frame at PTS=0... we had to shift all the PTS
to satisfy coding dependencies. Most of the time the user will not notice any
difference (it's just one frame duration), but that makes it an imperfect
solution; we would expect movies to play the first video frame at 0:00, not at
0:01.
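Restrictions (a) and (b) can be sketched as code: DTS is the running sum of the
previous sample durations (stts-style) and PTS adds an unsigned composition
offset (ctts version 0). This is a simplified model for the example, not
qtdemux code; the function and array names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Derive DTS and PTS for n samples in coding order.
 * durations[] plays the role of the stts table, cts_offsets[] of a
 * version-0 (unsigned) ctts table. */
static void compute_timestamps (const int *durations,
                                const unsigned *cts_offsets,
                                size_t n, int *dts, int *pts)
{
  int acc = 0;
  for (size_t i = 0; i < n; i++) {
    dts[i] = acc;                        /* (a) the first DTS is always 0 */
    pts[i] = acc + (int) cts_offsets[i]; /* (b) unsigned offset => PTS >= DTS */
    acc += durations[i];
  }
}
```

For the IPB example (coding order I, P, B; 1-second frames; offsets 1, 2, 0)
this reproduces the shifted table above: PTS 1, 3, 2 and DTS 0, 1, 2 in coding
order, i.e. PTS 1, 2, 3 and DTS 0, 2, 1 in presentation order.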

The most commonly supported way to fix this issue is with a simple edit list,
like this:

duration | media_time | rate
-----------------------------
3 s      | 1 s        | 1.0

This is the most common kind of edit list, referred to unofficially as a "basic
edit list". Once applied, stream time PTS=0 (as shown in the player UI)
corresponds to the first frame (which has buffer PTS=1).
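In segment terms, a basic edit played from the beginning boils down to a single
GstSegment whose start/stop cover the edit's media-time range and whose time is
0. A tiny sketch (hypothetical struct and function names, not qtdemux code):

```c
#include <assert.h>

/* The four segment values relevant here, with rate 1.0 assumed. */
typedef struct { int time, base, start, stop; } SegVals;

/* Segment for a single basic edit (duration, media_time) played from the
 * beginning: buffer time [media_time, media_time + duration) maps onto
 * stream time starting at 0. */
static SegVals segment_for_basic_edit (int duration, int media_time)
{
  SegVals s = { 0, 0, media_time, media_time + duration };
  return s;
}
```

For the table above (duration 3 s, media_time 1 s) this gives start=1, stop=4,
time=0, so the first frame (buffer PTS=1) lands at stream time 0.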

## Empty edits

It's also possible to specify that in a certain stream time range there will be
no frames. This is called an empty edit, and it's coded by specifying
media_time = -1.

The most common use of empty edits is to offset a track in such a way that
stream time is greater (not less) than unedited media time. This is useful, for
instance, to fix A/V synchronization, especially when the audio and video
tracks don't have exactly the same length.

As an example, imagine we have initially an audio track like this:

duration | media_time | rate
-----------------------------
100 s    | 0 s        | 1.0

If we wanted to discard the first 2 seconds of the audio track we would just
edit media_time like in the previous example:

duration | media_time | rate
-----------------------------
98 s     | 2 s        | 1.0

On the other hand, if we want to displace the audio 2 seconds ahead we need to
insert an empty edit. The first 2 seconds of the movie will have no sound, then
the audio track will start playing from the first frame. The resulting track is
102 seconds long.

duration | media_time | rate
-----------------------------
2 s      | -1         | 1.0
100 s    | 0 s        | 1.0

Currently qtdemux emits GstSegments for empty edits too. This should be
unnecessary, as empty edits, by definition, contain no frames. If you look at
them, beware of their values: for empty edits, start and stop refer to stream
time, not buffer time (after all, there is no buffer time to map in this case,
since there are no frames).
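The walk over an edit list with empty edits can be sketched like this
(illustrative only, not qtdemux code; here empty edits produce no segment, in
line with the point above, and only shift the stream time of what follows):

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int duration; int media_time; } Edit;
typedef struct { int time, start, stop; } SegVals;

/* Fills segs[] with one segment per non-empty edit; returns the count.
 * Empty edits (media_time == -1) contribute no segment; they only advance
 * the stream time at which the following edits start. Rate 1.0 assumed. */
static size_t segments_for_edits (const Edit *edits, size_t n, SegVals *segs)
{
  size_t n_segs = 0;
  int stream_pos = 0;
  for (size_t i = 0; i < n; i++) {
    if (edits[i].media_time != -1) {
      segs[n_segs].time = stream_pos;
      segs[n_segs].start = edits[i].media_time;
      segs[n_segs].stop = edits[i].media_time + edits[i].duration;
      n_segs++;
    }
    stream_pos += edits[i].duration;
  }
  return n_segs;
}
```

For the displaced audio track above (2 s empty edit, then 100 s of media) this
produces a single segment with time=2, start=0, stop=100: the first audio frame
(buffer PTS=0) plays at stream time 2.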

## Edit list support

In qtdemux full edit list support is targeted only in pull mode, since that
scheduling mode is much more appropriate for the kind of dance performed by
advanced edit lists.

Support for advanced edit lists (those with more than one non-empty edit) is
quite limited in most players.

Basic edit lists have much better support. Commonly used muxers use them to
make sure the first video frame starts at stream time PTS=0 as explained
before.

qtdemux allows only basic edit lists in push mode.

## Bonus section: About cslg_shift

The iso4 brand (introduced in ISO 14496-12:2012) adds a new box (cslg:
composition to decode timeline mapping) that introduces an alternative way to
solve the same PTS=0 issue. It's rarely used though, as basic edit lists have
better player support, but you may still see cslg_shift in many places in
qtdemux, so it's a good idea to know how it works.

Starting with the iso4 brand, you can use ctts version 1, which allows signed
composition offsets and therefore makes it possible to code a PTS < DTS. This
means you can code the following in the MP4 sample tables:

.    +---+   +---+   +---+
.    | I |-->| B |<--| P |
.    +---+   +---+   +---+

PTS    0       1       2
DTS    0       2       1

But there is a catch: the coded DTS is used as the buffer DTS in GStreamer,
just as before; the PTS, on the other hand, needs to be adjusted so that buffer
PTS >= buffer DTS. This is done by adding cslg.compositionToDTSShift
("cslg_shift" in qtdemux) to every PTS value.

/* timestamp is the DTS */
#define QTSAMPLE_DTS(stream,sample) \
  (QTSTREAMTIME_TO_GSTTIME((stream), (sample)->timestamp))
/* timestamp + offset + cslg_shift is the outgoing PTS */
#define QTSAMPLE_PTS(stream,sample) \
  (QTSTREAMTIME_TO_GSTTIME((stream), \
      (sample)->timestamp + (stream)->cslg_shift + (sample)->pts_offset))

In this example cslg_shift needs to be 1 second, as that is the smallest value
by which we can increment every PTS so that the PTS >= DTS rule holds again.
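That "smallest value" is just the largest DTS - PTS excess over all samples. A
sketch of the derivation (illustrative; a muxer would compute something like
this when writing the cslg box, whereas qtdemux simply reads the stored value):

```c
#include <assert.h>
#include <stddef.h>

/* Smallest non-negative shift that, added to every PTS, restores
 * PTS >= DTS for all n samples. */
static int compute_cslg_shift (const int *dts, const int *pts, size_t n)
{
  int shift = 0;
  for (size_t i = 0; i < n; i++)
    if (dts[i] - pts[i] > shift)
      shift = dts[i] - pts[i];
  return shift;
}
```

For the example above in coding order (I, P, B): DTS 0, 1, 2 and coded PTS
0, 2, 1; only B violates the rule, by 1 second, so the shift is 1.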

But of course, we still want the first frame to be presented at running time
and stream time zero. Therefore, we need to take cslg_shift into account in our
GstSegment computation: GstSegment.start and GstSegment.stop -- that is, the
fields that refer to buffer time -- must have `cslg_shift` added.

GstSegment: time=0 base=0 start=1 stop=4

cslg and edit lists are not exclusive to each other. The media_time specified
in edit lists refers to PTS as it's coded in the file. For instance, this
example could have the following super simple edit list that performs no
adjustment of PTS, since that is already taken care of by cslg:

duration | media_time | rate
-----------------------------
3 s      | 0 s        | 1.0

## qtdemux in MSE vs qtdemux in a full player

This is mostly unrelated to the previous points, but still something that has
to be taken into account when working with segments in a demuxer.

A typical player (simplified) looks like this:

  src -> queue -> qtdemux -> decoder -> sink

In this setup, qtdemux is responsible not only for demuxing, but also for
handling seeks. That makes sense usually because it's the element that has the
frame tables and therefore knows where to look for the frames in the file.

In DASH this stops being true. In this case, often separate fragmented MP4
files are used, so qtdemux itself cannot know where to locate a frame outside
the current fragment. Instead, dashdemux or another similar element that knows
how to locate those files (e.g. by parsing an MPD file) handles those seek
requests by downloading a different file and feeding it to qtdemux, which will
only be responsible for the seek inside the fragment.

In MSE (Media Source Extensions) the separation of responsibilities is even
bigger. There is no dashdemux or anything like that. Parsing the MPD file (or
any other similar manifest) and downloading fragments is done entirely by a
JavaScript application. This JS application sends those fragments to a demuxer
(qtdemux or matroskademux) using the SourceBuffer API. The demuxer is
responsible for demuxing the received fragments into frames (GstSample objects)
with correct stream time timestamps. Those frames end up stored in a series of
data structures that are read later by the media player -- which is handled in
a completely separate GStreamer pipeline. The demuxer is not involved in
playback and has no role in handling seeks at all. The running time generated
by qtdemux in this set-up is meaningless.

  AppendPipeline (demuxer): JavaScript source -> qtdemux -> browser
  PlaybackPipeline (MediaPlayer): browser internal APIs -> decoder -> sink

The API used to demux frames in MSE is very slim. There is `appendBuffer()`,
which feeds some bytes, and `abort()`, which flushes the current fragment.
That's all. The demuxer is responsible for demuxing whatever fragments it
receives into correct frames. The JS application may feed the fragments in any
order as long as a moov has been parsed first (e.g. a fragment spanning [10,
20) seconds may be demuxed before a fragment spanning [0, 10) s if the former
finished downloading faster). This is something that is unlikely to happen in
non-MSE playback but should be contemplated for this kind of application.

Since gst_segment_to_stream_time_full() can handle positions outside [start,
stop), emitting a GstSegment that starts at a non-zero position can be fine as
long as frames received later with a PTS < start are emitted too and stream
time is correct.
