Webrtcbin: perfectly timestamped transmission of (non-live) A/V file source

Philipp B philippb.ontour at gmail.com
Sat Mar 12 14:27:21 UTC 2022

Hi all,

I am working on a somewhat experimental project to implement a "WebRTC
media player" with gstreamer. I am aiming to stream basically any A/V
media file over WebRTC at a quality as close as possible to protocols
like MPEG-DASH or HLS. Obviously, latency does not play a role here,
but the focus will be on keeping playout as closely in sync as
possible between the different peers being served in parallel.

I know this is not exactly the target application for WebRTC, but
that's also what makes it interesting as an experimental project. I
have already read about the playout-delay WebRTC extension, which will
be worth looking at at some point, but I am far from being there.

At the moment, I am just trying to get a single peer stream over a
stable LAN connection as smooth as possible. I started with video-only
transmission, not caring too much about jitter, which was pretty
straightforward. Now I am looking into audio, and it turns out to be
far more involved.

For now, I am testing an audio-only transmission, but of course I
need to add synced video back later.

I learned that filesrc does not fulfill the "live/sync" requirement
for playing out audio through rtpopuspay. Some places on the web
mention that multifilesrc combined with "identity sync=true" is suited
to creating a "fake live" media source.

I played with various variants, unsure where to put the identity
element, and where to set "do-timestamp" (and whether it is needed at
all). As a reference, this is my current pipeline, which may contain
some clutter, but it's about the best I got judging by audio quality:

multifilesrc location=... index=0 do-timestamp=1 ! queue ! decodebin ! \
  queue ! identity sync=true ! queue ! audioconvert ! audioresample ! \
  audio/x-raw,channels=2,rate=48000 ! \
  opusenc bitrate=128000 frame-size=10 ! \
  rtpopuspay pt=97 min-ptime=10000000 max-ptime=10000000 ! webrtcbin

I chose frame-size=10 because the playback artifacts are worse with
smaller frame sizes, so choosing a rather small frame size makes them
easier to spot. Once I get the issues resolved, I plan to use
frame-size=60. min-ptime and max-ptime are something I found mentioned
somewhere; I thought they might help reduce jitter in the RTP
timestamps, but I'm unsure about their effect.

Playing this in the browser (with music input), I can hear some
artifacts that are annoying but would be acceptable for voice as far
as I can tell. My current focus is to get this audio as good as
possible.
chrome://webrtc-internals tells me that samples are added and removed
to adjust timing, which explains the perceived artifacts pretty
accurately. Looking at the decoded RTP, I can see that the RTP
timestamps are indeed slightly jittered. Ideally, I would expect the
RTP timestamp increments to be a steady 480 (48 kHz at a 10 ms frame
size).
This is where I could use some help...

What I expect to work (I haven't tested it yet) is to use the datarate
option of the identity element on either raw or strictly CBR-encoded
audio to rewrite timestamps. However, I would then probably lose the
original timing information, and with it my option to keep the video
in sync. My guess is that in the end I need typical media-player logic
that syncs everything to the audio, but obviously I am too lost to
find the best way to get there.
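The core of that re-timestamping idea can be sketched independently of GStreamer: derive outgoing timestamps from an accumulated sample count instead of trusting the incoming buffer PTS. This is a hypothetical helper, not a GStreamer API; in a real pipeline this logic would live in a pad probe or a custom element that re-stamps buffers before the Opus encoder:

```python
CLOCK_RATE = 48000  # output sample rate after audioresample

class SampleClock:
    """Derive jitter-free PTS values from a running sample count.

    Illustrative sketch only: PTS is computed purely from how many
    samples have been emitted so far, so it cannot inherit jitter
    from upstream timestamps.
    """

    def __init__(self):
        self.samples = 0

    def stamp(self, n_samples):
        pts_ns = self.samples * 1_000_000_000 // CLOCK_RATE
        self.samples += n_samples
        return pts_ns

clock = SampleClock()
# Five 480-sample (10 ms) frames get perfectly even timestamps:
pts = [clock.stamp(480) for _ in range(5)]
print(pts)  # [0, 10000000, 20000000, 30000000, 40000000]
```

The trade-off the paragraph above hints at still applies: timestamps derived this way drift away from the file's original timing, so the video branch would have to be synced against this regenerated clock rather than the source PTS.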

Thanks for reading this far; I hope I have made my issue clear. Any
pointers or comments would be useful. Some specific questions I can
think of:

- What's the easiest way to get this sorted? I mean, transmitting an
A/V media file's audio continuously, with no RTP timestamp jitter, and
with video in sync with it?
- In the above pipeline, where exactly is my RTP jitter coming from? I
assume "identity sync=true" is timing the throughput based on the
source's timestamps, and the jitter is basically a quantization
artifact because the original frame size does not match the re-encoded
frame size. Is this plausible? That would mean the "sync/live" nature
of the source is not needed for the timestamps at all, just for
transmission timing (correct?)
- Am I assuming correctly that rtpopuspay does not keep a context of
the stream, and is creating an RTP timestamp with no knowledge of the
last packet's timestamp?
- Are there any handy debug level filters to trace consecutive packets
through the pipeline with regard to their timestamps?
- Is there concise documentation on how many and which timestamps a
frame can actually have in gstreamer, where they are derived from, and
what they are generally used for (in case there is anything else
besides PTS and DTS - I am still in doubt whether there is a third,
like MPEG2-TS's PCR, involved here...)
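On the quantization hypothesis in the second question: if the payloader converts each buffer's PTS to RTP units independently, any sub-tick misalignment of the re-encoded frame boundaries rounds to ±1 tick of timestamp jitter. A small numeric illustration (plain Python; the offsets are made up for illustration, not measured from a real pipeline):

```python
CLOCK_RATE = 48000
FRAME_NS = 10_000_000  # 10 ms Opus frames

def pts_to_rtp(pts_ns):
    # Independent per-packet conversion, as hypothesized above:
    # no memory of the previous packet's timestamp.
    return round(pts_ns * CLOCK_RATE / 1_000_000_000)

# Perfectly aligned PTS -> constant increments of 480 ticks.
clean = [pts_to_rtp(i * FRAME_NS) for i in range(5)]
print([b - a for a, b in zip(clean, clean[1:])])  # [480, 480, 480, 480]

# PTS nudged by sub-tick amounts (one RTP tick is ~20833 ns at 48 kHz),
# e.g. because source frame boundaries don't land on 10 ms multiples:
offsets_ns = [0, 5_000, -8_000, 12_000, -3_000]  # made-up values
jittered = [pts_to_rtp(i * FRAME_NS + off)
            for i, off in enumerate(offsets_ns)]
print([b - a for a, b in zip(jittered, jittered[1:])])  # [480, 480, 481, 479]
```

This matches the observed symptom: increments hovering around 480 rather than exactly 480, which the browser's NetEQ then compensates for by inserting and removing samples.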

Thanks for any help!
