Trying to keep non-transcoded audio and transcoded video in sync

Matthew Shapiro me at mshapiro.net
Sun Feb 13 18:23:43 UTC 2022


I have a custom RTMP server written in Rust, and I'm trying to use GStreamer to provide dynamic transcoding pipelines.  I wrote all the RTMP code by hand, so I have some media knowledge, but I'm far, far from an expert, and I'm confused about some timing issues I'm hitting.

I have a pre-encoded FLV of Big Buck Bunny, and I'm using ffmpeg to push the video (without transcoding) into my RTMP server, with ffplay acting as the RTMP client.  When I do zero transcoding the audio and video are perfectly synced, but once I pass the packets through x264enc for encoding, the audio arrives several seconds before the corresponding video.  My understanding is that RTMP clients/video players use the RTMP timestamp values to keep audio and video in sync, so either the timestamps I'm getting from GStreamer are incorrect or I'm misunderstanding something.

For transcoding I'm using the following pipeline for my proof of concept:

-------------------
appsrc name=input ! decodebin ! videoscale ! video/x-raw,width=800,height=600 ! x264enc speed-preset=veryfast name=encoder ! h264parse name=parser ! appsink name=output
-------------------
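
For reference, here's roughly how that pipeline and its named elements can be wired up in Rust.  This is a minimal sketch (it assumes gst::parse_launch and gstreamer-rs 0.18-era APIs; my actual POC differs in the details):

-------------------
use gstreamer as gst;
use gstreamer::prelude::*;
use gstreamer_app::{AppSink, AppSrc};

fn build_pipeline() -> Result<(gst::Pipeline, AppSrc, AppSink), Box<dyn std::error::Error>> {
    gst::init()?;

    // Build the pipeline from the same launch string shown above
    let pipeline = gst::parse_launch(
        "appsrc name=input ! decodebin ! videoscale \
         ! video/x-raw,width=800,height=600 \
         ! x264enc speed-preset=veryfast name=encoder \
         ! h264parse name=parser ! appsink name=output",
    )?
    .downcast::<gst::Pipeline>()
    .expect("a launch string with multiple elements yields a Pipeline");

    // Pull the elements back out via their name= properties
    let video_source = pipeline
        .by_name("input")
        .unwrap()
        .downcast::<AppSrc>()
        .unwrap();
    let video_sink = pipeline
        .by_name("output")
        .unwrap()
        .downcast::<AppSink>()
        .unwrap();

    // Buffers are pushed with explicit timestamps, so use time format
    video_source.set_format(gst::Format::Time);

    Ok((pipeline, video_source, video_sink))
}
-------------------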

When an RTMP publisher connects and starts sending media, if the packet it supplies is an audio packet I pass it straight through to the RTMP clients for playback.  However, if the packet is a video packet I use the following code to build a GStreamer buffer for my appsrc.

-------------------
pub fn push_video(&self, data: Bytes, timestamp: RtmpTimestamp, is_sequence_header: bool) {
    if data.len() <= 4 {
        return;
    }

    let mut buffer = Buffer::with_size(data.len() - 4).unwrap();
    {
        // parse AVCVIDEOPACKET.CompositionTime (3 bytes, big-endian)
        let byte1 = data[1];
        let byte2 = data[2];
        let byte3 = data[3];

        let pts = ((byte1 as u64) << 16) | ((byte2 as u64) << 8) | (byte3 as u64);
        let pts = pts + timestamp.value as u64;

        let buffer = buffer.get_mut().unwrap();
        buffer.set_dts(ClockTime::from_mseconds(timestamp.value as u64));
        buffer.set_pts(ClockTime::from_mseconds(pts));

        // copy the payload, skipping the 4-byte AVCVIDEOPACKET header
        let mut samples = buffer.map_writable().unwrap();
        samples.as_mut_slice().copy_from_slice(&data[4..]);
    }

    if is_sequence_header {
        self.video_source.set_caps(Some(
            &Caps::builder("video/x-h264")
                .field("codec_data", buffer)
                .build()
        ));
    } else {
        self.video_source.push_buffer(buffer).unwrap();
    }
}
----------------------

The timestamp value comes from the RTMP chunk itself, and from my tests it appears that the RTMP timestamp acts as the dts, with the AVCVIDEOPACKET's composition time being the offset from dts to pts.  The "-4" offsets are there because my code strips the FLV tag but not the 4-byte AVCVIDEOPACKET header yet (this is all still POC).  `self.video_source` is the "input" appsrc element pulled from the pipeline by name.
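
One caveat I'm aware of but ignoring for the POC: per the FLV spec the CompositionTime field is a signed 24-bit (SI24) value, so a fully correct parse would sign-extend it rather than treat it as unsigned, something like:

-------------------
// CompositionTime is SI24: sign-extend the 24-bit big-endian value
let raw = ((data[1] as i32) << 16) | ((data[2] as i32) << 8) | (data[3] as i32);
let composition_time = if raw & 0x80_0000 != 0 {
    raw | !0xFF_FFFF // negative: fill the top byte with the sign bit
} else {
    raw
};
let pts = timestamp.value as i64 + composition_time as i64;
-------------------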

When the buffer goes through the pipeline, it gets picked up by the appsink via the following callback:

----------------------
let mut sent_codec_data = false;
video_app_sink.set_callbacks(
    AppSinkCallbacks::builder()
        .new_sample(move |sink| {
            if !sent_codec_data {
                let caps = parser.static_pad("src").unwrap().caps().unwrap();
                let structure = caps.structure(0).unwrap();
                let codec_data = structure.get::<Buffer>("codec_data").unwrap();
                let map = codec_data.map_readable().unwrap();

                let mut bytes = BytesMut::new();
                bytes.put_u8(0); // AVCPacketType 0 = sequence header
                bytes.put_u8(0); // 3-byte composition time of 0
                bytes.put_u8(0);
                bytes.put_u8(0);
                bytes.extend_from_slice(map.as_slice());

                let _ = media_sender.send(RtmpEndpointMediaMessage {
                    stream_key: stream_key.clone(),
                    data: RtmpEndpointMediaData::NewVideoData {
                        data: bytes.freeze(),
                        is_sequence_header: true,
                        is_keyframe: false,
                        timestamp: RtmpTimestamp::new(0),
                        codec: VideoCodec::H264,
                    }
                });

                println!("Sent sequence header");

                sent_codec_data = true;
            }

            let sample = sink.pull_sample().unwrap();
            let buffer = sample.buffer().unwrap();
            let map = buffer.map_readable().unwrap();

            let pts = buffer.pts().unwrap();
            let dts = buffer.dts().unwrap();
            let mut data = BytesMut::new();

            let composition_time = (pts - dts).mseconds();

            data.put_u8(1); // AVCPacketType 1 = AVC NALU
            // 3-byte big-endian composition time
            data.put_u8((composition_time >> 16) as u8);
            data.put_u8((composition_time >> 8) as u8);
            data.put_u8(composition_time as u8);
            data.extend_from_slice(&map);

            let _ = media_sender.send(RtmpEndpointMediaMessage {
                stream_key: stream_key.clone(),
                data: RtmpEndpointMediaData::NewVideoData {
                    data: data.freeze(),
                    codec: VideoCodec::H264,
                    timestamp: RtmpTimestamp::new(dts.mseconds() as u32),
                    is_keyframe: false, // todo: how do I determine this
                    is_sequence_header: false,
                }
            });

            Ok(FlowSuccess::Ok)
        })
        .build(),
)
----------------------

In this code I'm forming the RTMP chunk timestamp from the dts of the resulting buffer and setting the composition time to the difference between pts and dts (the inverse of how I built the original buffer).
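
(For the is_keyframe todo above, I'm guessing I can check the buffer flags on the pulled sample, since encoded delta frames should carry the DELTA_UNIT flag:)

-------------------
// A keyframe is any encoded buffer NOT flagged as a delta unit
let is_keyframe = !buffer.flags().contains(gst::BufferFlags::DELTA_UNIT);
-------------------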

If I remove the `x264enc` step from the pipeline I end up with no delay, which of course makes sense as there's no computational delay from transcoding.  Do I have to do something special to ensure that video players keep the audio and video in sync?