Does it make sense to use the timestamps of individual video frames/RTP buffers applied at the Tx device to detect delay at the Rx device?

vk_gst venkateshkuppan26 at
Tue Nov 6 09:05:51 UTC 2018

Hello everyone,

I asked this as a follow-up question to my previous post, but I think it's
better to make it a separate question so I can explain the problem I am
facing in more detail. Please bear with the long question :)

Here are my pipelines, that transmit live video from iMX6 device to Ubuntu
PC over WiFi:

Tx pipeline: 
v4l2src  fps-n=30 -> h264encode ->  rtph264pay -> rtpbin ->
udpsink(port=5000) -> 
rtpbin.send_rtcp(port=5001) -> rtpbin.recv_rtcp(port=5002) 

Rx pipeline: 
udpsrc(port=5000) -> caps -> rtpbin -> rtph264depay -> h264parse -> 
avdec_h264 -> videosink 
rtpbin.recv_rtcp(port=5001) -> rtpbin.send_rtcp(port=5002) 
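For reference, the two sketches above could look roughly like the following as gst-launch-1.0 commands. This is only a hedged sketch: the encoder element (x264enc here; a hardware encoder such as v4l2h264enc would be used on the i.MX6), the host placeholders RX_IP/TX_IP, and the caps details are my assumptions, not the exact pipelines used.

```shell
# Tx (on the i.MX6; RX_IP = address of the Ubuntu PC):
gst-launch-1.0 rtpbin name=rtpbin \
  v4l2src ! video/x-raw,framerate=30/1 ! x264enc tune=zerolatency ! \
  rtph264pay ! rtpbin.send_rtp_sink_0 \
  rtpbin.send_rtp_src_0 ! udpsink host=RX_IP port=5000 \
  rtpbin.send_rtcp_src_0 ! udpsink host=RX_IP port=5001 sync=false async=false \
  udpsrc port=5002 ! rtpbin.recv_rtcp_sink_0

# Rx (on the Ubuntu PC; TX_IP = address of the i.MX6):
gst-launch-1.0 rtpbin name=rtpbin \
  udpsrc port=5000 \
    caps="application/x-rtp,media=video,clock-rate=90000,encoding-name=H264" ! \
  rtpbin.recv_rtp_sink_0 \
  rtpbin. ! rtph264depay ! h264parse ! avdec_h264 ! autovideosink \
  udpsrc port=5001 ! rtpbin.recv_rtcp_sink_0 \
  rtpbin.send_rtcp_src_0 ! udpsink host=TX_IP port=5002 sync=false async=false
```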

Now, as per my application, I intend to detect the delay in receiving frames
at the Rx device. The delay can be induced by a number of factors, including:
- congestion
- packet loss
- noise, etc. 
Once a delay is detected, I intend to insert an IMU (inertial measurement
unit) frame (a custom visualization) in between the live video frames. For
example, if every 3rd frame is delayed, the video will look like:  
                        V | V | I | V | V | I | V | V | I | V | .....

where V = video frame received and I = IMU frame inserted at the Rx device

1. Hence, as per my application requirements, I must know the timestamp of
the video frame sent from Tx and combine it with the current timestamp at
the Rx device to get the delay:

   frame delay = current time at Rx - timestamp of frame at Tx 
Since I am working at 30 fps, ideally I should expect to receive video
frames at the Rx device every 33 ms. Given that it's WiFi, plus other
delays including encoding/decoding, I understand that this 33 ms precision
is difficult to achieve, and that is perfectly fine for me.  
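The delay computation in point 1 can be sketched as follows. This is only a minimal sketch under two assumptions stated in the text: the Tx and Rx clocks share a common reference (e.g. synced via NTP/PTP), and the Tx send time of each frame is known. All function and constant names are mine.

```python
# Nominal frame period at 30 fps (~33.3 ms), as described above.
FRAME_INTERVAL_MS = 1000.0 / 30

def frame_delay_ms(rx_time_ms, tx_timestamp_ms):
    """frame delay = current time at Rx - timestamp of frame at Tx."""
    return rx_time_ms - tx_timestamp_ms

def is_delayed(rx_time_ms, tx_timestamp_ms, tolerance_ms=FRAME_INTERVAL_MS):
    # A frame counts as "late" once its one-way delay exceeds the nominal
    # frame interval (plus whatever extra network/codec budget you allow
    # via tolerance_ms).
    return frame_delay_ms(rx_time_ms, tx_timestamp_ms) > tolerance_ms
```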

2. Since I am using RTP/RTCP, I had a look into WebRTC, but it caters more
towards sending SR/RR (network statistics) for only a fraction of the data
sent from Tx -> Rx. I also tried the UDP source timeout feature, which
detects that no packets have arrived at the source for a predefined time
and emits a signal notifying of the timeout. However, this works only if
the Tx device stops completely (pipeline stopped using Ctrl+C). If the
packets are merely delayed, the timeout does not occur, since the kernel
buffers some old data.
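One thing worth noting about the SRs mentioned above: an RTCP Sender Report carries a (64-bit NTP timestamp, RTP timestamp) pair, which lets the receiver map any RTP timestamp to Tx wallclock time, even though SRs arrive only occasionally. A sketch of that arithmetic (function names are mine; clock-rate=90000 is taken from the caps shown later in this mail):

```python
CLOCK_RATE = 90000  # from the caps: clock-rate=90000

def ntp_to_seconds(ntp_ts):
    """Convert a 64-bit NTP timestamp (32.32 fixed point) to float seconds."""
    return (ntp_ts >> 32) + (ntp_ts & 0xFFFFFFFF) / 2**32

def rtp_to_tx_wallclock(rtp_ts, sr_rtp_ts, sr_ntp_seconds,
                        clock_rate=CLOCK_RATE):
    """Map an RTP timestamp to Tx wallclock time using the (NTP, RTP)
    pair from the most recent Sender Report."""
    # Signed difference modulo 2**32 handles RTP timestamp wraparound.
    diff = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF
    if diff >= 2**31:
        diff -= 2**32
    return sr_ntp_seconds + diff / clock_rate
```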

I have the following questions : 

1. Does it make sense to use the timestamps of each video frame/RTP buffer
to detect the delay in receiving frames at the Rx device? What would be a
better design for such a use case? Or is it too much overhead to consider
the timestamp of every frame/buffer, and maybe I should only consider
timestamps of a subset of the video frames, e.g. every 5th or every 10th
frame/buffer? Considering the worst case, where every alternate frame is
delayed, the video would have the following sequence:  
                   V | I | V | I | V | I | V | I | V | I | .....
I understand that the precision of every alternate frame can be difficult
to handle, so I am targeting detection and insertion of an IMU frame at
least within 66 ms. The switching between live video frames and inserted
frames is also a concern. I use the OpenGL plugins to do the IMU data
manipulation.

2. Which timestamps should I be considering at the Rx device? To calculate
the delay, I need a common reference between the Tx and Rx devices, which I
do not have. I can access the PTS and DTS of the RTP buffers, but since no
reference is available, I could not use these to detect the delay. Is there
any other way I could do this? 
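One possible shape for that common reference, sketched below: if a single (Tx wallclock, RTP timestamp) pair is ever established (for instance from an RTCP SR), every later buffer's one-way delay follows from its RTP timestamp alone. This still assumes the Rx wallclock is synchronized to the Tx wallclock (e.g. via NTP/PTP); names are mine.

```python
def one_way_delay_s(rx_now_s, rtp_ts, ref_rtp_ts, ref_tx_s,
                    clock_rate=90000):
    """rx_now_s: Rx wallclock now (seconds, synced with Tx).
    (ref_rtp_ts, ref_tx_s): one known RTP-timestamp/Tx-wallclock pair."""
    # Elapsed Tx media time since the reference, modulo 2**32 for wraparound.
    elapsed_tx = ((rtp_ts - ref_rtp_ts) & 0xFFFFFFFF) / clock_rate
    send_time = ref_tx_s + elapsed_tx
    return rx_now_s - send_time
```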

3. My caps have the following parameters (only a few shown): 
application/x-rtp, clock-rate=90000, timestamp-offset=2392035930,
seqnum-offset=23406
Can these be used to calculate a common reference between Tx and Rx? I am
not sure I understand these numbers or how to use them at the Rx device to
get a reference. Any pointers on understanding these parameters? 

4. Are there any other possible approaches for such an application? My idea
above could be impractical, and I am open to suggestions for tackling this
issue. 

Since this is my university project, I have hardly any support available.
It would be great if someone could point me in some direction, be it a
completely new approach or an improvement to the current design. 

Adding links to previous post, which are also related to the same question:





More information about the gstreamer-devel mailing list