Reading frame numbers in a local video file with precision

Will McElderry wm-gstreamer at switchd.net
Tue Jan 23 15:37:48 UTC 2024


Hi Ken,


Short(?) version:

1. I think nobody can say what that number is for sure - you'd have to 
look at the pipeline. (see tutorial 11: 
https://gstreamer.freedesktop.org/documentation/tutorials/basic/debugging-tools.html?gi-language=python#getting-pipeline-graphs 
)
2. I'd guess that the number is an average number of units per frame 
that is unlikely to be predictable in advance, so it will not transfer 
to other files (e.g. number of audio samples per frame or number of 
bytes per frame).  You'd be better off using FPS and multiplying by 
time - mainly because FPS is programmatically obtainable and so 
transferable (though not necessarily set correctly by the encoding 
process).
3. If I understand correctly, Nicolas was saying nothing tracks the 
frame index in the low-level data, so higher-level code cannot 
establish the frame index after a seek.
4. The only reliable method seems to be to scan the file to step through 
all frames and build a lookup table (e.g. from frame PTS to index 
number)
5. There are gotchas when attempting to seek to the time from the 
lookup table, but looking up the frame index from the time should work 
well (I imagine!)
6. How do you know you're really at frame 12345?  May be worth double 
checking?


Long version:

Please take all of my comments as opinion, guesses and with a healthy 
dose of uncertainty bounded by my lack of knowledge and experience with 
gstreamer internals!


My understanding is that the meaning of your 'kludge' number depends on 
what's inside the playbin and which sink(s) the playbin sends your query 
to.
Have you inspected the pipeline graph?
https://gstreamer.freedesktop.org/documentation/tutorials/basic/debugging-tools.html?gi-language=python#getting-pipeline-graphs

As you can see (probably have seen?) from:
    
https://gstreamer.freedesktop.org/documentation/gstreamer/gstformat.html?gi-language=python#GstFormat
The meaning of 'DEFAULT' depends on which element receives the request 
(a video stream MAY return frame numbers - but that clearly isn't 
happening in your case; the pipeline is giving you something else, so 
it may not be the video stream that is replying?).
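
As a quick experiment you could ask the playbin and the sink for the 
DEFAULT-format position separately and compare the answers.  This is a 
rough, untested sketch on my part; it assumes you keep a handle to the 
gtkglsink element from your snippet (or fetch it back via the playbin's 
"video-sink" property):

    /* Compare who answers a DEFAULT-format position query.
     * -1 means the query was not answered. */
    gint64 from_playbin = -1, from_videosink = -1;

    gst_element_query_position (data->playbin, GST_FORMAT_DEFAULT,
        &from_playbin);
    gst_element_query_position (gtkglsink, GST_FORMAT_DEFAULT,
        &from_videosink);

    g_print ("DEFAULT position: playbin=%" G_GINT64_FORMAT
        ", gtkglsink=%" G_GINT64_FORMAT "\n",
        from_playbin, from_videosink);

If the two numbers disagree, the playbin's answer is probably not 
coming from the video stream.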

A very _uninformed_ guess would be that your video file has an audio 
stream and what's happening is that the audio sink is replying to the 
query.  In that case 'DEFAULT' would indicate the number of audio 
samples into the file, and the kludge number you are working out would 
be the ratio of audio samples to video frames.  Since audio samples are 
generated at a fixed number of samples per second, that number is 
really just a measure of time.
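
(As a purely illustrative sanity check - I don't know your file's 
actual rates - 48 kHz audio at roughly 30 fps would give 48000 / 30 = 
1600 samples per frame, which is suspiciously close to your 1599.49 
divisor.)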

IF that's not a million miles from what's going on (i.e. the query is 
giving you a number which is basically equivalent to time, but with a 
different scale) you'd probably find it easier to query stream time 
position and FPS, then multiply the two together.  Both of these 
approaches would require a constant frame rate to have any hope of being 
accurate, but you could estimate it and it may be good enough, depending 
on video data and the accuracy you require.  It would certainly be more 
transferable between videos if you query the frame rate!
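
For what it's worth, here is a rough, untested sketch of the "time x 
FPS" estimate.  It reads the negotiated framerate from the sink pad of 
the gtkglsink element in your snippet (any element carrying the raw 
video caps would do) and only makes sense for constant-frame-rate 
material:

    gint64 pos_ns = 0;
    gint fps_n = 0, fps_d = 1;
    GstPad *pad = gst_element_get_static_pad (gtkglsink, "sink");
    GstCaps *caps = pad ? gst_pad_get_current_caps (pad) : NULL;

    /* The framerate fraction lives in the negotiated video caps. */
    if (caps)
      gst_structure_get_fraction (gst_caps_get_structure (caps, 0),
          "framerate", &fps_n, &fps_d);

    if (fps_n > 0 &&
        gst_element_query_position (data->playbin, GST_FORMAT_TIME,
            &pos_ns)) {
      guint64 frame = gst_util_uint64_scale (pos_ns, fps_n,
          (guint64) fps_d * GST_SECOND);
      g_print ("Estimated frame: %" G_GUINT64_FORMAT "\n", frame);
    }

    if (caps)
      gst_caps_unref (caps);
    if (pad)
      gst_object_unref (pad);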

Another possibility is that the number is an 'average number of bytes 
per frame', though the number you've given looks very low for that, 
and I'd still think you'd be better off using time as the 
approximating factor.  A bytes-per-frame average (almost certainly) 
will not be predictable or transfer between files, unless your footage 
is from an indoor scene where nothing ever happens...


I should flag: I think Nicolas has addressed this from a slightly 
different angle in the other thread.
If I understand correctly, he was saying he believes there is no 
information in the MP4 container or H264 stream that tracks the frame 
number.
The inference I make from that is that if a seek occurs, there is no 
component able to "answer the question" of what frame index is 
currently being viewed.  Before the seek there _may_ be a counter 
somewhere that ticks up every frame, but after a seek nothing can 
really *know* what frame number it has ended up at.

Or to phrase it another way: the MP4 container and H264 stream do not 
care about frame numbers, they only care about timing.  You can ask 
them what time the pipeline is at, but not what frame number is being 
processed.  If that information is not available from the container or 
the stream, there is no hope of any higher-level objects getting access 
to it after a seek, so stepping through the stream is the only way to 
keep track of the frame index.

If your video stream is nice and has a constant frame rate, you can use 
that to convert from time to frame index, but otherwise the information 
isn't available to you without scanning the file and building the lookup 
table yourself!




To expand on scanning the video and building a lookup: you can extract 
the (stream time) PTS and frame index by stepping through each frame. 
As I do that I also take a hash of the pixel data - because I like to 
be certain that the pixel data really is what I've asked for!  Some 
other libraries I have used demonstrated that they don't always return 
the same pixels after a seek as they do when stepping through in order, 
so you may want to consider whether to hash frames or just trust the 
metadata...
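
In case it helps, here is a very rough, untested sketch of that "scan 
once, build a table" step.  It decodes the whole file through an 
appsink and prints the frame index, PTS and a pixel hash for each 
frame - storing them in a real table and filling in the file path are 
left to you, and all names here are mine, not from your code.  It 
assumes gst_init() has already been called:

    #include <gst/gst.h>
    #include <gst/app/gstappsink.h>

    GstElement *pipe = gst_parse_launch (
        "uridecodebin uri=file:///path/to/your.mp4 ! videoconvert ! "
        "appsink name=grab sync=false max-buffers=4", NULL);
    GstElement *grab = gst_bin_get_by_name (GST_BIN (pipe), "grab");
    guint64 index = 0;

    gst_element_set_state (pipe, GST_STATE_PLAYING);

    while (TRUE) {
      /* Blocks until the next decoded frame; returns NULL at EOS. */
      GstSample *sample = gst_app_sink_pull_sample (GST_APP_SINK (grab));
      GstBuffer *buf;
      GstMapInfo map;
      gchar *hash;

      if (!sample)
        break;

      buf = gst_sample_get_buffer (sample);
      gst_buffer_map (buf, &map, GST_MAP_READ);
      hash = g_compute_checksum_for_data (G_CHECKSUM_SHA256,
          map.data, map.size);

      /* Record (index, PTS, hash) in your lookup table here. */
      g_print ("%" G_GUINT64_FORMAT " pts=%" GST_TIME_FORMAT " %s\n",
          index, GST_TIME_ARGS (GST_BUFFER_PTS (buf)), hash);

      g_free (hash);
      gst_buffer_unmap (buf, &map);
      gst_sample_unref (sample);
      index++;
    }

    gst_element_set_state (pipe, GST_STATE_NULL);
    gst_object_unref (grab);
    gst_object_unref (pipe);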


Once you have the lookup, you can use it to either:
    (usage 1) 'seek to the frame' by looking up the corresponding time 
and seeking to that time
    or (usage 2) identify which frame index the pipeline is operating on 
by requesting the buffer's PTS (or frame hash) and looking that up to 
get the frame index (rough sketch below)
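
For (usage 2), a rough, untested sketch: read the most recent buffer 
shown by the sink via the standard basesink "last-sample" property and 
look its PTS up in your table.  'gtkglsink' is the element from your 
snippet; lookup_index_by_pts() is a made-up stand-in for however you 
store the table:

    GstSample *sample = NULL;

    g_object_get (gtkglsink, "last-sample", &sample, NULL);
    if (sample) {
      GstBuffer *buf = gst_sample_get_buffer (sample);
      /* lookup_index_by_pts() is hypothetical - your table lookup. */
      guint64 frame = lookup_index_by_pts (GST_BUFFER_PTS (buf));
      g_print ("Current frame (from table): %" G_GUINT64_FORMAT "\n",
          frame);
      gst_sample_unref (sample);
    }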


I've been away a while as I've been exploring exactly how (usage 1) 
would work: 'seeking to the corresponding time' is more complex than it 
sounds.  The corresponding time isn't the frame's PTS, or necessarily 
just before it as one would expect, but depends on the type of frame 
(I-frame/P-frame) and on the two prior frames' PTSs and durations as 
well.
I'll post full details in the other thread "soon", to make sure I'm not 
misunderstanding the evidence and maybe to help anyone else with the 
same usage.

Your current question is about identifying the frame index (usage 2) 
from the pipeline's state, so the approach Nicolas suggests may well 
work for this (I haven't seen anything in my testing that would suggest 
an issue with the approach).


Finally, I have to admit I'm a little surprised if your code really is 
seeking to frame '12345' as you expect: specifically, the 
'GST_SEEK_FLAG_KEY_UNIT' flag suggests it will seek to a key frame near 
frame '12345', which is probably not 12345 itself unless you are quite 
lucky (that may be why you chose that frame, but I suspect it's just a 
placeholder?), or maybe something else I don't understand is happening?
I'd encourage you to build the lookup containing buffer.pts and frame 
index, then confirm that the first frame after your seek yields the 
buffer.pts you expect for your target frame number, so you can be sure 
you're getting the accuracy you need.
Having written all that,  I've been surprised before, so I won't mind 
being shown to be wrong this time!  I just wanted to make you aware of 
my thoughts so you can consider if you want to double check or help 
correct my misunderstanding.

I hope there are some ideas in there that help you move forward!


All the best,

Will.


On 2024-01-17 19:55, Kenneth Feingold via gstreamer-devel wrote:
> Hi Will,
> Thanks very much for working with me on this, to whatever limited
> extent is possible :-)
> My application has a very simple pipeline. It uses playbin to show a
> video file in a gtkgl window:
> 
>   data.playbin = gst_element_factory_make ("playbin", "playbin");
>   videosink = gst_element_factory_make ("glsinkbin", "glsinkbin");
>   gtkglsink = gst_element_factory_make ("gtkglsink", "gtkglsink");
> 
>  When I seek to a frame like this:
> 
> gst_element_seek_simple (data->playbin, GST_FORMAT_DEFAULT,
> GST_SEEK_FLAG_FLUSH |GST_SEEK_FLAG_KEY_UNIT, 12345);
>             send_seek_event (data);
> 
> /*and then in my seek function:*/
> gst_element_send_event (data->video_sink, seek_event);
> 
> It actually takes me to precise frame #12345.
> 
> But, when I try to retrieve frame numbers while playing the file:
> /* Query the current position in frames */
>   if (gst_element_query_position (data->playbin, GST_FORMAT_DEFAULT,
> &current)) {
>     framenum=(current/1599.49);
>     g_print ("Current Frame: %ld\n", framenum);
>   }
> 
> I need to use that specific (kludge) factor
> "framenum=(current/1599.49);" in order to get *close* to the app
> giving me the right frame number. What is curious is that with a
> different video file having a different length I need to use a
> different divisor to get the "right" value. Is this related to media
> duration/stream time?
> 
> I am working with mp4 and mov files, and I am wondering if compression
> between frames is a factor? Would uncompressed video yield greater
> accuracy?
> 
> And, as I was wondering in my earlier post here, what values do you
> think this:
> gst_element_query_position (data->playbin, GST_FORMAT_DEFAULT,
> &current))
> 
> is giving me (without my kludge)?
> 
> Thanks again!
> Ken

