webrtcdsp voice detection
Dejan.Cotra at nttdata.com
Tue Apr 19 15:23:08 UTC 2022
Thanks once again, that was very helpful. I managed to retrieve the ROI meta from video frames.
I have one more question, about GstAudioLevelMeta and its voice_activity boolean.
I have some simple code with this pipeline:

pipeline = Gst.parse_launch(
    "directsoundsrc ! "
    "level audio-level-meta=true post-messages=false message=false name=elLev ! "
    "webrtcdsp name=dsp echo-cancel=false gain-control=false "
    "high-pass-filter=false limiter=false noise-suppression=false "
    "voice-detection=true voice-detection-likelihood=high "
    "voice-detection-frame-size-ms=10 ! "
    "appsink name=sink emit-signals=true")
appsink = pipeline.get_by_name("sink")
appsink.connect("new-sample", on_buffer, None)
def on_buffer(sink: GstApp.AppSink, data: typ.Any) -> Gst.FlowReturn:
    sample = sink.emit("pull-sample")  # Gst.Sample
    buffer = sample.get_buffer()       # Gst.Buffer
    ll = GstAudio.buffer_get_audio_level_meta(buffer)
    if ll is not None:  # the meta may be absent on some buffers
        text_file.write(str(ll.voice_activity) + '\n')
    return Gst.FlowReturn.OK
If I use the bus message from the dsp element to check for voice activity, I get more or less correct results: while I'm speaking stream-has-voice is true, and when I stop it changes to false.
But if I use the audio level meta, I never get two consecutive buffers with voice_activity set to True.
If I start speaking I get one buffer with voice_activity set to True and the next one is False, and it stays False until I stop speaking and start again; then I again get one buffer with True, and the next one is False again.
Am I missing something here, or is this the expected behavior?
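(For illustration only, here is a plain-Python sketch, with nothing GStreamer-specific, of one way to debounce a flickering per-buffer voice_activity flag into a stream-level "has voice" state. The hold_buffers value is my own assumption, not a webrtcdsp parameter.)

```python
# Sketch: smooth per-10ms voice_activity flags into a stream-level state.
# A single True turns the state on; the state only turns off after
# hold_buffers consecutive False flags (hold_buffers is an assumed knob).

def debounce_voice(flags, hold_buffers=10):
    state = False   # current stream-level "has voice" state
    quiet = 0       # consecutive buffers without voice activity
    out = []
    for active in flags:
        if active:
            state = True
            quiet = 0
        else:
            quiet += 1
            if quiet >= hold_buffers:
                state = False
        out.append(state)
    return out

# One isolated True followed by silence: the state holds for 10 buffers.
print(debounce_voice([True] + [False] * 12, hold_buffers=10))
```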
From: Nicolas Dufresne <nicolas at ndufresne.ca>
Sent: Friday, April 15, 2022 7:54 PM
To: Dejan Cotra <Dejan.Cotra at nttdata.com>
Cc: Discussion of the development of and with GStreamer <gstreamer-devel at lists.freedesktop.org>
Subject: Re: webrtcdsp voice detection
On Fri., Apr. 15, 2022, 08:47, Dejan Cotra <Dejan.Cotra at nttdata.com> wrote:
Thank you that was very helpful.
I have one similar question. I have also been playing around with the facedetect element. I know that I can retrieve information about faces from the bus message emitted by the facedetect element.
Is there a way to retrieve information about faces from the video frame's meta? Something similar to voice_activity in GstAudioLevelMeta?
I'm currently away from a real computer to check the code, but the "in-band" way is GstVideoRegionOfInterest. ROIs are rectangles with a type (a simple string), so facedetect should be adding ROI meta to the frames.
A typical use case is to detect these with a pad probe and set a qp-delta (not sure of the name) so that capable encoders can be told to work harder on the details of that rectangle.
This could also be used for other purposes. Note that the ONNX plugins (the ML plugins we have) tend to prefer more shapes than just rectangles, so they have their own meta.
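(As an aside, a plain-Python sketch of the kind of rectangle post-processing a pad probe might do on facedetect ROIs before handing them to an encoder: grow each box by a margin and clamp it to the frame. The (x, y, w, h) tuple format mirrors the fields of GstVideoRegionOfInterestMeta; the margin value and function name are my own, hypothetical.)

```python
# Sketch: pad ROI rectangles by a margin and clamp them to frame bounds.
# Rectangles are (x, y, w, h), as in GstVideoRegionOfInterestMeta.

def expand_and_clamp(roi, frame_w, frame_h, margin=16):
    x, y, w, h = roi
    left = max(0, x - margin)
    top = max(0, y - margin)
    right = min(frame_w, x + w + margin)
    bottom = min(frame_h, y + h + margin)
    return (left, top, right - left, bottom - top)

# A 64x64 face box grown by 16 px on each side, well inside a 640x480 frame.
print(expand_and_clamp((100, 80, 64, 64), 640, 480))
```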
From: Nicolas Dufresne <nicolas at ndufresne.ca>
Sent: Friday, April 8, 2022 3:31 PM
To: Discussion of the development of and with GStreamer <gstreamer-devel at lists.freedesktop.org>
Cc: Dejan Cotra <Dejan.Cotra at nttdata.com>
Subject: Re: webrtcdsp voice detection
On Friday, April 8, 2022, at 11:10 +0000, Dejan Cotra via gstreamer-devel wrote:
> I know that I can retrieve information from webrtcdsp voice detection
> via bus messages. I receive a GST_MESSAGE_ELEMENT message from the webrtcdsp
> element with a payload like this:
> voice-activity, stream-time=(guint64)2640000000, stream-has-
> The question is: can I retrieve information about voice detection in some
> other way, like metainfo of each sample that I pull from the appsink
> element? Or something similar?
It also sets the voice_activity boolean in GstAudioLevelMeta (along with the audio amplitude). This is per buffer, not per sample, so you get feedback every 10 ms, more or less.
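(A quick back-of-the-envelope sketch of what "per buffer, every 10 ms" means in samples; the sample rates below are just common examples, not anything the element pins down.)

```python
# Sketch: how many audio samples a 10 ms frame holds at common rates.
# One GstAudioLevelMeta would be attached per such buffer.

FRAME_MS = 10

def samples_per_frame(rate_hz, frame_ms=FRAME_MS):
    return rate_hz * frame_ms // 1000

for rate in (16000, 44100, 48000):
    print(rate, "Hz ->", samples_per_frame(rate), "samples per 10 ms buffer")
```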