[libnice] Confusion about running threads of signal callbacks

Mon Jun 14 16:04:02 UTC 2021

Sorry, a mix of holidays and priority stuff got in the way...

On 13/05/2021 16.49, Olivier Crête wrote:
> Out of curiosity, what kind of thread races are you seeing?
>

The problem seems to be that the "new-candidate-full" signal gets
emitted from 2 threads at the same time: once from the main context
thread, and another from the GStreamer's nicesrc element thread. And
these collide when trying to access the gclosure that is created by the
glib signal handler (see below for location of the thread race)

This is the ThreadSanitizer stack trace that I can see whenever I set up
an N:N WebRTC session, where N is at least 4 or so. It doesn't happen
when N is lower, I guess because of the nature of the thread race:

   ==================
   WARNING: ThreadSanitizer: data race (pid=11739)
     Read of size 1 at 0x7b14000af162 by thread T15 (mutexes: write M68454):
       #0 closure_invoke_notifiers ../gobject/gclosure.c:294
(libgobject-2.0.so.0+0x16f67)
       #1 g_closure_invoke ../gobject/gclosure.c:816
(libgobject-2.0.so.0+0x16f67)
       #2 signal_emit_unlocked_R ../gobject/gsignal.c:3741
(libgobject-2.0.so.0+0x332aa)
       #3 g_signal_emitv ../gobject/gsignal.c:3227
(libgobject-2.0.so.0+0x3a3ec)
       #4 agent_unlock_and_emit ../../agent/agent.c:217
(libnice.so.10+0xba86)
       #5 kms_ice_base_agent_start_gathering_candidates
/home/ubuntu/kms-omni-build/kms-elements/src/gst-plugins/webrtcendpoint/kmsicebaseagent.c:383
(libkmswebrtcendpointlib.so.6+0x2147f)

Here, #5 is the call to `nice_agent_gather_candidates()`, and thread T15
is the main thread.

That "read of size 1" is conflicting with the following "write of size
4", which happens from the GStreamer nicesrc element thread:

     Previous atomic write of size 4 at 0x7b14000af160 by thread T90
(mutexes: write M892129492666755424):
       #0 __tsan_atomic32_compare_exchange_strong <null>
(libtsan.so+0x81d2c)
       #1 g_closure_unref ../gobject/gclosure.c:611
(libgobject-2.0.so.0+0x15ec6)
       #2 g_closure_invoke ../gobject/gclosure.c:819
(libgobject-2.0.so.0+0x16e64)
       #3 signal_emit_unlocked_R ../gobject/gsignal.c:3741
(libgobject-2.0.so.0+0x332aa)
       #4 g_signal_emitv ../gobject/gsignal.c:3227
(libgobject-2.0.so.0+0x3a3ec)
       #5 agent_unlock_and_emit ../../agent/agent.c:217
(libnice.so.10+0xba86)
       #6 g_main_dispatch ../glib/gmain.c:3337 (libglib-2.0.so.0+0x6bbf5)
       #7 g_main_context_dispatch ../glib/gmain.c:4055
(libglib-2.0.so.0+0x6bbf5)
       #8 g_main_context_iterate ../glib/gmain.c:4131
(libglib-2.0.so.0+0x6c13f)
       #9 g_main_loop_run ../glib/gmain.c:4329 (libglib-2.0.so.0+0x6c59f)
       #10 <null> <null> (libgstnice.so+0x2eb3)
       #11 g_thread_proxy ../glib/gthread.c:826 (libglib-2.0.so.0+0xaa2ef)
       #12 linux_pthread_proxy ../glib/gthread-posix.c:1269
(libglib-2.0.so.0+0xe23d8)
       #13 <null> <null> (libtsan.so+0x2c34f)

I'm still trying to get the debug symbols of libgstnice.so to work
properly. Meanwhile, confirmation that T90 was the nicesrc GStreamer
plugin...

     Thread T90 'nicesrc16:src' (tid=11902, running) created by thread
T12 at:
       #0 pthread_create <null> (libtsan.so+0x5d445)
       #1 g_system_thread_new ../glib/gthread-posix.c:1323
(libglib-2.0.so.0+0xe345b)
       #2 g_thread_new_internal ../glib/gthread.c:931
(libglib-2.0.so.0+0xaaa03)
       #3 g_thread_pool_start_thread ../glib/gthreadpool.c:477
(libglib-2.0.so.0+0xab537)
       #4 g_thread_pool_push ../glib/gthreadpool.c:691
(libglib-2.0.so.0+0xabbfd)
       #5 default_push gst/gsttaskpool.c:122 (libgstreamer-1.5.so.0+0xa1d79)
       #6 kms_webrtc_bundle_connection_src_sync_state_with_parent
/home/ubuntu/kms-omni-build/kms-elements/src/gst-plugins/webrtcendpoint/kmswebrtcbundleconnection.c:109
(libkmswebrtcendpointlib.so.6+0xc898)

where #6 here is a call to
`gst_element_sync_state_with_parent_target_state()`.

This is where most of my confusion comes from. Seems like if libnice was
colliding with itself from the pipeline's element?

Location of the thread race seems to be the closure that is created to
handle emission of signals

     Location is heap block of size 72 at 0x7b14000af140 allocated by
thread T9:
       #0 calloc <null> (libtsan.so+0x2f6b6)
       #1 g_malloc0 ../glib/gmem.c:136 (libglib-2.0.so.0+0x74bac)
       #2 g_closure_new_simple ../gobject/gclosure.c:218
(libgobject-2.0.so.0+0x14e02)
       #3 g_cclosure_new ../gobject/gclosure.c:952
(libgobject-2.0.so.0+0x1755e)
       #4 g_signal_connect_data ../gobject/gsignal.c:2601
(libgobject-2.0.so.0+0x39209)
       #5 kms_ice_nice_agent_new
/home/ubuntu/kms-omni-build/kms-elements/src/gst-plugins/webrtcendpoint/kmsiceniceagent.c:402
(libkmswebrtcendpointlib.so.6+0x2345c)

where #5 is this:

   g_signal_connect (self->priv->agent, "new-candidate-full",
       G_CALLBACK (kms_ice_nice_agent_new_candidate_full), self);

so, nothing more than a signal handler.

This signal handler `kms_ice_nice_agent_new_candidate_full()` was
originally emitting a signal itself, with `g_signal_emit_by_name()`. The
thread race was then happening inside my own code, after
`agent_unlock_and_emit()`, I assume caused by `g_signal_emit_by_name()`
*NOT* being thread-safe.

So what I did (current scenario as of now) was to wrap all my signal
emissions within `g_main_context_invoke_full()`, to make sure that all
my calls to `g_signal_emit_by_name()` happen from within the same
context thread (the one that libnice constructor was passed).

This change did improve things (apparently?): my code wasn't involved
any more in the stack traces from the point where libnice wants to emit.
So, the present scenario is that my application code is not touched
between the `agent_unlock_and_emit()` and the point of the thread race.

I really wonder why the thread data race is always coming from
*nicesrc*. This element is always present in the ThreadSanitizer stack
traces.

Any idea or suggestion?

Kind regards,
Juan