[libnice] Using (abusing?) libnice's event loop for outgoing traffic

Wed May 9 13:50:17 UTC 2018

Hi all,

as you may know, I'm using libnice in Janus, a WebRTC server. The way
it's used right now is with two different threads per ICE agent: one
runs the agent's GMainContext+GMainLoop, and as such is responsible
for notifying the application about incoming events and packets via
the libnice callback; another thread handles outgoing traffic, with a
GAsyncQueue to queue packets, prepare them and shoot them out via
nice_agent_send() calls.

These past few days I've been playing with an attempt to actually put
those two activities together in a single thread, which would simplify
things (e.g., in terms of locking) and, ideally,
optimize resources (we'd only spawn half the threads we do now). To do
so, I decided to try and re-use the agent's event loop for that, and
followed an excellent blog post by Philip Withnall to create my own
GSource for the purpose:
https://tecnocode.co.uk/2015/05/05/a-detailed-look-at-gsource/
This was very helpful, as I ended up doing something very similar,
since I was already using a GAsyncQueue for outgoing media myself.

Anyway, while this "works", the outgoing media is noticeably delayed
when rendered by the receiving browser. I verified that media from
clients does come in at the correct rate and with no delay, by
redirecting the traffic to a monitoring GStreamer pipeline, which
means the outgoing path is somehow to blame. I "tracked" packets both
using Wireshark and log lines (e.g., when they were queued and when
they were dispatched by the GSource), and did notice that some packets
are handled quickly while others are handled much later, and that
there are "holes" when no packet is sent (for ~500ms at times). This
shouldn't be the code failing to keep up with the work (SRTP means
every packet has to be encrypted), as the CPU usage is always quite
low.

At first I thought this could be ascribed to GSource priorities, but
playing with those changed nothing. One thing I noticed, though, was
that the delay was much more apparent in sendonly connections than in
sendrecv ones. This means that, while in an EchoTest demo it is barely
noticeable (even though still worse than with the two threads), in the
VideoRoom demo, where you have unidirectional agents (some used just
to publish media, others just to subscribe), it is much more evident
(~600ms). Considering that the publisher's media is fine (as confirmed
by the GStreamer monitor mentioned above), the only explanation I came
up with was that, while in a bidirectional communication there's a lot
of incoming traffic, on a subscriber's connection you only get
occasional RTCP packets. If there's some sort of timed poll deep in
the libnice code, then with a single thread processing both directions
the event loop might sit waiting for a file descriptor to generate an
event while outgoing packets pile up in the queue.
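
To make the suspected mechanism concrete, here is a minimal sketch in
plain POSIX C, with no GLib or libnice involved. Everything in it is a
hypothetical stand-in: idle_fd plays the role of a socket that gets no
incoming traffic, wake_fd is an explicit wakeup channel, the 500 ms
timeout is an arbitrary figure chosen to resemble the "holes" above,
and a child process stands in for whatever queues an outgoing packet.
It only shows that a loop blocked in poll() does not notice queued work
until an fd fires or the timeout expires, unless something wakes it up.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock in milliseconds. */
static long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

/*
 * Hypothetical model of one loop iteration (not libnice code).
 * A child process "queues" a packet 50 ms after the loop starts
 * polling and, if wake_after_queuing is set, signals wake_fd.
 * Returns how long poll() stayed blocked, in milliseconds.
 */
static long blocked_ms(int wake_after_queuing)
{
    int idle_fd[2], wake_fd[2], rc;
    struct pollfd fds[2];
    struct timespec delay = { 0, 50 * 1000000L };
    long start, elapsed;
    pid_t pid;

    rc = pipe(idle_fd);   /* never written to: the quiet socket */
    assert(rc == 0);
    rc = pipe(wake_fd);   /* written to only on explicit wakeup */
    assert(rc == 0);

    pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        nanosleep(&delay, NULL);
        /* the packet would be pushed on the queue here */
        if (wake_after_queuing && write(wake_fd[1], "x", 1) != 1)
            _exit(1);
        _exit(0);
    }

    fds[0].fd = idle_fd[0];
    fds[0].events = POLLIN;
    fds[1].fd = wake_fd[0];
    fds[1].events = POLLIN;

    start = now_ms();
    poll(fds, 2, 500);    /* stand-in for a timed poll in the loop */
    elapsed = now_ms() - start;

    waitpid(pid, NULL, 0);
    close(idle_fd[0]);
    close(idle_fd[1]);
    close(wake_fd[0]);
    close(wake_fd[1]);
    return elapsed;
}
```

Under these assumptions, blocked_ms(0) should come back close to the
full 500 ms, while blocked_ms(1) should return shortly after the 50 ms
mark. In GLib terms the analogous step would be calling
g_main_context_wakeup() on the agent's context after pushing to the
GAsyncQueue; whether a missing wakeup is really what happens here is
only a guess.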

Does this make sense, and might this indeed be what's happening? Is
what I'm trying to do feasible, or are there better ways to do it
properly, e.g., via nice_agent_recv_messages() instead of the recv
callback? This is my first attempt at using GLib events in a more
"conversational" way than for sporadic events, so I realize I may be
making some rookie mistakes here.

Thanks!
Lorenzo