[libnice] Using (abusing?) libnice's event loop for outgoing traffic

Olivier Crête olivier.crete at collabora.com
Fri May 11 19:10:51 UTC 2018


Hi,

Oh then your design totally makes sense.

I've always found the documentation of the GLib
GMainContext/GMainLoop/GSource combo to be not very complete.

A GSource contains 4 things:
1. a timeout, which can be now, never, or somewhere in between
2. zero or more file descriptors, with the conditions to poll for
3. possibly child GSources (the callback for a source is called if
any child is triggered)
4. a callback function
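
Mapped onto the GSource API, those four ingredients look roughly like
this (just an illustration, not libnice code; my_source_funcs, fd,
child_src, my_callback, user_data and context are placeholders):

    GSource *src = g_source_new (&my_source_funcs, sizeof (GSource));

    /* 1. the timeout, as an absolute monotonic "ready time"
     *    (-1 = never, 0 = ready right away) */
    g_source_set_ready_time (src,
        g_get_monotonic_time () + 10 * G_TIME_SPAN_MILLISECOND);

    /* 2. a file descriptor with the conditions to poll for */
    g_source_add_unix_fd (src, fd, G_IO_IN | G_IO_ERR);

    /* 3. a child source: a triggered child dispatches the parent too */
    g_source_add_child_source (src, child_src);

    /* 4. the callback invoked when the source is dispatched */
    g_source_set_callback (src, my_callback, user_data, NULL);

    g_source_attach (src, context);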

A GMainContext just contains a list of GSources. On each iteration, it
goes over the whole list, takes the smallest timeout and all the fds.
Then it calls poll() with those fds and that timeout; afterwards it
goes over the GSources again and calls the callback for any source
whose timeout has expired or whose fd has something pending.
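
In other words, g_main_loop_run() essentially boils down to this
(simplified sketch, not the actual GLib implementation):

    while (g_main_loop_is_running (loop))
      g_main_context_iteration (context, TRUE);  /* may block in poll() */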

So GLib has no special timeout for sockets. The only thing you have to
be careful with is g_socket_set_timeout(), but it's evil and indeed
libnice doesn't touch it. Do you wake up the GMainContext when you add
something to your GAsyncQueue (i.e. call g_main_context_wakeup() from
your custom source, or set the ready time of the source to 0)?
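
For example, something along these lines on the sending side
(packet_source and queue being your custom GSource and GAsyncQueue;
just a sketch, nothing libnice-specific):

    /* called from whatever thread produces outgoing packets */
    g_async_queue_push (queue, packet);

    /* Marking the source as ready "now" interrupts the context's
     * poll() so the packet is dispatched on the next iteration;
     * calling g_main_context_wakeup (g_source_get_context
     * (packet_source)) would achieve the same wake-up. */
    g_source_set_ready_time (packet_source, 0);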

Olivier


On Fri, 2018-05-11 at 20:55 +0200, Lorenzo Miniero wrote:
> 2018-05-11 19:50 GMT+02:00 Olivier Crête <olivier.crete at collabora.com
> >:
> > Hi,
> > 
> > I have no idea what is going on with your code. Are you doing any
> > blocking processing in the code that runs from the recv callback?
> > You know that you can attach each receive callback to a separate
> > thread if required?
> > 
> > If there is a deadlock and it blocks forever while calling send
> > from the recv callback, I'd be very interested in a stack trace, as
> > this is definitely supposed to work.
> > 
> 
> 
> There may have been a misunderstanding... of course you don't know
> what's happening in the code, and you don't need to. In my original
> post I only briefly described how I'm refactoring the code for
> context, and just to give you a better understanding of what I'm
> trying to achieve using the Glib event loops.
> 
> I guess I probably muddied the waters and made everything more
> confusing in my second post instead, where I described why we stopped
> sending stuff directly (and so why we can't go the way you
> suggested), which we stopped doing years ago anyway. There never was
> a deadlock even then, so there's nothing for you to investigate and
> no trace you need. The problem at the time was simpler (and again,
> not something for you to fix here, I'm just giving context). At the
> time this is what we did:
> 
> 1. libnice recv callback would fire: we decrypted SRTP, and passed
> the packet to a media plugin via a callback;
> 2. most media plugins just relay media as it is, so this media plugin
> would iterate on all recipients and ask the Janus core to send the
> packet to them;
> 3. the Janus core, still on the same thread that originated 1., would
> encrypt SRTP and do a nice_agent_send for each of the recipients.
> 
> Under normal circumstances, no issues there. It wouldn't be an issue
> in 1-1 communications at all. When you have MANY recipients for the
> same source, though, all those SRTP encryptions take their toll, and
> since they were being done on the same thread as 1., it meant we kept
> libnice busy for way too long at times; since this happened for every
> packet, libnice would fail to catch up with incoming packets,
> delivery would slowly but steadily fall behind, and eventually
> packets ended up being dropped because the incoming queue would pile
> up too much. Nothing broken or flawed in libnice here, we were just
> following the wrong pattern, and so we changed it.
> 
> We ended up with a new GAsyncQueue + send thread for each
> participant. This way, in step 3. the core simply adds a packet to
> the queue and returns immediately. The thread responsible for 1. is
> not kept busy, and libnice can keep on receiving packets in a timely
> fashion, never losing a bit. A dedicated thread goes over the queue
> and does the SRTP encryption + nice_agent_send instead.
> 
> What we're now trying to do is get rid of that additional thread, and
> do the SRTP+send stuff in the same thread libnice is using.
> Considering there would just be one encryption context here, there
> wouldn't be the performance impact we had to move away from. The
> issue here is that we're still seeing delays in the delivery, as
> explained in the first post, and hence my question on whether
> libnice+glib may be contributing to that (more on this in my inline
> answer below).
> 
> 
> > All that g_socket_create_source() does is add the fd to the list
> > of FDs to be given to the blocking poll() system call (which is the
> > heart of g_main_context_iterate(), which is called repeatedly by
> > g_main_loop_run()).
> 
> 
> Got it. This still means there's a poll involved, though, and polls
> may have a timeout before they return. Especially for incoming
> traffic, it's common practice not to just return immediately if
> there's nothing to read, but to do a poll that lasts, let's say, one
> second instead. This helps with CPU usage, as you don't have to poll
> all the time until something happens, but it does keep that thread
> busy until the poll returns. This is why I was wondering if libnice
> (or glib here) was doing something similar: in fact, since in our new
> "glib loop does everything" approach the outgoing traffic is a source
> as well, if the source responsible for incoming traffic is stuck
> there for a bit longer than expected because there's not much traffic
> (which is the case for sendonly connections, where you only get some
> RTCP packets from time to time), then the events associated with the
> outgoing traffic source may be dispatched later than they should, and
> cause the delayed delivery. Again, this is just what I thought might
> be happening from pure observation, since in a sendrecv connection
> the delay seemed much lower than in a sendonly one, which might be
> explained by an FD waking up much more often.
> 
> Now, I'm not familiar with how glib uses poll() for file descriptors,
> so I may or may not be on the right track in my investigation. I'll
> have a look at the code to see how that works: looking at the
> documentation for g_socket_set_timeout, it looks like the default
> timeout is 0 (which means "never timeout"), and I don't see it
> overridden in the libnice code, so I'm not sure what's happening
> there. I'll try digging some more.
> 
> Sorry for the long mail, and thanks for your feedback.
> 
> Have a nice weekend!
> Lorenzo
> 
> 
> > 
> > Olivier
> > 
> > On Fri, 2018-05-11 at 19:41 +0200, Lorenzo Miniero wrote:
> > > Hi Olivier,
> > > 
> > > this is what we did originally, but we soon had to change the
> > > pattern.
> > > In fact, it's not just a nice_agent_send call, as for each
> > > outgoing
> > > packet we have to do an SRTP encryption, and the same source can
> > > actually send to many destinations (hundreds) at the same time:
> > > this
> > > did end up being blocking or a source of severe delays for the
> > > media
> > > plugins in Janus, for which sending a packet is supposed to be
> > > "asynchronous" no matter how many recipients there are; it also
> > > caused
> > > lost packets on the incoming side, as the original trigger for
> > > incoming packets in Janus is the libnice recv callback, which
> > > means
> > > that sending to hundreds of recipients was done on the libnice
> > > loop
> > > thread, and that could take forever. Besides, libsrtp is not
> > > thread
> > > safe, and having a unique point where SRTP encryption would
> > > happen
> > > avoided the need for a lock.
> > > 
> > > This is why we ended up with a dedicated thread per participant,
> > > which
> > > is fine, but I was looking for a way to use only a single thread
> > > for both in and out. Do you know if there's anything in libnice,
> > > configurable or not, that may contribute to the problems I'm
> > > experiencing? I tried looking at the code, but all I could find
> > > was
> > > that apparently g_socket_create_source is used for incoming
> > > media,
> > > and
> > > it's not clear from the documentation how it works internally
> > > (e.g.,
> > > in terms of polling, how much it polls before giving up, etc.).
> > > 
> > > Lorenzo
> > > 
> > > 
> > > 2018-05-11 19:31 GMT+02:00 Olivier Crête <olivier.crete at collabora
> > > .com
> > > > :
> > > > Hi,
> > > > 
> > > > I would just get rid of the send thread and the GAsyncQueue
> > > > completely. Sending in libnice is non-blocking, so you should
> > > > be able to just call nice_agent_send() at the place where you
> > > > would normally put the things in a queue.
> > > > 
> > > > Olivier
> > > > 
> > > > On Wed, 2018-05-09 at 15:50 +0200, Lorenzo Miniero wrote:
> > > > > Hi all,
> > > > > 
> > > > > as you may know, I'm using libnice in Janus, a WebRTC server.
> > > > > The
> > > > > way
> > > > > it's used right now is with two different threads per ICE
> > > > > agent:
> > > > > one
> > > > > runs the agent's GMainContext+GMainLoop, and as such is
> > > > > responsible
> > > > > for notifying the application about incoming events and
> > > > > packets
> > > > > via
> > > > > the libnice callback; another thread handles outgoing
> > > > > traffic,
> > > > > with a
> > > > > GAsyncQueue to queue packets, prepare them and shoot them out
> > > > > via
> > > > > nice_agent_send() calls.
> > > > > 
> > > > > These past few days I've been playing with an attempt to
> > > > > actually
> > > > > put
> > > > > those two activities together in a single thread, which would
> > > > > simplify
> > > > > things (e.g., in terms of locking and other things) and,
> > > > > ideally,
> > > > > optimize resources (we'd only spawn half the threads we do
> > > > > now).
> > > > > To
> > > > > do
> > > > > so, I decided to try and re-use the agent's event loop for
> > > > > that,
> > > > > and
> > > > > followed an excellent blog post by Philip Withnall to create
> > > > > my
> > > > > own
> > > > > GSource for the purpose:
> > > > > https://tecnocode.co.uk/2015/05/05/a-detailed-look-at-gsource/
> > > > > This was very helpful, as I ended up doing something very
> > > > > similar,
> > > > > since I was already using a GAsyncQueue for outgoing media
> > > > > myself.
> > > > > 
> > > > > Anyway, while this "works", the outgoing media is quite
> > > > > delayed
> > > > > when
> > > > > rendered by the receiving browser. I verified that media from
> > > > > clients
> > > > > does come in at the correct rate and with no delays there, by
> > > > > redirecting the traffic to a monitoring gstreamer pipeline,
> > > > > which
> > > > > means somehow the outgoing path is to blame. I "tracked"
> > > > > packets
> > > > > both
> > > > > using wireshark and log lines (e.g., when they were queued
> > > > > and
> > > > > when
> > > > > they were dispatched by the GSource), and did notice that
> > > > > some
> > > > > packets
> > > > > are handled fast, while others much later, and there are
> > > > > "holes"
> > > > > when
> > > > > no packet is sent (for ~500ms at times).  This shouldn't be
> > > > > the
> > > > > code
> > > > > failing to keep up with the work in time (since there's SRTP,
> > > > > encryption is involved for every packet), as the CPU
> > > > > usage
> > > > > is
> > > > > always quite low.
> > > > > 
> > > > > At first I thought this could be ascribed to GSource
> > > > > priorities,
> > > > > but
> > > > > even playing with those nothing changed. One thing I noticed,
> > > > > though,
> > > > > was that the delay was much more apparent in sendonly
> > > > > connections
> > > > > than in sendrecv ones. This means that, while in an EchoTest
> > > > > demo
> > > > > it
> > > > > is barely noticeable (even though still worse than with the
> > > > > two
> > > > > threads), in the VideoRoom demo, where you have
> > > > > monodirectional
> > > > > agents
> > > > > (some used to just publish media, others just to subscribe),
> > > > > it
> > > > > is
> > > > > much more evident (~600ms). Considering that the publisher's
> > > > > media is
> > > > > fine (as confirmed by the gstreamer monitor mentioned above),
> > > > > the
> > > > > only
> > > > > explanation I came up with was that, while in a bidirectional
> > > > > communication there's a lot of incoming traffic, on a
> > > > > subscriber's
> > > > > connection you only get occasional RTCP packets from time to
> > > > > time. If there's some sort of timed poll deep in the libnice
> > > > > code, then with a single thread processing both directions
> > > > > this might mean the event loop is kept busy waiting for a
> > > > > file descriptor to generate an event, while outgoing packets
> > > > > pile up in the queue.
> > > > > 
> > > > > Does this make sense, and might this indeed be what's
> > > > > happening?
> > > > > Is
> > > > > what I'm trying to do indeed feasible or are there better
> > > > > ways to
> > > > > try
> > > > > and do it properly, e.g., via nice_agent_recv_messages
> > > > > instead of
> > > > > the
> > > > > recv callback? This is my first attempt at using Glib events
> > > > > in a
> > > > > more
> > > > > "conversational" way than for sporadic events, and so I
> > > > > realize I
> > > > > may
> > > > > be doing some rookie mistakes here.
> > > > > 
> > > > > Thanks!
> > > > > Lorenzo
> > > > > _______________________________________________
> > > > > nice mailing list
> > > > > nice at lists.freedesktop.org
> > > > > https://lists.freedesktop.org/mailman/listinfo/nice
> > > > 
> > > > --
> > > > Olivier Crête
> > > > olivier.crete at collabora.com
> > 
> > --
> > Olivier Crête
> > olivier.crete at collabora.com
-- 
Olivier Crête
olivier.crete at collabora.com

