[libnice] Using (abusing?) libnice's event loop for outgoing traffic

Lorenzo Miniero lminiero at gmail.com
Fri May 11 19:25:49 UTC 2018


2018-05-11 21:10 GMT+02:00 Olivier Crête <olivier.crete at collabora.com>:
> Hi,
>
> Oh then your design totally makes sense.
>
> I always found the documentation of the GLib
> GMainContext/GMainLoop/GSource combo to not be very complete.
>
> A GSource contains 4 things:
> 1. a timeout that can be now, never, or somewhere in between
> 2. 0 or more file descriptors with the conditions to poll for
> 3. possibly child GSources (the callback for a source is called if
> any child is triggered)
> 4. a callback function
>
> A GMainContext just contains a list of GSources. On each iteration, it
> goes over the whole list, takes the smallest timeout and all the fds.
> Then it calls poll() with those fds and that timeout; afterwards it
> goes over the GSources again and calls the callback for any source
> whose timeout has expired or whose FD has something on it.
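>
> To make that concrete, a minimal custom GSource could look roughly
> like this (untested sketch, all names made up):
>
>     typedef struct {
>       GSource parent;
>       GAsyncQueue *queue;    /* packets waiting to be sent */
>     } QueueSource;
>
>     /* prepare: ready immediately if the queue is non-empty;
>      * otherwise contribute no timeout of our own (-1) */
>     static gboolean queue_source_prepare (GSource *source, gint *timeout) {
>       *timeout = -1;
>       return g_async_queue_length (((QueueSource *) source)->queue) > 0;
>     }
>
>     /* check: same condition, evaluated after poll() returns */
>     static gboolean queue_source_check (GSource *source) {
>       return g_async_queue_length (((QueueSource *) source)->queue) > 0;
>     }
>
>     /* dispatch: invoke the user callback attached to the source */
>     static gboolean queue_source_dispatch (GSource *source,
>         GSourceFunc callback, gpointer user_data) {
>       return callback ? callback (user_data) : G_SOURCE_REMOVE;
>     }
>
>     static GSourceFuncs queue_source_funcs = {
>       queue_source_prepare, queue_source_check,
>       queue_source_dispatch, NULL
>     };
>
> You'd create it with g_source_new(&queue_source_funcs,
> sizeof(QueueSource)) and g_source_attach() it to the context.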
>
> So GLib has no special timeout for sockets. The only thing you have to
> be careful with is g_socket_set_timeout(), but it's evil and indeed
> libnice doesn't touch it. Do you wake up the GMainContext when you add
> something to your GAsyncQueue (i.e., call g_main_context_wakeup() from
> your custom source, or set the ready time of the source to 0)?
>
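> For instance (sketch, reusing the QueueSource above):
>
>     /* after pushing a packet, poke the context so a blocking poll()
>      * returns right away instead of sitting on its timeout/fds */
>     static void queue_source_push (QueueSource *qs, gpointer packet) {
>       g_async_queue_push (qs->queue, packet);
>       g_main_context_wakeup (g_source_get_context ((GSource *) qs));
>     }
>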


Amazing, g_main_context_wakeup() did the trick, and it's working
perfectly now! Thanks a ton for the explanation of the GLib loop
internals, and most importantly for the tip that completely turned my
week around :-)
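
For the archives, the fix boiled down to one extra line after the
push (simplified sketch, these aren't the actual names in our code):

    /* queue the outgoing packet, then wake up the agent's loop so
     * our custom GSource is prepared and dispatched right away */
    g_async_queue_push (component->out_messages, pkt);
    g_main_context_wakeup (handle->mainctx);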

Lorenzo


> Olivier
>
>
> On Fri, 2018-05-11 at 20:55 +0200, Lorenzo Miniero wrote:
>> 2018-05-11 19:50 GMT+02:00 Olivier Crête <olivier.crete at collabora.com>:
>> > Hi,
>> >
>> > I have no idea what is going on with your code. Are you doing any
>> > blocking processing in the code that handles what comes out of the
>> > recv callback? You know that you can attach each receive callback
>> > to a separate thread if required?
>> >
>> > If there is a deadlock and it blocks forever while calling send
>> > from the recv callback, I'd be very interested in a stack trace, as
>> > this is definitely supposed to work.
>> >
>>
>>
>> There may have been a misunderstanding... of course you don't know
>> what's happening in the code, and you don't need to. In my original
>> post I only briefly described how I'm refactoring the code, for
>> context, and just to give you a better understanding of what I'm
>> trying to achieve using the GLib event loops.
>>
>> I guess I probably muddied the waters and made everything more
>> confusing in my second post instead, where I described why we stopped
>> sending stuff directly (and so why we can't go the way you
>> suggested), which we stopped doing years ago anyway. There never was
>> a deadlock even there, so nothing for you to investigate or need a
>> trace for. The problem at the time was simpler (and again, not
>> something for you to fix here, I'm just giving context). At the time
>> this is what we did:
>>
>> 1. the libnice recv callback would fire: we decrypted SRTP, and
>> passed the packet to a media plugin via a callback;
>> 2. most media plugins just relay media as it is, so the plugin would
>> iterate on all recipients and ask the Janus core to send the packet
>> to them;
>> 3. the Janus core, still on the same thread that originated 1., would
>> encrypt SRTP and do a nice_agent_send for each of the recipients
>> (sketched below).
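>>
>> In code, step 3. was essentially this fan-out (simplified sketch,
>> libsrtp2-style names, helpers made up), all running on the same
>> thread as the recv callback:
>>
>>     for (GList *l = recipients; l != NULL; l = l->next) {
>>       subscriber *sub = l->data;
>>       char buf[1500];              /* headroom for the SRTP auth tag */
>>       memcpy (buf, pkt->data, pkt->len);
>>       int len = pkt->len;
>>       /* one SRTP encrypt + one send per recipient, per packet */
>>       if (srtp_protect (sub->srtp_ctx, buf, &len) == srtp_err_status_ok)
>>         nice_agent_send (sub->agent, sub->stream_id, sub->component_id,
>>                          len, buf);
>>     }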
>>
>> Under normal circumstances, no issues there. It wouldn't be an issue
>> in 1-1 communications at all. When you have MANY recipients for the
>> same source, though, all those SRTP encrypts take their toll, and
>> since they were being done on the same thread as 1., we kept libnice
>> busy for way too long at times; since this happened for every packet,
>> libnice would fail to catch up with incoming packets, delivery would
>> slowly but steadily fall behind, and eventually packets ended up
>> being dropped because the incoming queue would pile up too much.
>> Nothing broken or flawed in libnice here, we were just following the
>> wrong pattern, and so we changed it.
>>
>> We ended up with a new GAsyncQueue + send thread for each
>> participant. This way, in step 3. the core simply adds a packet to
>> the queue and returns immediately. The thread responsible for 1. is
>> not kept busy, and libnice can keep on receiving packets in a timely
>> fashion, never losing a bit. A dedicated thread goes over the queue
>> and does the SRTP encrypts + nice_agent_send instead.
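>>
>> That dedicated thread is essentially this (simplified sketch, names
>> made up; packet buffers are allocated with headroom for the auth
>> tag):
>>
>>     static gpointer send_thread (gpointer user_data) {
>>       participant *p = user_data;
>>       while (!g_atomic_int_get (&p->destroyed)) {
>>         /* wait on the queue, with a cap so we can check the flag */
>>         packet *pkt = g_async_queue_timeout_pop (p->queue, 500000);
>>         if (pkt == NULL)
>>           continue;
>>         int len = pkt->len;
>>         if (srtp_protect (p->srtp_ctx, pkt->buf, &len) == srtp_err_status_ok)
>>           nice_agent_send (p->agent, p->stream_id, p->component_id,
>>                            len, pkt->buf);
>>         packet_free (pkt);
>>       }
>>       return NULL;
>>     }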
>>
>> What we're now trying to do is get rid of that additional thread,
>> and do the SRTP+send stuff in the same thread libnice is using.
>> Considering there would just be one encryption context here, there
>> wouldn't be the performance impact we had to move away from. The
>> issue is that we're still seeing delays in the delivery, as explained
>> in the first post, and hence my question on whether libnice+GLib may
>> be contributing to that (more on this in my inline answer below).
>>
>>
>> > All that g_socket_create_source() does is add the fd to the list of
>> > FDs to be given to the blocking poll() system call (which is the
>> > heart of g_main_context_iterate(), which is called repeatedly by
>> > g_main_loop_run()).
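>> >
>> > Conceptually it's no more than this (sketch; on_readable would be
>> > a GSocketSourceFunc of yours):
>> >
>> >     /* the source just contributes the socket's fd to the poll()
>> >      * set; the callback fires once the fd becomes readable */
>> >     GSource *src = g_socket_create_source (sock, G_IO_IN, NULL);
>> >     g_source_set_callback (src, (GSourceFunc) on_readable, data, NULL);
>> >     g_source_attach (src, context);
>> >     g_source_unref (src);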
>>
>>
>> Got it. This still means there's a poll involved, though, and polls
>> may have a timeout before they return. Especially for incoming
>> traffic, it's common practice not to return immediately if there's
>> nothing to read, but to do a poll that lasts, say, one second
>> instead. This helps with CPU usage, as you don't have to poll all the
>> time until something happens, but it does keep that thread busy until
>> the poll returns. This is why I was wondering if libnice (or GLib
>> here) was doing something similar: in fact, since in our new "GLib
>> loop does everything" approach the outgoing traffic is a source as
>> well, if the source responsible for incoming traffic is stuck there
>> for a bit longer than expected because there's not much traffic
>> (which is the case for sendonly connections, where you only get some
>> RTCP packets from time to time), then the events associated with the
>> outgoing traffic source may be dispatched later than they should be,
>> causing the delayed delivery. Again, this is just what I thought
>> might be happening from pure observation, since in a sendrecv
>> connection the delay seemed much lower than in a sendonly one, which
>> might be explained by an FD waking up much more often.
>>
>> Now, I'm not familiar with how GLib uses poll() for file descriptors,
>> so I may or may not be on the right track in my investigation. I'll
>> have a look at the code to see how that works: looking at the
>> documentation for g_socket_set_timeout(), it looks like the default
>> timeout is 0 (which means "never time out"), and I don't see it
>> overridden in the libnice code, so I'm not sure what's happening
>> there. I'll try digging some more.
>>
>> Sorry for the long mail, and thanks for your feedback.
>>
>> Have a nice weekend!
>> Lorenzo
>>
>>
>> >
>> > Olivier
>> >
>> > On Fri, 2018-05-11 at 19:41 +0200, Lorenzo Miniero wrote:
>> > > Hi Olivier,
>> > >
>> > > this is what we did originally, but we soon had to change the
>> > > pattern. In fact, it's not just a nice_agent_send call: for each
>> > > outgoing packet we have to do an SRTP encryption, and the same
>> > > source can actually send to many destinations (hundreds) at the
>> > > same time. This ended up blocking, or severely delaying, the
>> > > media plugins in Janus, for which sending a packet is supposed to
>> > > be "asynchronous" no matter how many recipients there are; it
>> > > also caused lost packets on the incoming side, as the original
>> > > trigger for incoming packets in Janus is the libnice recv
>> > > callback, which means that sending to hundreds of recipients was
>> > > done on the libnice loop thread, and that could take forever.
>> > > Besides, libsrtp is not thread safe, and having a unique point
>> > > where SRTP encryption would happen avoided the need for a lock.
>> > >
>> > > This is why we ended up with a dedicated thread per participant,
>> > > which is fine, but I was investigating a way to only use a single
>> > > thread for both in and out. Do you know if there's anything in
>> > > libnice, configurable or not, that may contribute to the problems
>> > > I'm experiencing? I tried looking at the code, but all I could
>> > > find was that apparently g_socket_create_source is used for
>> > > incoming media, and it's not clear from the documentation how it
>> > > works internally (e.g., in terms of polling, how long it polls
>> > > before giving up, etc.).
>> > >
>> > > Lorenzo
>> > >
>> > >
>> > > 2018-05-11 19:31 GMT+02:00 Olivier Crête <olivier.crete at collabora.com>:
>> > > > Hi,
>> > > >
>> > > > I would just get rid of the send thread and the GAsyncQueue
>> > > > completely. Sending in libnice is non-blocking, so you should
>> > > > be able to just call nice_agent_send() at the place where you
>> > > > would normally put the things in a queue.
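>> > > >
>> > > > Something like this, right where you queue things today
>> > > > (sketch):
>> > > >
>> > > >     /* nice_agent_send() doesn't block: it returns the number
>> > > >      * of bytes sent, or -1 on failure */
>> > > >     gint sent = nice_agent_send (agent, stream_id, component_id,
>> > > >                                  len, (const gchar *) buf);
>> > > >     if (sent < 0)
>> > > >       g_warning ("nice_agent_send() failed");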
>> > > >
>> > > > Olivier
>> > > >
>> > > > On Wed, 2018-05-09 at 15:50 +0200, Lorenzo Miniero wrote:
>> > > > > Hi all,
>> > > > >
>> > > > > as you may know, I'm using libnice in Janus, a WebRTC server.
>> > > > > The way it's used right now is with two different threads per
>> > > > > ICE agent: one runs the agent's GMainContext+GMainLoop, and
>> > > > > as such is responsible for notifying the application about
>> > > > > incoming events and packets via the libnice callback; another
>> > > > > thread handles outgoing traffic, with a GAsyncQueue to queue
>> > > > > packets, prepare them, and shoot them out via
>> > > > > nice_agent_send() calls.
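>> > > > >
>> > > > > To give an idea, the per-agent setup is roughly this
>> > > > > (simplified sketch):
>> > > > >
>> > > > >     GMainContext *ctx = g_main_context_new ();
>> > > > >     GMainLoop *loop = g_main_loop_new (ctx, FALSE);
>> > > > >     NiceAgent *agent = nice_agent_new (ctx,
>> > > > >                                        NICE_COMPATIBILITY_RFC5245);
>> > > > >     /* ... streams, candidates, callbacks setup ... */
>> > > > >     nice_agent_attach_recv (agent, stream_id, 1, ctx,
>> > > > >                             recv_cb, user_data);
>> > > > >     /* thread #1 runs the loop, thread #2 pops the GAsyncQueue
>> > > > >      * and calls nice_agent_send() */
>> > > > >     g_main_loop_run (loop);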
>> > > > >
>> > > > > These past few days I've been playing with an attempt to
>> > > > > actually put those two activities together in a single
>> > > > > thread, which would simplify things (e.g., in terms of
>> > > > > locking) and, ideally, optimize resources (we'd only spawn
>> > > > > half the threads we do now). To do so, I decided to try and
>> > > > > re-use the agent's event loop for that, and followed an
>> > > > > excellent blog post by Philip Withnall to create my own
>> > > > > GSource for the purpose:
>> > > > > https://tecnocode.co.uk/2015/05/05/a-detailed-look-at-gsource/
>> > > > > This was very helpful, as I ended up doing something very
>> > > > > similar, since I was already using a GAsyncQueue for outgoing
>> > > > > media myself.
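>> > > > >
>> > > > > The resulting source gets attached to the agent's own
>> > > > > context, more or less like this (simplified sketch, names are
>> > > > > mine; the GSourceFuncs, as in the blog post, just check
>> > > > > whether the GAsyncQueue is non-empty):
>> > > > >
>> > > > >     GSource *src = g_source_new (&queue_source_funcs,
>> > > > >                                  sizeof (QueueSource));
>> > > > >     ((QueueSource *) src)->queue = g_async_queue_ref (out_queue);
>> > > > >     /* the callback encrypts + sends whatever was queued */
>> > > > >     g_source_set_callback (src, (GSourceFunc) send_queued_packets,
>> > > > >                            handle, NULL);
>> > > > >     g_source_attach (src, agent_context);
>> > > > >     g_source_unref (src);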
>> > > > >
>> > > > > Anyway, while this "works", the outgoing media is quite
>> > > > > delayed when rendered by the receiving browser. I verified
>> > > > > that media from clients does come in at the correct rate and
>> > > > > with no delays there, by redirecting the traffic to a
>> > > > > monitoring GStreamer pipeline, which means somehow the
>> > > > > outgoing path is to blame. I "tracked" packets both using
>> > > > > Wireshark and log lines (e.g., when they were queued and when
>> > > > > they were dispatched by the GSource), and did notice that
>> > > > > some packets are handled quickly, while others much later,
>> > > > > and there are "holes" where no packet is sent (for ~500ms at
>> > > > > times). This shouldn't be the code failing to keep up with
>> > > > > the work in time (since there's SRTP, there's encryption
>> > > > > involved for every packet), as the CPU usage is always quite
>> > > > > low.
>> > > > >
>> > > > > At first I thought this could be ascribed to GSource
>> > > > > priorities, but even playing with those, nothing changed. One
>> > > > > thing I noticed, though, was that the delay was much more
>> > > > > apparent in sendonly connections than in sendrecv ones. This
>> > > > > means that, while in an EchoTest demo it is barely noticeable
>> > > > > (even though still worse than with the two threads), in the
>> > > > > VideoRoom demo, where you have unidirectional agents (some
>> > > > > used just to publish media, others just to subscribe), it is
>> > > > > much more evident (~600ms). Considering that the publisher's
>> > > > > media is fine (as confirmed by the GStreamer monitor
>> > > > > mentioned above), the only explanation I came up with was
>> > > > > that, while in a bidirectional communication there's a lot of
>> > > > > incoming traffic, on a subscriber's connection you only get
>> > > > > occasional RTCP packets. If there's some sort of timed poll
>> > > > > deep in the libnice code, then with a single thread
>> > > > > processing both directions this might mean having the event
>> > > > > loop kept busy waiting for a file descriptor to generate an
>> > > > > event, while outgoing packets pile up in the queue.
>> > > > >
>> > > > > Does this make sense, and might this indeed be what's
>> > > > > happening? Is what I'm trying to do feasible, or are there
>> > > > > better ways to do it properly, e.g., via
>> > > > > nice_agent_recv_messages() instead of the recv callback? This
>> > > > > is my first attempt at using GLib events in a more
>> > > > > "conversational" way than for sporadic events, so I realize I
>> > > > > may be making some rookie mistakes here.
>> > > > >
>> > > > > Thanks!
>> > > > > Lorenzo
>> > > > > _______________________________________________
>> > > > > nice mailing list
>> > > > > nice at lists.freedesktop.org
>> > > > > https://lists.freedesktop.org/mailman/listinfo/nice
>> > > >
>> > > > --
>> > > > Olivier Crête
>> > > > olivier.crete at collabora.com
>> >
>> > --
>> > Olivier Crête
>> > olivier.crete at collabora.com
> --
> Olivier Crête
> olivier.crete at collabora.com

