[libnice] Using (abusing?) libnice's event loop for outgoing traffic

Lorenzo Miniero lminiero at gmail.com
Fri May 11 18:55:38 UTC 2018


2018-05-11 19:50 GMT+02:00 Olivier Crête <olivier.crete at collabora.com>:
> Hi,
>
> I have no idea what is going on with your code. Are you doing any
> blocking processing in the code that comes out of the recv callback?
> You know that you can attach each receive callback to a separate
> thread if required?
>
> If there is a deadlock and it blocks forever while calling send from
> the recv callback, I'd be very interested in a stack trace, as this is
> definitely supposed to work.
>


There may have been a misunderstanding... of course you don't know
what's happening in the code, and you don't need to. In my original
post I only briefly described how I'm refactoring the code, just to
give you some context and a better understanding of what I'm trying
to achieve with the GLib event loops.

I guess I probably muddied the waters and made everything more
confusing in my second post instead, where I described why, years
ago, we stopped sending stuff directly (and so why we can't go the
way you suggested). There never was a deadlock even then, so there's
nothing for you to investigate or provide a trace for. The problem at
the time was simpler (and again, not something you need to fix here,
I'm just giving context). This is what we did back then (a rough
sketch follows the list):

1. the libnice recv callback would fire: we decrypted SRTP and passed
the packet to a media plugin via a callback;
2. most media plugins just relay media as is, so the plugin would
iterate over all recipients and ask the Janus core to send the packet
to them;
3. the Janus core, still on the same thread that originated step 1,
would encrypt SRTP and call nice_agent_send() for each recipient.
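
Here's that rough sketch, all running on the libnice loop thread. The
session/recipient structures and relay_to_recipients() are
hypothetical placeholders rather than the actual Janus code, and I'm
using the libsrtp 2.x naming; the libsrtp and libnice calls
themselves are the real ones.

#include <string.h>
#include <glib.h>
#include <agent.h>       /* libnice */
#include <srtp2/srtp.h>  /* libsrtp 2.x naming */

/* Hypothetical structures, just for illustration */
typedef struct {
    NiceAgent *agent;
    guint stream_id, component_id;
    srtp_t srtp_out;
} recipient_t;

typedef struct {
    srtp_t srtp_in;
    GList *recipients;    /* list of recipient_t */
} session_t;

static void relay_to_recipients(session_t *session, gchar *buf, int len);

/* 1. the libnice recv callback fires on the agent's loop thread */
static void cb_nice_recv(NiceAgent *agent, guint stream_id,
        guint component_id, guint len, gchar *buf, gpointer user_data) {
    session_t *session = (session_t *)user_data;
    int plen = (int)len;
    if(srtp_unprotect(session->srtp_in, buf, &plen) != srtp_err_status_ok)
        return;
    /* 2. hand the plain RTP packet to the media plugin, which relays it */
    relay_to_recipients(session, buf, plen);
}

/* 3. still on that same thread: encrypt + send for every recipient */
static void relay_to_recipients(session_t *session, gchar *buf, int len) {
    if(len <= 0 || len > 1500)
        return;
    GList *l;
    for(l = session->recipients; l != NULL; l = l->next) {
        recipient_t *r = (recipient_t *)l->data;
        gchar out[1500 + SRTP_MAX_TRAILER_LEN];  /* room for the auth tag */
        memcpy(out, buf, len);
        int olen = len;
        if(srtp_protect(r->srtp_out, out, &olen) != srtp_err_status_ok)
            continue;
        nice_agent_send(r->agent, r->stream_id, r->component_id, olen, out);
    }
}

The callback is the one attached with nice_agent_attach_recv() on the
agent's GMainContext, so everything above runs on the loop thread.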

Under normal circumstances, no issues there: it wouldn't be a problem
in 1-1 communications at all. When you have MANY recipients for the
same source, though, all those SRTP encryptions take their toll, and
since they were done on the same thread as step 1, we kept libnice
busy for way too long at times; since this happened for every packet,
libnice would fail to catch up with incoming packets, delivery would
slowly but steadily fall behind, and eventually packets ended up being
dropped because the incoming queue piled up too much. Nothing broken
or flawed in libnice here, we were just following the wrong pattern,
and so we changed it.

We ended up with a new GAsyncQueue + send thread for each participant.
This way, in step 3 the core simply adds a packet to the queue and
returns immediately. The thread responsible for step 1 is not kept
busy, and libnice can keep receiving packets in a timely fashion,
without dropping anything. A dedicated thread goes over the queue and
does the SRTP encryption + nice_agent_send() instead.
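
In code the pattern looks roughly like this; packet_t, participant_t
and the exit sentinel are hypothetical names (not the actual Janus
ones), and again I'm using the libsrtp 2.x naming:

#include <string.h>
#include <glib.h>
#include <agent.h>       /* libnice */
#include <srtp2/srtp.h>  /* libsrtp 2.x naming */

typedef struct {
    gchar buf[1500 + SRTP_MAX_TRAILER_LEN];  /* room for the SRTP auth tag */
    gint len;
} packet_t;

typedef struct {
    NiceAgent *agent;
    guint stream_id, component_id;
    srtp_t srtp_out;
    GAsyncQueue *queue;
    GThread *thread;
} participant_t;

static packet_t exit_packet;   /* sentinel used to stop the thread */

/* Dedicated per-participant thread: encrypt + send, off the libnice loop */
static gpointer send_thread(gpointer user_data) {
    participant_t *p = (participant_t *)user_data;
    while(TRUE) {
        packet_t *pkt = g_async_queue_pop(p->queue);   /* blocks */
        if(pkt == &exit_packet)
            break;
        int len = pkt->len;
        if(srtp_protect(p->srtp_out, pkt->buf, &len) == srtp_err_status_ok)
            nice_agent_send(p->agent, p->stream_id, p->component_id,
                len, pkt->buf);
        g_free(pkt);
    }
    return NULL;
}

/* Step 3 in the core now just queues the packet and returns right away */
static void core_send(participant_t *p, const gchar *buf, gint len) {
    if(len <= 0 || len > 1500)
        return;
    packet_t *pkt = g_new0(packet_t, 1);
    memcpy(pkt->buf, buf, len);
    pkt->len = len;
    g_async_queue_push(p->queue, pkt);
}

The thread itself is created with something like
g_thread_new("send", send_thread, p), and stopping it is just a
matter of pushing &exit_packet to the queue.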

What we're now trying to do is get rid of that additional thread and
do the SRTP+send work on the same thread libnice is using.
Considering there would be just one encryption context here, we
wouldn't hit the performance problem we had to move away from. The
issue is that we're still seeing delays in the delivery, as explained
in my first post, hence my question on whether libnice+GLib may be
contributing to that (more on this in my inline answer below).
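
For reference, the custom GSource I mentioned in my first post looks
roughly like this, following the MessageQueueSource idea from Philip
Withnall's article; queue_source_t, packet_func_t and
encrypt_and_send are hypothetical names, and the finalize/cleanup
code is omitted:

#include <glib.h>

typedef struct {
    GSource parent;
    GAsyncQueue *queue;   /* outgoing packets for this agent */
} queue_source_t;

typedef gboolean (*packet_func_t)(gpointer pkt, gpointer user_data);

static gboolean queue_source_prepare(GSource *source, gint *timeout) {
    *timeout = -1;
    return g_async_queue_length(((queue_source_t *)source)->queue) > 0;
}

static gboolean queue_source_check(GSource *source) {
    return g_async_queue_length(((queue_source_t *)source)->queue) > 0;
}

static gboolean queue_source_dispatch(GSource *source,
        GSourceFunc callback, gpointer user_data) {
    queue_source_t *qs = (queue_source_t *)source;
    gpointer pkt = g_async_queue_try_pop(qs->queue);
    if(pkt != NULL && callback != NULL)
        return ((packet_func_t)callback)(pkt, user_data);
    return G_SOURCE_CONTINUE;
}

static GSourceFuncs queue_source_funcs = {
    queue_source_prepare, queue_source_check, queue_source_dispatch, NULL
};

/* Attach the source to the same GMainContext the agent's loop runs */
static GSource *attach_send_source(GMainContext *agent_context,
        GAsyncQueue *queue, packet_func_t encrypt_and_send,
        gpointer session) {
    GSource *source = g_source_new(&queue_source_funcs,
        sizeof(queue_source_t));
    ((queue_source_t *)source)->queue = g_async_queue_ref(queue);
    g_source_set_callback(source, (GSourceFunc)encrypt_and_send,
        session, NULL);
    g_source_attach(source, agent_context);
    return source;
}

One caveat with a sketch like this: prepare()/check() are only
re-evaluated when the context iterates, so if packets are pushed to
the queue from a different thread while the loop is blocked in
poll(), a g_main_context_wakeup() on the agent's context after the
push is what gets the source dispatched promptly.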


> All that g_socket_create_source() does is add the fd to the list of FDs
> to be given to the blocking poll() system call (which is the heart of
> g_main_context_iterate() which is called repeatedly by
> g_main_loop_run()).


Got it. This still means there's a poll involved, though, and polls
may have a timeout before they return. Especially for incoming
traffic, it's common practice not to return immediately when there's
nothing to read, but to do a poll that lasts, let's say, one second
instead. This helps with CPU usage, as you don't have to poll all the
time until something happens, but it does keep that thread busy until
the poll returns.

This is why I was wondering whether libnice (or GLib here) was doing
something similar: since in our new "glib loop does everything"
approach the outgoing traffic is a source as well, if the source
responsible for incoming traffic is stuck for a bit longer than
expected because there's not much traffic (which is the case for
sendonly connections, where you only get some RTCP packets from time
to time), then the events associated with the outgoing traffic source
may be dispatched later than they should be, causing the delayed
delivery. Again, this is just what I thought might be happening from
pure observation, since in a sendrecv connection the delay seemed
much lower than in a sendonly one, which might be explained by an FD
waking up much more often.
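
Just to illustrate the kind of timed poll I have in mind (plain
poll(), not libnice/GLib code, and sockfd is just a placeholder):

#include <poll.h>

/* Classic "poll with a timeout" receive step: block up to one second
 * waiting for incoming data, keeping this thread busy until poll()
 * returns */
static void wait_for_packet(int sockfd) {
    struct pollfd pfd = { .fd = sockfd, .events = POLLIN, .revents = 0 };
    int ret = poll(&pfd, 1, 1000);   /* wait up to 1000 ms */
    if(ret > 0 && (pfd.revents & POLLIN)) {
        /* something to read: recv() it and process it */
    }
    /* if ret == 0, we just spent a whole second blocked in poll() */
}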

Now, I'm not familiar with how GLib uses poll() for file descriptors,
so I may or may not be on the right track in my investigation. I'll
have a look at the code to see how that works: looking at the
documentation for g_socket_set_timeout, it looks like the default
timeout is 0 (which means "never time out"), and I don't see it
overridden in the libnice code, so I'm not sure what's happening
there. I'll try digging some more.
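
One thing I may try, just as a debugging aid, is to drive the loop by
hand instead of using g_main_loop_run() and time each iteration,
something along these lines (purely a sketch, nothing libnice
requires):

#include <glib.h>

/* Run the agent's context manually, logging iterations that block
 * for a suspiciously long time */
static void run_loop_instrumented(GMainContext *ctx, volatile gboolean *done) {
    while(!*done) {
        gint64 before = g_get_monotonic_time();
        g_main_context_iteration(ctx, TRUE);   /* may block in poll() */
        gint64 elapsed = g_get_monotonic_time() - before;
        if(elapsed > 100 * G_TIME_SPAN_MILLISECOND)
            g_print("Loop iteration blocked for %" G_GINT64_FORMAT " us\n",
                elapsed);
    }
}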

Sorry for the long mail, and thanks for your feedback.

Have a nice weekend!
Lorenzo


>
> Olivier
>
> On Fri, 2018-05-11 at 19:41 +0200, Lorenzo Miniero wrote:
>> Hi Olivier,
>>
>> this is what we did originally, but we soon had to change the
>> pattern.
>> In fact, it's not just a nice_agent_send call, as for each outgoing
>> packet we have to do an SRTP encryption, and the same source can
>> actually send to many destinations (hundreds) at the same time: this
>> did end up blocking or severely delaying the media
>> plugins in Janus, for which sending a packet is supposed to be
>> "asynchronous" no matter how many recipients there are; it also
>> caused
>> lost packets on the incoming side, as the original trigger for
>> incoming packets in Janus is the libnice recv callback, which means
>> that sending to hundreds of recipients was done on the libnice loop
>> thread, and that could take forever. Besides, libsrtp is not thread
>> safe, and having a single point where SRTP encryption would happen
>> avoided the need for a lock.
>>
>> This is why we ended up with a dedicated thread per participant,
>> which
>> is fine, but I was investigating a way to only use a single
>> thread
>> for both in and out. Do you know if there's anything in libnice,
>> configurable or not, that may contribute to the problems I'm
>> experiencing? I tried looking at the code, but all I could find was
>> that apparently g_socket_create_source is used for incoming media,
>> and
>> it's not clear from the documentation how it works internally (e.g.,
>> in terms of polling, how much it polls before giving up, etc.).
>>
>> Lorenzo
>>
>>
>> 2018-05-11 19:31 GMT+02:00 Olivier Crête <olivier.crete at collabora.com
>> >:
>> > Hi,
>> >
>> > I would just get rid of the send thread and the GAsyncQueue
>> > completely.
>> >   Sending in libnice is non-blocking, so you should be able to just
>> > call the nice_agent_send() at the place where you would normally
>> > put
>> > the things in a queue.
>> >
>> > Olivier
>> >
>> > On Wed, 2018-05-09 at 15:50 +0200, Lorenzo Miniero wrote:
>> > > Hi all,
>> > >
>> > > as you may know, I'm using libnice in Janus, a WebRTC server. The
>> > > way
>> > > it's used right now is with two different threads per ICE agent:
>> > > one
>> > > runs the agent's GMainContext+GMainLoop, and as such is
>> > > responsible
>> > > for notifying the application about incoming events and packets
>> > > via
>> > > the libnice callback; another thread handles outgoing traffic,
>> > > with a
>> > > GAsyncQueue to queue packets, prepare them and shoot them out via
>> > > nice_agent_send() calls.
>> > >
>> > > These past few days I've been playing with an attempt to actually
>> > > put
>> > > those two activities together in a single thread, which would
>> > > simplify
>> > > things (e.g., in terms of locking and other things) and, ideally,
>> > > optimize resources (we'd only spawn half the threads we do now).
>> > > To
>> > > do
>> > > so, I decided to try and re-use the agent's event loop for that,
>> > > and
>> > > followed an excellent blog post by Philip Withnall to create my
>> > > own
>> > > GSource for the purpose:
>> > > https://tecnocode.co.uk/2015/05/05/a-detailed-look-at-gsource/
>> > > This was very helpful, as I ended up doing something very
>> > > similar,
>> > > since I was already using a GAsyncQueue for outgoing media
>> > > myself.
>> > >
>> > > Anyway, while this "works", the outgoing media is quite delayed
>> > > when
>> > > rendered by the receiving browser. I verified that media from
>> > > clients
>> > > does come in at the correct rate and with no delays there, by
>> > > redirecting the traffic to a monitoring gstreamer pipeline, which
>> > > means somehow the outgoing path is to blame. I "tracked" packets
>> > > both
>> > > using wireshark and log lines (e.g., when they were queued and
>> > > when
>> > > they were dispatched by the GSource), and did notice that some
>> > > packets
>> > > are handled fast, while others much later, and there are "holes"
>> > > when
>> > > no packet is sent (for ~500ms at times).  This shouldn't be the
>> > > code
>> > > failing to keep up with the things to do in time (since there's
>> > > SRTP,
>> > > there's encryption involved for every packet), as the CPU usage
>> > > is
>> > > always quite low.
>> > >
>> > > At first I thought this could be ascribed to GSource priorities,
>> > > but
>> > > even playing with those nothing changed. One thing I noticed,
>> > > though,
>> > > was that the delay was noticeably more apparent in sendonly
>> > > connections
>> > > than in sendrecv ones. This means that, while in an EchoTest demo
>> > > it
>> > > is barely noticeable (even though still worse than with the two
>> > > threads), in the VideoRoom demo, where you have monodirectional
>> > > agents
>> > > (some used to just publish media, others just to subscribe), it
>> > > is
>> > > much more evident (~600ms). Considering that the publisher's
>> > > media is
>> > > fine (as confirmed by the gstreamer monitor mentioned above), the
>> > > only
>> > > explanation I came up with was that, while in a bidirectional
>> > > communication there's a lot of incoming traffic, on a
>> > > subscriber's
>> > > connection you only get occasional RTCP packets from time to
>> > > time. In
>> > > case there's some sort of timed poll deep in the libnice code,
>> > > due to
>> > > a single thread processing both directions this might mean having
>> > > the
>> > > event loop kept busy waiting for a file descriptor to generate an
>> > > event, while outgoing packets pile up in the queue.
>> > >
>> > > Does this make sense, and might this indeed be what's happening?
>> > > Is
>> > > what I'm trying to do indeed feasible or are there better ways to
>> > > try
>> > > and do it properly, e.g., via nice_agent_recv_messages instead of
>> > > the
>> > > recv callback? This is my first attempt at using Glib events in a
>> > > more
>> > > "conversational" way than for sporadic events, and so I realize I
>> > > may
>> > > be making some rookie mistakes here.
>> > >
>> > > Thanks!
>> > > Lorenzo
>> > > _______________________________________________
>> > > nice mailing list
>> > > nice at lists.freedesktop.org
>> > > https://lists.freedesktop.org/mailman/listinfo/nice
>> >
>> > --
>> > Olivier Crête
>> > olivier.crete at collabora.com
> --
> Olivier Crête
> olivier.crete at collabora.com

