[libnice] Connection dropped on Windows while fast retransmitting in reliable agent.

Radosław Kołodziejczyk radek.kolodziejczyk at gmail.com
Wed Nov 12 06:15:29 PST 2014


Hello,

We've been experiencing a lot of connection drops caused by an ECONNABORTED
error in the reliable agent while sending large amounts of data (large file
transfers). This was critical for us, so I did some digging, and here's what
I've gathered.

The connection gets aborted in agent/pseudotcp.c (process()) while doing a
fast retransmit. Specifically here:

    if (!transmit(self, g_queue_peek_head (&priv->slist), now)) {
      closedown(self, ECONNABORTED);
      return FALSE;
    }

Now, this might look OK, but the problem is that the transmit function
does not account for EAGAIN or EWOULDBLOCK errors (which are not really
fatal), and this created a very peculiar race condition that gave us a
headache. Enable extended logs: problem gone. Add some sleep: problem gone.
The reason is that if the fast retransmit happens while the socket is
(temporarily) busy, our connection goes down. If you give the socket a tiny
bit of time to flush, the problem goes away.
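
To illustrate the distinction I mean, here is a minimal, self-contained sketch of a fast-retransmit step that treats "would block" as a reason to retry later rather than a reason to close. All names in it (TxResult, Conn, fast_retransmit) are hypothetical and made up for illustration; this is not how pseudotcp.c is actually structured.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical three-way transmit result; not libnice's actual enum. */
typedef enum {
  TX_OK,           /* segment was handed to the socket */
  TX_WOULD_BLOCK,  /* socket temporarily busy (EAGAIN / EWOULDBLOCK) */
  TX_FATAL         /* any other error */
} TxResult;

typedef enum { CONN_OPEN, CONN_CLOSED } ConnState;

/* Hypothetical, heavily reduced connection state. */
typedef struct {
  ConnState state;
  bool retransmit_pending;  /* set when a retransmit must be tried again */
} Conn;

/* Handle the outcome of one fast-retransmit attempt.
 * Returns false only when the connection was closed. */
static bool
fast_retransmit (Conn *conn, TxResult result)
{
  assert (conn != NULL);

  switch (result) {
    case TX_OK:
      conn->retransmit_pending = false;
      return true;
    case TX_WOULD_BLOCK:
      /* Not fatal: keep the connection and retry on a later pass. */
      conn->retransmit_pending = true;
      return true;
    case TX_FATAL:
    default:
      conn->state = CONN_CLOSED;
      return false;
  }
}
```

With a split like this, only a genuinely fatal result would end in the ECONNABORTED path; a busy socket would just leave the retransmit pending.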

So, digging deeper, this problem was coming from:

agent/pseudotcp.c (transmit()):

    /* Write out the packet. */
    wres = packet(self, seq, flags,
        segment->seq - priv->snd_una, nTransmit, now);

agent/pseudotcp.c (packet()):

    wres = priv->callbacks.WritePacket(self, (gchar *) buffer.u8,
                                       len + HEADER_SIZE,
                                       priv->callbacks.user_data);

agent/agent.c (pseudo_tcp_socket_write_packet()):

    if (nice_socket_send (sock, addr, len, buffer))
      return WR_SUCCESS;

socket/socket.c (nice_socket_send()):

    ret = sock->send_messages (sock, to, &local_message, 1);

Here, from the function description, I can see that this function knows
when the error is EWOULDBLOCK:

/* Convenience wrapper around nice_socket_send_messages(). Returns the number of
 * bytes sent on success (which will be @len), zero if sending would block, or
 * -1 on error. */

But that information is then lost and not taken into consideration. What I
did was put the write in a loop: if the error is EWOULDBLOCK, just try again,
up to a maximum of 100 times (an arbitrary number, so far just for tests). So
it now looks like this:

gssize
nice_socket_send (NiceSocket *sock, const NiceAddress *to, gsize len,
    const gchar *buf)
{
  GOutputVector local_buf = { buf, len };
  NiceOutputMessage local_message = { &local_buf, 1};
  gint ret = 0;
  gint max_iter = 100;

  /* A return value of 0 means the send would block. Try again, a
   * bounded number of times. */
  while (ret == 0 && max_iter > 0)
  {
    ret = sock->send_messages (sock, to, &local_message, 1);
    max_iter--;
  }

  /* One message was sent: report its length in bytes. */
  if (ret == 1)
    return len;
  return ret;
}

And the problem is now completely gone. We can send big files and really
large amounts of data between Windows, Linux and in mixed environments. Of
course, this kind of solution might not be what you want, but I wanted to
report a problem for which I think I've found the cause and a possible
solution. You can address it in whatever way you see fit.
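
Since adding a sleep also made the problem disappear, one variant that might be worth considering is a bounded retry that pauses briefly between attempts instead of spinning, while keeping the same return convention (length on success, zero if still blocked, negative on error). The sketch below is only an illustration under assumptions: SendOnceFn, PauseFn and MockSocket are hypothetical stand-ins I made up so the example is self-contained, with the callbacks taking the place of sock->send_messages and of a short sleep.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callbacks, not libnice APIs.
 * SendOnceFn returns 1 if the message was sent, 0 if the send would
 * block, and a negative value on error. */
typedef int (*SendOnceFn) (void *ctx);
typedef void (*PauseFn) (void *ctx);

/* Try to send up to max_attempts times, pausing between attempts.
 * Returns len on success, 0 if every attempt would block, or the
 * negative error value from send_once. */
static long
send_with_bounded_retry (SendOnceFn send_once, PauseFn pause, void *ctx,
    size_t len, int max_attempts)
{
  int ret = 0;
  int attempt;

  assert (send_once != NULL && max_attempts > 0);

  for (attempt = 0; attempt < max_attempts; attempt++) {
    ret = send_once (ctx);
    if (ret != 0)
      break;                     /* sent, or failed for real */
    if (pause != NULL && attempt + 1 < max_attempts)
      pause (ctx);               /* give the socket time to flush */
  }

  if (ret > 0)
    return (long) len;
  return ret;
}

/* Mock socket for illustration: busy for the first busy_calls sends. */
typedef struct {
  int busy_calls;
  int sends;
  int pauses;
} MockSocket;

static int
mock_send_once (void *ctx)
{
  MockSocket *m = (MockSocket *) ctx;

  m->sends++;
  if (m->busy_calls > 0) {
    m->busy_calls--;
    return 0;
  }
  return 1;
}

static void
mock_pause (void *ctx)
{
  MockSocket *m = (MockSocket *) ctx;

  m->pauses++;
}
```

With a mock that is busy for three calls, the fourth attempt succeeds after three pauses; with a mock that stays busy longer than max_attempts, the function returns zero and the caller can still decide what to do, rather than the would-block information being lost.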

If you have any thoughts on that, please share them.

Kind regards,
Radosław Kołodziejczyk.