Catastrophic blocking

Wed Feb 29 08:50:10 PST 2012

On Wed, Feb 29, 2012 at 10:58:56AM +0200, Pekka Paalanen wrote:
> On Tue, 28 Feb 2012 14:32:21 -0500
> Kristian Hoegsberg <hoegsberg at gmail.com> wrote:
> 
> > On Mon, Feb 27, 2012 at 04:57:42PM +0100, Samuel Rødal wrote:
> > > Ignore previous patch, here's the correct version.
> > 
> > > From 4e1bedaaf05b576f5191f8fe3a34904ab9707414 Mon Sep 17 00:00:00 2001
> > > From: =?UTF-8?q?Samuel=20R=C3=B8dal?= <samuel.rodal at nokia.com>
> > > Date: Mon, 27 Feb 2012 15:17:20 +0100
> > > Subject: [PATCH] Allow update function to not be set in wl_display_get_fd
> > > 
> > > The same check is done in connection_update, and now with
> > > wl_display_flush() there's less need for the client to need to know the
> > > connection mask.
> > 
> > Yeah, ok, looks good.  If you're paranoid about blocking on write,
> > you need to poll for write of course, but for non-broken
> > apps/compositors the write should never block.
> 
> About blocking... broken apps are a fact we must tolerate. If an app
> gets stuck and does not read its fd anymore, does that make the
> corresponding fd in the server count as not writable? Or will it become
> non-writable only after possible kernel buffers have been filled?
> 
> Is a writable fd, as indicated by epoll, guaranteed to not block on
> sendmsg()? I don't know, but I wouldn't think it is without O_NONBLOCK,
> since the kernel cannot cache an arbitrary amount of data, can it?
> 
> I also don't see any of the fds getting O_NONBLOCK anywhere.
> 
> Could the whole server be blocked by a single client not reading its
> fd, making the whole computer appear frozen to a user?
> 
> So, I made a test with the current Weston. I added the following patch:
> 
> --- a/clients/clickdot.c
> +++ b/clients/clickdot.c
> @@ -28,6 +28,7 @@
>  #include <cairo.h>
>  #include <math.h>
>  #include <assert.h>
> +#include <unistd.h>
>  
>  #include <linux/input.h>
>  #include <wayland-client.h>
> @@ -104,6 +105,10 @@ key_handler(struct window *window, struct input *input, uint32_t time,
>         case XK_Escape:
>                 display_exit(clickdot->display);
>                 break;
> +       case XK_h:
> +               while (1)
> +                       sleep(1);
> +               break;
>         }
>  }
> 
> Started Weston in X and clickdot, and made sure it all works, then
> pressed 'h' to make clickdot hang.
> 
> For a while, everything seemed good. Only clickdot was stuck and Weston
> still worked. After a few minutes of mouse-waving, Weston got stuck.
> The backtrace of Weston was:
> 
> #0  0x00007f13fa9ad4d0 in __sendmsg_nocancel () from /lib64/libc.so.6
> #1  0x00007f13fc60e3a1 in wl_connection_data (connection=0x2776360, mask=2) at connection.c:272
> #2  0x00007f13fc60e654 in wl_connection_write (connection=0x2776360, data=0x277aa80, count=20) at connection.c:335
> #3  0x00007f13fc60f9cc in wl_closure_send (closure=0x277a910, connection=0x2776360) at connection.c:743
> #4  0x00007f13fc609f2c in wl_resource_post_event (resource=0x26cb4f0, opcode=0) at wayland-server.c:107
> #5  0x00007f13fc60ab41 in default_grab_motion (grab=0x2655f98, time=3360222569, x=292, y=213) at wayland-server.c:457
> #6  0x0000000000409cc7 in notify_motion (device=0x2655f00, time=3360222569, x=380, y=305) at compositor.c:1487
> #7  0x00007f13f8ce74d3 in x11_compositor_handle_event (fd=9, mask=1, data=0x25b9960) at compositor-x11.c:604
> #8  0x00007f13fc60d16e in wl_event_source_fd_dispatch (source=0x27034a0, ep=0x7fffcecd9c00) at event-loop.c:76
> #9  0x00007f13fc60dbeb in wl_event_loop_dispatch (loop=0x25b8900, timeout=-1) at event-loop.c:462
> #10 0x00007f13fc60b7c1 in wl_display_run (display=0x25b88b0) at wayland-server.c:837
> #11 0x000000000040c3cd in main (argc=3, argv=0x7fffcecda018) at compositor.c:2518
> 
> Just like I assumed, it is now blocking in sendmsg(). Killing clickdot,
> Weston came back to life.
> 
> We really need to have the file descriptors in the server to be of
> non-blocking kind.

We talked about this in IRC and I pushed:

commit b26774da5b907a71aa7c72ee1feb6dcde92b9fcb
Author: Kristian Høgsberg <krh at bitplanet.net>
Date:   Wed Feb 29 11:07:48 2012 -0500

    Don't block when flushing a full protocol buffer

    In case the client isn't responding, this will block the compositor.
    Instead we flush with MSG_DONTWAIT, which lets us fill up the kernel buffer
    as much as we can (after not returning EPOLLOUT anymore it still can take
    80k more), and then disconnect the client if we get EAGAIN.

which handles the problem you hit.  The connection code polls for
writable just fine and only writes when the fd returns EPOLLOUT.
Except in the above case, which is where it typically happens.  When
the protocol buffer is full, we write without checking for writable,
and after a while that hangs.  One option would be to just kill the
client if the protocol buffer overflows, but that doesn't leave the
client a lot of rope.  Also, aparently an fd can be "spuriously
writable", that is, even if we get EPOLLOUT, writing to it may still
block.  That sure seems like a bug to me, but if we pass MSG_DONTWAIT
to sendmsg we guard against that and can fill up the kernel buffer
before we kill the client.

This has to be a last resort though.  We can't just kill the clients,
so we need to stop sending events to them early on if we detect
they're not responding.  Early enough that we can still queue up
pointer and keyboard leave events and then mark the clients surfaces
as not-currently-taking-input.  When/if the client comes back to life
it will see a consistent event stream (lost kb and mouse focus) and
we'll unban its surfaces.

Kristian