Catastrophic blocking

Wed Feb 29 01:52:33 PST 2012

On 02/29/2012 09:58 AM, Pekka Paalanen wrote:
> On Tue, 28 Feb 2012 14:32:21 -0500
> Kristian Hoegsberg<hoegsberg at gmail.com>  wrote:
> 
>> On Mon, Feb 27, 2012 at 04:57:42PM +0100, Samuel Rødal wrote:
>>> Ignore previous patch, here's the correct version.
>>
>>>  From 4e1bedaaf05b576f5191f8fe3a34904ab9707414 Mon Sep 17 00:00:00 2001
>>> From: =?UTF-8?q?Samuel=20R=C3=B8dal?=<samuel.rodal at nokia.com>
>>> Date: Mon, 27 Feb 2012 15:17:20 +0100
>>> Subject: [PATCH] Allow update function to not be set in wl_display_get_fd
>>>
>>> The same check is done in connection_update, and now with
>>> wl_display_flush() there's less need for the client to need to know the
>>> connection mask.
>>
>> Yeah, ok, looks good.  If you're paranoid about blocking on write,
>> you need to poll for write of course, but for non-broken
>> apps/compositors the write should never block.
> 
> About blocking... broken apps are a fact we must tolerate. If an app
> gets stuck and does not read its fd anymore, does that make the
> corresponding fd in the server count as not writable? Or will it become
> non-writable only after possible kernel buffers have been filled?
> 
> Is a writable fd, as indicated by epoll, guaranteed to not block on
> sendmsg()? I don't know, but I wouldn't think it is without O_NONBLOCK,
> since the kernel cannot cache an arbitrary amount of data, can it?
> 

sendmsg() can, like all write() operations, return fewer bytes than
you asked it to send, so any space remaining in the kernel send
buffers should mean that you can send some data without blocking.
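
To make the short-write case concrete, here is a rough sketch of the
bookkeeping a sender would need. The names (struct pending_buf,
pending_send_once) are made up for illustration and are not from
libwayland, and I'm assuming MSG_DONTWAIT here: as far as I know, on a
plain blocking stream socket the kernel may instead wait until it has
queued everything.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical bookkeeping for data that has not been sent yet. */
struct pending_buf {
	const char *data;
	size_t len;	/* total bytes in data */
	size_t off;	/* bytes already accepted by the kernel */
};

/*
 * Try once to send whatever is left. Returns the number of bytes the
 * kernel accepted (possibly fewer than requested, 0 if the send buffer
 * is full or nothing is pending), or -1 on a real error.
 */
static ssize_t
pending_send_once(int fd, struct pending_buf *p)
{
	struct iovec iov;
	struct msghdr msg = { 0 };
	ssize_t n;

	if (p->off == p->len)
		return 0;

	iov.iov_base = (void *)(p->data + p->off);
	iov.iov_len = p->len - p->off;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;

	/* MSG_NOSIGNAL: report a dead peer as EPIPE, not SIGPIPE. */
	do {
		n = sendmsg(fd, &msg, MSG_DONTWAIT | MSG_NOSIGNAL);
	} while (n < 0 && errno == EINTR);

	if (n < 0)
		return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;

	p->off += (size_t)n;	/* remember how far the short write got */
	return n;
}
```

The caller keeps the struct around and calls this again once the fd is
reported writable, until off reaches len.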

> I also don't see any of the fds getting O_NONBLOCK anywhere.
> 
> Could the whole server be blocked by a single client not reading its
> fd, making the whole computer appear frozen to a user?
> 
> So, I made a test with the current Weston. I added the following patch:
> 
> --- a/clients/clickdot.c
> +++ b/clients/clickdot.c
> @@ -28,6 +28,7 @@
>   #include<cairo.h>
>   #include<math.h>
>   #include<assert.h>
> +#include<unistd.h>
> 
>   #include<linux/input.h>
>   #include<wayland-client.h>
> @@ -104,6 +105,10 @@ key_handler(struct window *window, struct input *input, uint32_t time,
>          case XK_Escape:
>                  display_exit(clickdot->display);
>                  break;
> +       case XK_h:
> +               while (1)
> +                       sleep(1);
> +               break;
>          }
>   }
> 
> Started Weston in X and clickdot, and made sure it all works, then
> pressed 'h' to make clickdot hang.
> 
> For a while, everything seemed good. Only clickdot was stuck and Weston
> still worked.

Presumably while there was still space available in the send buffers;
otherwise Weston would have hung immediately.

> After a few minutes of mouse-waving, Weston got stuck.

This shouldn't happen if we never try to write to an fd that isn't writable,
so it would seem that wl_closure_send() (which I assume just kills off the
client if it doesn't respond to something for long enough) doesn't bother
polling for writability before it tries to send.

> The backtrace of Weston was:
> 
> #0  0x00007f13fa9ad4d0 in __sendmsg_nocancel () from /lib64/libc.so.6
> #1  0x00007f13fc60e3a1 in wl_connection_data (connection=0x2776360, mask=2) at connection.c:272
> #2  0x00007f13fc60e654 in wl_connection_write (connection=0x2776360, data=0x277aa80, count=20) at connection.c:335
> #3  0x00007f13fc60f9cc in wl_closure_send (closure=0x277a910, connection=0x2776360) at connection.c:743
> #4  0x00007f13fc609f2c in wl_resource_post_event (resource=0x26cb4f0, opcode=0) at wayland-server.c:107
> #5  0x00007f13fc60ab41 in default_grab_motion (grab=0x2655f98, time=3360222569, x=292, y=213) at wayland-server.c:457
> #6  0x0000000000409cc7 in notify_motion (device=0x2655f00, time=3360222569, x=380, y=305) at compositor.c:1487
> #7  0x00007f13f8ce74d3 in x11_compositor_handle_event (fd=9, mask=1, data=0x25b9960) at compositor-x11.c:604
> #8  0x00007f13fc60d16e in wl_event_source_fd_dispatch (source=0x27034a0, ep=0x7fffcecd9c00) at event-loop.c:76
> #9  0x00007f13fc60dbeb in wl_event_loop_dispatch (loop=0x25b8900, timeout=-1) at event-loop.c:462
> #10 0x00007f13fc60b7c1 in wl_display_run (display=0x25b88b0) at wayland-server.c:837
> #11 0x000000000040c3cd in main (argc=3, argv=0x7fffcecda018) at compositor.c:2518
> 
> Just like I assumed, it is now blocking in sendmsg(). Killing clickdot,
> Weston came back to life.
> 

That's because sendmsg() then returns -1 and sets errno to EPIPE, which
seems to be handled gracefully.

> We really need to have the file descriptors in the server to be of
> non-blocking kind.
> 

Or always poll them before writing to them. Making the sends nonblocking
and handling EAGAIN is probably the better way to go though, especially
since it would be as simple as adding MSG_DONTWAIT to the flags argument
of the sendmsg() call.
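
Something along these lines is what I have in mind. It is an untested
sketch against plain sockets, not a patch to connection.c; the enum and
send_dontwait() are names I invented for the example, and MSG_NOSIGNAL
is my own addition so that a dead client shows up as EPIPE rather than
SIGPIPE.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

enum send_result {
	SEND_OK,		/* some or all bytes were queued */
	SEND_WOULD_BLOCK,	/* buffer full: retry when writable */
	SEND_PEER_GONE,		/* EPIPE/ECONNRESET: drop the client */
	SEND_ERROR		/* anything else */
};

/* One non-blocking send attempt; *sent receives the bytes queued. */
static enum send_result
send_dontwait(int fd, const void *buf, size_t len, size_t *sent)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	struct msghdr msg = { 0 };
	ssize_t n;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	*sent = 0;

	do {
		n = sendmsg(fd, &msg, MSG_DONTWAIT | MSG_NOSIGNAL);
	} while (n < 0 && errno == EINTR);

	if (n >= 0) {
		*sent = (size_t)n;
		return SEND_OK;
	}
	if (errno == EAGAIN || errno == EWOULDBLOCK)
		return SEND_WOULD_BLOCK;
	if (errno == EPIPE || errno == ECONNRESET)
		return SEND_PEER_GONE;
	return SEND_ERROR;
}
```

On SEND_WOULD_BLOCK the server would have to decide what to do with
the unsent data: buffer it and wait for the fd to become writable, or
give up on the client after some limit.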

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.