[PATCH wayland 8/8] client: Use read preparation API in wl_display_dispatch_queue()

Fri Oct 9 01:42:04 PDT 2015

On Thu, Oct 08, 2015 at 04:29:03PM +0300, Pekka Paalanen wrote:
> On Fri,  2 Oct 2015 17:32:59 +0800
> Jonas Ådahl <jadahl at gmail.com> wrote:
> 
> > Instead of doing things that do the equivalent of using
> > wl_display_prepare_read() and friends, just use the public API. The
> > only semantical difference is that we will now unlock and lock the mutex
> > more times compared to before.
> > 
> > Signed-off-by: Jonas Ådahl <jadahl at gmail.com>
> > ---
> > 
> > There is no real need to merge this patch, it only makes it easier to
> > understand that the _dispatch(_queue)() functions are compatible with the
> > prepare_read() family.
> 
> Hi Jonas,
> 
> I think this would be a good improvement on maintainability. However, I
> don't think this patch is correct. It also points out a flaw in
> wl_display_flush().
> 
> >  src/wayland-client.c | 35 ++++++++---------------------------
> >  1 file changed, 8 insertions(+), 27 deletions(-)
> > 
> > diff --git a/src/wayland-client.c b/src/wayland-client.c
> > index ca2495f..8f53f6f 100644
> > --- a/src/wayland-client.c
> > +++ b/src/wayland-client.c
> > @@ -1507,15 +1507,9 @@ wl_display_dispatch_queue(struct wl_display *display,
> >  	struct pollfd pfd[2];
> >  	int ret;
> >  
> > -	pthread_mutex_lock(&display->mutex);
> > -
> > -	ret = dispatch_queue(display, queue);
> > -	if (ret == -1)
> > -		goto err_unlock;
> > -	if (ret > 0) {
> > -		pthread_mutex_unlock(&display->mutex);
> > +	ret = wl_display_dispatch_queue_pending(display, queue);
> > +	if (ret != 0)
> >  		return ret;
> > -	}
> 
> As the mutex is not locked and reader_count has not been bumped yet,
> another thread may be queueing events in our queue. We could lose that
> race.

Will comment on this further below.

> 
> >  
> >  	/* We ignore EPIPE here, so that we try to read events before
> >  	 * returning an error.  When the compositor sends an error it
> > @@ -1524,12 +1518,11 @@ wl_display_dispatch_queue(struct wl_display *display,
> >  	ret = wl_connection_flush(display->connection);
> >  	if (ret < 0 && errno != EAGAIN && errno != EPIPE) {
> 
> I believe wl_connection_flush() requires the display to be locked,
> which it no longer is. This should be replaced with a call to
> wl_display_flush().
> 
> But, wl_display_flush() will raise a fatal error on EPIPE, which makes
> the wl_display irreversibly unusable.
> 
> I think this raises an issue with wl_display_flush(), the same what the
> comment about EPIPE talks about. If an error has been raised on the
> wl_display, it won't be possible to read events or dispatch queues at
> all anymore. The comment implies that this causes prepare-read API
> using clients to be often incapable of reporting error events.
> 
> I believe the race is lost, when the client is busy doing something
> instead of waiting in poll(), and the compositor sends an error. When
> the client returns to its main loop and prepares to poll() again,
> wl_display_flush() will get EPIPE.
> 
> I think this would be a nice subject for a new test (since the existing
> test suite on neither Wayland nor Weston fails with this patch) that
> opens a connetion, sends a request that causes a protocol error, sleeps
> a bit to let the server react, and then resumes the normal event loop.
> The test would be checking that it gets the correct error event
> delivered.
> 
> I did a quick test:
> 
> I had this patch applied in wayland, and I hacked the EPIPE special
> case away. In Weston, I hacked the bad-buffer as follows:
> 
> 	wl_surface_attach(surface, bad_buffer, 0, 0);
>  	wl_surface_damage(surface, 0, 0, 200, 200);
>  	frame_callback_set(surface, &frame);
>  	wl_surface_commit(surface);
> +	wl_display_flush(client->wl_display);
> +	sleep(1);
> +	wl_display_sync(client->wl_display);
>  	frame_callback_wait_nofail(client, &frame);
>  
>  	expect_protocol_error(client, &wl_buffer_interface,
>  			      WL_SHM_ERROR_INVALID_FD);
> 
> Flush ensures the bad request gets sent. Sleep lets the compositor
> reply. Then we need to have something in the send buffer, so I used
> wl_display_sync. Finally, frame_callback_wait_nofail() calls
> wl_display_dispatch(). The test fails because the expected error is
> never dispatched.
> 
> I believe this case is even more confusing to a person debugging a
> protocol error, because I don't think WAYLAND_DEBUG=client will show
> the error event. Debug printing happens during dispatch.
> 
> If I put the EPIPE special case back, the test succeeds again.

For the purpose of documenting the discussion we had on IRC regarding
this; a potential solution to this is to change the behaviour of
wl_display_flush() to not set the fatal display error on EPIPE. The idea
is that we'd sooner or later set the fatal display error anyway when
trying to read. It's a bit of a semantical change, but we might just get
away with it, and without it, we wont be able to reliably read the error
when using the read preparation API.

> 
> >  		display_fatal_error(display, errno);
> > -		goto err_unlock;
> > +		return -1;
> >  	}
> >  
> > -	display->reader_count++;
> > -
> > -	pthread_mutex_unlock(&display->mutex);
> > +	if (wl_display_prepare_read_queue(display, queue) != 0)
> > +		return -1;
> 
> This changes the behaviour of this function. Previously, when the mutex
> was locked over both dispatch_queue() and reader_count++, here was no
> race failure mode.
> 
> Instead, we should really have the loop as documented in
> wl_display_prepare_read_queue(), in full with all error checking. This
> will actually give us a proper, simple, working example of how to
> use the prepare-read API correctly. We've been lacking that for too
> long.

The problem with this is that the semantics of this function is
different from the normal usage of the preparation API, because this
function is expected to not block if there were already events on the
queue. Not that it is possible for the client to know whether there are
any though, so maybe it is fine.

I don't think it's even possible to implement the
wl_display_dispatch(_queue)() semantics with the current read preparation
API, because the intended usage (the loop dispatch_pending() until
prepare_read() succeeds) always assumes we want to eventually read. We'd
need something like an atomic "dispatch_pending_or_prepare_read()" to
avoid this race without changing the semantics.

Jonas

> 
> >  
> >  	pfd[0].fd = display->fd;
> >  	pfd[0].events = POLLIN;
> > @@ -1542,22 +1535,10 @@ wl_display_dispatch_queue(struct wl_display *display,
> >  		return -1;
> >  	}
> >  
> > -	pthread_mutex_lock(&display->mutex);
> > -
> > -	if (read_events(display) == -1)
> > -		goto err_unlock;
> > -
> > -	ret = dispatch_queue(display, queue);
> > -	if (ret == -1)
> > -		goto err_unlock;
> > -
> > -	pthread_mutex_unlock(&display->mutex);
> > -
> > -	return ret;
> > +	if (wl_display_read_events(display) == -1)
> > +		return -1;
> >  
> > - err_unlock:
> > -	pthread_mutex_unlock(&display->mutex);
> > -	return -1;
> > +	return wl_display_dispatch_queue_pending(display, queue);
> >  }
> >  
> >  /** Dispatch pending events in an event queue
> 
> The rest looks fine to me.
> 
> 
> Thanks,
> pq