[PATCH wayland 8/8] client: Use read preparation API in wl_display_dispatch_queue()

Fri Oct 9 03:08:27 PDT 2015

On Fri, Oct 09, 2015 at 12:50:24PM +0300, Pekka Paalanen wrote:
> On Fri, 9 Oct 2015 16:42:04 +0800
> Jonas Ådahl <jadahl at gmail.com> wrote:
> 
> > On Thu, Oct 08, 2015 at 04:29:03PM +0300, Pekka Paalanen wrote:
> > > On Fri,  2 Oct 2015 17:32:59 +0800
> > > Jonas Ådahl <jadahl at gmail.com> wrote:
> > > 
> > > > Instead of doing things that do the equivalent of using
> > > > wl_display_prepare_read() and friends, just use the public API. The
> > > > only semantical difference is that we will now unlock and lock the mutex
> > > > more times compared to before.
> > > > 
> > > > Signed-off-by: Jonas Ådahl <jadahl at gmail.com>
> > > > ---
> > > > 
> > > > There is no real need to merge this patch, it only makes it easier to
> > > > understand that the _dispatch(_queue)() functions are compatible with the
> > > > prepare_read() family.
> > > 
> > > Hi Jonas,
> > > 
> > > I think this would be a good improvement on maintainability. However, I
> > > don't think this patch is correct. It also points out a flaw in
> > > wl_display_flush().
> > > 
> > > >  src/wayland-client.c | 35 ++++++++---------------------------
> > > >  1 file changed, 8 insertions(+), 27 deletions(-)
> > > > 
> > > > diff --git a/src/wayland-client.c b/src/wayland-client.c
> > > > index ca2495f..8f53f6f 100644
> > > > --- a/src/wayland-client.c
> > > > +++ b/src/wayland-client.c
> > > > @@ -1507,15 +1507,9 @@ wl_display_dispatch_queue(struct wl_display *display,
> > > >  	struct pollfd pfd[2];
> > > >  	int ret;
> > > >  
> > > > -	pthread_mutex_lock(&display->mutex);
> > > > -
> > > > -	ret = dispatch_queue(display, queue);
> > > > -	if (ret == -1)
> > > > -		goto err_unlock;
> > > > -	if (ret > 0) {
> > > > -		pthread_mutex_unlock(&display->mutex);
> > > > +	ret = wl_display_dispatch_queue_pending(display, queue);
> > > > +	if (ret != 0)
> > > >  		return ret;
> > > > -	}
> > > 
> > > As the mutex is not locked and reader_count has not been bumped yet,
> > > another thread may be queueing events in our queue. We could lose that
> > > race.
> > 
> > Will comment on this further below.
> > 
> > > 
> > > >  
> > > >  	/* We ignore EPIPE here, so that we try to read events before
> > > >  	 * returning an error.  When the compositor sends an error it
> > > > @@ -1524,12 +1518,11 @@ wl_display_dispatch_queue(struct wl_display *display,
> > > >  	ret = wl_connection_flush(display->connection);
> > > >  	if (ret < 0 && errno != EAGAIN && errno != EPIPE) {
> > > 
> > > I believe wl_connection_flush() requires the display to be locked,
> > > which it no longer is. This should be replaced with a call to
> > > wl_display_flush().
> > > 
> > > But, wl_display_flush() will raise a fatal error on EPIPE, which makes
> > > the wl_display irreversibly unusable.
> > > 
> > > I think this raises an issue with wl_display_flush(), the same what the
> > > comment about EPIPE talks about. If an error has been raised on the
> > > wl_display, it won't be possible to read events or dispatch queues at
> > > all anymore. The comment implies that this causes prepare-read API
> > > using clients to be often incapable of reporting error events.
> > > 
> > > I believe the race is lost, when the client is busy doing something
> > > instead of waiting in poll(), and the compositor sends an error. When
> > > the client returns to its main loop and prepares to poll() again,
> > > wl_display_flush() will get EPIPE.
> > > 
> > > I think this would be a nice subject for a new test (since the existing
> > > test suite on neither Wayland nor Weston fails with this patch) that
> > > opens a connetion, sends a request that causes a protocol error, sleeps
> > > a bit to let the server react, and then resumes the normal event loop.
> > > The test would be checking that it gets the correct error event
> > > delivered.
> > > 
> > > I did a quick test:
> > > 
> > > I had this patch applied in wayland, and I hacked the EPIPE special
> > > case away. In Weston, I hacked the bad-buffer as follows:
> > > 
> > > 	wl_surface_attach(surface, bad_buffer, 0, 0);
> > >  	wl_surface_damage(surface, 0, 0, 200, 200);
> > >  	frame_callback_set(surface, &frame);
> > >  	wl_surface_commit(surface);
> > > +	wl_display_flush(client->wl_display);
> > > +	sleep(1);
> > > +	wl_display_sync(client->wl_display);
> > >  	frame_callback_wait_nofail(client, &frame);
> > >  
> > >  	expect_protocol_error(client, &wl_buffer_interface,
> > >  			      WL_SHM_ERROR_INVALID_FD);
> > > 
> > > Flush ensures the bad request gets sent. Sleep lets the compositor
> > > reply. Then we need to have something in the send buffer, so I used
> > > wl_display_sync. Finally, frame_callback_wait_nofail() calls
> > > wl_display_dispatch(). The test fails because the expected error is
> > > never dispatched.
> > > 
> > > I believe this case is even more confusing to a person debugging a
> > > protocol error, because I don't think WAYLAND_DEBUG=client will show
> > > the error event. Debug printing happens during dispatch.
> > > 
> > > If I put the EPIPE special case back, the test succeeds again.
> > 
> > For the purpose of documenting the discussion we had on IRC regarding
> > this; a potential solution to this is to change the behaviour of
> > wl_display_flush() to not set the fatal display error on EPIPE. The idea
> > is that we'd sooner or later set the fatal display error anyway when
> > trying to read. It's a bit of a semantical change, but we might just get
> > away with it, and without it, we wont be able to reliably read the error
> > when using the read preparation API.
> 
> Yeah.
> 
> > > 
> > > >  		display_fatal_error(display, errno);
> > > > -		goto err_unlock;
> > > > +		return -1;
> > > >  	}
> > > >  
> > > > -	display->reader_count++;
> > > > -
> > > > -	pthread_mutex_unlock(&display->mutex);
> > > > +	if (wl_display_prepare_read_queue(display, queue) != 0)
> > > > +		return -1;
> > > 
> > > This changes the behaviour of this function. Previously, when the mutex
> > > was locked over both dispatch_queue() and reader_count++, here was no
> > > race failure mode.
> > > 
> > > Instead, we should really have the loop as documented in
> > > wl_display_prepare_read_queue(), in full with all error checking. This
> > > will actually give us a proper, simple, working example of how to
> > > use the prepare-read API correctly. We've been lacking that for too
> > > long.
> > 
> > The problem with this is that the semantics of this function is
> > different from the normal usage of the preparation API, because this
> > function is expected to not block if there were already events on the
> > queue. Not that it is possible for the client to know whether there are
> > any though, so maybe it is fine.
> 
> Ah, yes indeed.
> 
> > I don't think it's even possible to implement the
> > wl_display_dispatch(_queue)() semantics with the current read preparation
> > API, because the intended usage (the loop dispatch_pending() until
> > prepare_read() succeeds) always assumes we want to eventually read. We'd
> > need something like an atomic "dispatch_pending_or_prepare_read()" to
> > avoid this race without changing the semantics.
> 
> How about this:
> 
> if (wl_display_prepare_read_queue() < 0)
> 	/* there is something in the queue, so dispatch once and return success */
> 	return wl_display_dispatch_queue();
> 
> /* now we are registered to read, no-one can queue events until actual read */
> wl_display_flush();
> poll();
> read...

Ah, yes indeed, that looks like it would actually work, except your call
to wl_display_dispatch_queue() would be
wl_display_dispatch_queue_pending() instead.

Jonas

> 
> 
> Thanks,
> pq