[PATCH] client: Allow send error recovery without an abort

Mon Jun 18 11:58:38 UTC 2018

On Tue,  5 Jun 2018 18:14:54 -0700
Lloyd Pique <lpique at google.com> wrote:

> Introduce a new call wl_display_set_send_error_handler_client(), which
> allows a client to register a callback function which is invoked if there
> is an error (possibly transient) while sending messages to the server.
> 
> This allows a Wayland client that is actually a nested server to try and
> recover more gracefully from send errors, allowing it to disconnect
> and reconnect to the server if necessary.
> 
> The handler is called with the wl_display and the errno, and is expected
> to return one of WL_SEND_ERROR_ABORT, WL_SEND_ERROR_FATAL, or
> WL_SEND_ERROR_RETRY. The core-client code will then respectively abort()
> the process (the default if no handler is set), set the display context
> into an error state, or retry the send operation that failed.
> 
> The existing display test is extended to exercise the new function.

Hi Lloyd,

technically this looks like a quality submission, thank you for that. I
am particularly happy to see all the new cases are added into the test
suite. I agree that calling wl_abort() is nasty and it would be good to
have something else.

However, I'm not quite convinced of the introduced complexity here.
More comments inline.

> Signed-off-by: Lloyd Pique <lpique at google.com>
> ---
>  COPYING                   |   1 +
>  src/wayland-client-core.h |  75 +++++++++++++++++
>  src/wayland-client.c      |  67 +++++++++++++++-
>  tests/display-test.c      | 165 +++++++++++++++++++++++++++++++++++++-
>  4 files changed, 306 insertions(+), 2 deletions(-)
> 
> diff --git a/COPYING b/COPYING
> index eb25a4e..bace5d7 100644
> --- a/COPYING
> +++ b/COPYING
> @@ -2,6 +2,7 @@ Copyright © 2008-2012 Kristian Høgsberg
>  Copyright © 2010-2012 Intel Corporation
>  Copyright © 2011 Benjamin Franzke
>  Copyright © 2012 Collabora, Ltd.
> +Copyright © 2018 Google LLC
>  
>  Permission is hereby granted, free of charge, to any person obtaining a
>  copy of this software and associated documentation files (the "Software"),
> diff --git a/src/wayland-client-core.h b/src/wayland-client-core.h
> index 03e781b..a3e4b8e 100644
> --- a/src/wayland-client-core.h
> +++ b/src/wayland-client-core.h
> @@ -257,6 +257,81 @@ wl_display_cancel_read(struct wl_display *display);
>  int
>  wl_display_read_events(struct wl_display *display);
>  
> +/** \enum wl_send_error_strategy
> + *
> + * These are the valid values that can be returned from a
> + * wl_send_error_handler_func_t
> + *
> + * \sa wl_send_error_handler_func_t
> + */
> +enum wl_send_error_strategy {
> +	/** Abort the client */
> +	WL_SEND_ERROR_ABORT,
> +	/** Put the display into a fatal error state */
> +	WL_SEND_ERROR_FATAL,
> +	/** Retry the send operation */
> +	WL_SEND_ERROR_RETRY,
> +};
> +
> +/**
> + * \brief A function pointer type for customizing client send error handling
> + *
> + * A client send error strategy handler is a callback function which the client
> + * can set to direct the core client library how to respond if an error is
> + * encountered while sending a message.
> + *
> + * Normally messages are enqueued to an output buffer by the core client
> + * library, and the user of the core client library is expected to call
> + * wl_display_flush() occasionally, but as that is non-blocking, messages can
> + * queue up in the output buffer.  If later calls to send messages happen to
> + * fill up the output buffer, the core client library will make an internal
> + * call to flush the output buffer.  But if the write call unexpectedly fails,
> + * as there is no good way to return an error to the caller, the core client
> + * library will log a message and call abort().

The analysis of the current situation is correct.

If you know you are sending lots of requests, you should be calling
wl_display_flush() occasionally. If it cannot write everything out, it
returns -1 (with errno set to EAGAIN when the socket is full), which
tells you that you need to wait. The main problem I see with that is
when to attempt to flush.

I remember some discussion from the past, where we would make the
soft-buffer in libwayland-client larger and introduce a check API which
would return whether, say, more than half of the space is already
consumed. That would work as a trigger for an extra flush at a point
convenient to the client.

I remember discussing something like that with Daniel Stone, but I
can't find it. Might have been in IRC.

Or maybe it was more like, double the soft-buffer size, attempt an
automatic flush if it is more than half-full, and if that fails,
continue and raise a flag that the user can query at a convenient time
and do the dance for manual flush and poll for writable. The remaining
soft-buffer should hopefully allow the manual flush before the buffer
runs out.
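
To make the second variant concrete, here is a minimal sketch of the
bookkeeping it implies. All names (soft_buffer, soft_buffer_queue,
soft_buffer_flush_wanted, SOFT_BUFFER_SIZE) are hypothetical and nothing
here is existing libwayland API; the outcome of the automatic flush is
passed in as a parameter so the logic can be read in isolation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical size of the (already doubled) soft-buffer. */
#define SOFT_BUFFER_SIZE 8192

struct soft_buffer {
	size_t used;        /* bytes queued but not yet written out */
	bool flush_wanted;  /* raised when an automatic flush failed */
};

/* The query the client would make at a point convenient to it. */
static bool
soft_buffer_flush_wanted(const struct soft_buffer *buf)
{
	return buf->flush_wanted;
}

/* Queue len bytes. Once the buffer is more than half full an automatic
 * flush is attempted; auto_flush_ok simulates its result. On failure the
 * data stays queued and the flag is raised instead of aborting.
 * Returns -1 only when the message does not fit at all. */
static int
soft_buffer_queue(struct soft_buffer *buf, size_t len, bool auto_flush_ok)
{
	assert(buf != NULL);

	if (len > SOFT_BUFFER_SIZE - buf->used)
		return -1;

	buf->used += len;
	if (buf->used > SOFT_BUFFER_SIZE / 2) {
		if (auto_flush_ok) {
			buf->used = 0;
			buf->flush_wanted = false;
		} else {
			buf->flush_wanted = true;
		}
	}
	return 0;
}
```

The point of the design is that a failed automatic flush is not an error
by itself: the upper half of the buffer is the headroom that lets the
client reach its next safe point before it has to react to the flag.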

> + *
> + * If instead a send error strategy function is set, the core client library
> + * will call it to allow the client to choose a different strategy.
> + *
> + * The function takes two arguments.  The first is the display context for the
> + * error.  The second is the errno for the failure that occurred.
> + *
> + * The handler is expected to return one of the wl_send_error_strategy values
> + * to indicate how the core client library should proceed.

This is the detail that bothers me: making the callback to determine
policy. You have no idea in what context that callback will be made.

> + *
> + * If the handler returns WL_SEND_ERROR_ABORT, the core client code will
> + * perform the default action -- log an error and then call the C runtime
> + * abort() function.

This is the old behaviour, which we agree is not nice.

> + *
> + * If the handler returns WL_SEND_ERROR_FATAL, the core client code will
> + * continue, but the display context will be put into a fatal error state, and
> + * the display should no longer be used. Further calls to send messages
> + * will function, but the message will not actually be enqueued.

This is the behaviour I would prefer for unrecoverable errors. I
actually wrote that down recently:
https://gitlab.freedesktop.org/wayland/wayland/issues/12

> + *
> + * If the handler returns WL_SEND_ERROR_RETRY, the send operation is retried
> + * immediately on return, and may fail again.  If it does, the handler will be
> + * called again with the new errno. To avoid a busy loop, the handler should
> + * wait on the display connection, or something equivalent.  If after returning
> + * WL_SEND_ERROR_RETRY the send operation succeeds, the handler is called one
> + * last time with an errno of zero. This allows the handler to perform any
> + * necessary cleanup now that the send operation has succeeded. For this final
> + * call, the return value is ignored.
> + *
> + * Note that if the handler blocks, the current thread that is making the call
> + * to send a message will of course be blocked. This means the default behavior
> + * of Wayland client calls being non-blocking will no longer be true.
> + *
> + * In a multi threaded environment, this error handler could be called once for
> + * every thread that attempts to send a message on the display and encounters
> + * the error with the connection. It is the responsibility of the handler
> + * function to be thread safe itself.

This third case is the red flag for me.

How could someone use this efficiently in real life? Do you have
example users for this?

How does the callback know at what kind of point of execution you are
in your application? Can you block there? Can you deal with
disconnection? What if the trigger fires inside a library you use, like
libEGL? Which holds mutexes of its own? What if it happens in a thread
you didn't expect, e.g. from GStreamer?

You'd probably need to maintain global flag variables across code
blocks. If you maintain something across code blocks, would it not be
easier instead to try a manual flush at the end of a block and see if
that works?

If flushing is inefficient, we could look into adding ABI to get
information about the buffer status, so you can make a better choice on
whether to flush or not.

If you attempt a manual flush and it fails, you will have more freedom
to handle that situation than with the callback you propose. You can do
everything you could with the callback and you could also choose to
return from the request sending code block and come back later, e.g. on
the next round of your main event loop or when the Wayland fd signals
writable in poll() again.
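
As a rough illustration of that flow, here is a sketch in plain POSIX C
rather than against libwayland. The names out_buffer, make_nonblocking(),
try_flush() and wait_writable() are hypothetical stand-ins; in a real
client the flush step would be wl_display_flush() and the fd would come
from wl_display_get_fd().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the client's pending output. */
struct out_buffer {
	const char *data;
	size_t len;   /* total bytes queued */
	size_t sent;  /* bytes already written to the fd */
};

/* Put fd into non-blocking mode, as the Wayland connection fd is. */
static int
make_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Write as much as the fd accepts right now. Returns 0 once everything
 * is out, or -1 with errno set (EAGAIN when the fd is full). */
static int
try_flush(int fd, struct out_buffer *buf)
{
	assert(buf != NULL);

	while (buf->sent < buf->len) {
		ssize_t n = write(fd, buf->data + buf->sent,
				  buf->len - buf->sent);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf->sent += (size_t)n;
	}
	return 0;
}

/* Returns 1 if fd became writable within timeout_ms, 0 on timeout,
 * -1 on error. A main loop would normally fold this into its own poll. */
static int
wait_writable(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLOUT };
	int ret;

	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);

	if (ret <= 0)
		return ret;
	return (pfd.revents & POLLOUT) ? 1 : -1;
}
```

When try_flush() fails with EAGAIN, the caller decides what happens next:
it can wait right there with wait_writable(), or return to its main loop,
keep servicing its own clients, and retry when the fd polls writable.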

Interrupting what you are doing is much easier if you can pick the
points where it is possible. It is hard to be able to handle a
potential callback every time you call a request sending function.

Especially if the client is a nested server, you probably would not
want to block on the Wayland connection because that could prevent you
from servicing your own clients.

Do you think the alternative solutions could be workable for you?

I would be very happy to see someone working on those, this issue has
existed since the beginning.

Thanks,
pq

> + */
> +typedef enum wl_send_error_strategy
> +	(*wl_send_error_handler_func_t)(struct wl_display *, int /* errno */);
> +
> +void
> +wl_display_set_send_error_handler_client(struct wl_display *display,
> +				         wl_send_error_handler_func_t handler);
> +
>  void
>  wl_log_set_handler_client(wl_log_func_t handler);
>  
> diff --git a/src/wayland-client.c b/src/wayland-client.c
> index efeb745..3a9db6f 100644
> --- a/src/wayland-client.c
> +++ b/src/wayland-client.c
> @@ -113,6 +113,8 @@ struct wl_display {
>  	int reader_count;
>  	uint32_t read_serial;
>  	pthread_cond_t reader_cond;
> +
> +	wl_send_error_handler_func_t send_error_handler;
>  };
>  
>  /** \endcond */
> @@ -696,6 +698,44 @@ wl_proxy_marshal_array_constructor(struct wl_proxy *proxy,
>  							    proxy->version);
>  }
>  
> +static enum wl_send_error_strategy
> +display_send_error_handler_invoke(struct wl_display *display, int send_errno)
> +{
> +	enum wl_send_error_strategy ret = WL_SEND_ERROR_ABORT;
> +	wl_send_error_handler_func_t handler = display->send_error_handler;
> +	if (handler) {
> +		pthread_mutex_unlock(&display->mutex);
> +		ret = handler(display, send_errno);
> +		pthread_mutex_lock(&display->mutex);
> +	}
> +	return ret;
> +}
> +
> +static void
> +display_closure_send_error(struct wl_display *display,
> +	                   struct wl_closure *closure)
> +{
> +	while (true) {
> +		int send_errno = errno;
> +		switch (display_send_error_handler_invoke(display, send_errno)) {
> +		case WL_SEND_ERROR_FATAL:
> +			display_fatal_error(display, send_errno);
> +			return;
> +
> +		case WL_SEND_ERROR_RETRY:
> +			if (!wl_closure_send(closure, display->connection)) {
> +				display_send_error_handler_invoke(display, 0);
> +				return;
> +			}
> +			break;
> +
> +		case WL_SEND_ERROR_ABORT:
> +		default:
> +			wl_abort("Error sending request: %s\n",
> +				 strerror(send_errno));
> +		}
> +	}
> +}
>  
>  /** Prepare a request to be sent to the compositor
>   *
> @@ -743,6 +783,9 @@ wl_proxy_marshal_array_constructor_versioned(struct wl_proxy *proxy,
>  			goto err_unlock;
>  	}
>  
> +	if (proxy->display->last_error)
> +		goto err_unlock;
> +
>  	closure = wl_closure_marshal(&proxy->object, opcode, args, message);
>  	if (closure == NULL)
>  		wl_abort("Error marshalling request: %s\n", strerror(errno));
> @@ -751,7 +794,7 @@ wl_proxy_marshal_array_constructor_versioned(struct wl_proxy *proxy,
>  		wl_closure_print(closure, &proxy->object, true);
>  
>  	if (wl_closure_send(closure, proxy->display->connection))
> -		wl_abort("Error sending request: %s\n", strerror(errno));
> +		display_closure_send_error(proxy->display, closure);
>  
>  	wl_closure_destroy(closure);
>  
> @@ -2000,6 +2043,28 @@ wl_display_flush(struct wl_display *display)
>  	return ret;
>  }
>  
> +/** Sets the error handler to invoke when a send fails internally
> +*
> +* \param display The display context object
> +* \param handler The function to call when an error occurs.
> +*
> +* Sets the error handler to call to decide what should happen on a send error.
> +*
> +* If no handler is set, the client will use the WL_SEND_ERROR_ABORT strategy.
> +* This means the core client code will call abort() if it encounters a failure
> +* while sending a message to the server.
> +*
> +* Setting a handler allows a client implementation to alter this behavior.
> +*
> +* \memberof wl_display
> +*/
> +WL_EXPORT void
> +wl_display_set_send_error_handler_client(struct wl_display *display,
> +					 wl_send_error_handler_func_t handler)
> +{
> +	display->send_error_handler = handler;
> +}
> +
>  /** Set the user data associated with a proxy
>   *
>   * \param proxy The proxy object