[PATCH] client: Allow send error recovery without an abort

Tue Jun 19 01:41:28 UTC 2018

Hi Pekka!

Thank you for the feedback. In the end, I think we have a good basic
agreement about the patch, and after reading your linked bug (
https://gitlab.freedesktop.org/wayland/wayland/issues/12), which I was
previously unaware of, it sounds like I will be just stripping out the
callback+retry case from my patch.

I will probably go ahead and add a function to check for the half-full
output buffer, and one you didn't think of to check whether MAX_FDS_OUT is
about to be exceeded by the next call (that count is small enough to
actually be a concern).

At this point I think the only real question I have is one of checking the
intent behind the bug you filed. Should the design be that the ONLY behavior
is to set a fatal error on the display context, or should there be a public
API to select between the previous/default abort and the fatal error?

I'll send out a revised patch after hearing back with that bit of guidance.

In case it matters, I did also include my responses to your points inline
below. I don't expect to convince you that this patch should be used as is, but
perhaps it will give you or someone else some clarity and insight into why
I suggested the change I did.

Thanks!

- Lloyd

On Mon, Jun 18, 2018 at 4:58 AM Pekka Paalanen <ppaalanen at gmail.com> wrote:

> On Tue,  5 Jun 2018 18:14:54 -0700
> Lloyd Pique <lpique at google.com> wrote:
>
> > Introduce a new call wl_display_set_send_error_handler_client(), which allows
> > a client to register a callback function which is invoked if there is an
> > error (possibly transient) while sending messages to the server.
> >
> > This allows a Wayland client that is actually a nested server to try and
> > recover more gracefully from send errors, allowing it to disconnect
> > and reconnect to the server if necessary.
> >
> > The handler is called with the wl_display and the errno, and is expected
> > to return one of WL_SEND_ERROR_ABORT, WL_SEND_ERROR_FATAL, or
> > WL_SEND_ERROR_RETRY. The core-client code will then respectively abort()
> > the process (the default if no handler is set), set the display context
> > into an error state, or retry the send operation that failed.
> >
> > The existing display test is extended to exercise the new function.
>
> Hi Lloyd,
>
> technically this looks like a quality submission, thank you for that. I
> am particularly happy to see all the new cases are added into the test
> suite. I agree that calling wl_abort() is nasty and it would be good to
> have something else.
>
> However, I'm not quite convinced of the introduced complexity here.
> More comments inline.
>
> > Signed-off-by: Lloyd Pique <lpique at google.com>
> > ---
> >  COPYING                   |   1 +
> >  src/wayland-client-core.h |  75 +++++++++++++++++
> >  src/wayland-client.c      |  67 +++++++++++++++-
> >  tests/display-test.c      | 165 +++++++++++++++++++++++++++++++++++++-
> >  4 files changed, 306 insertions(+), 2 deletions(-)
> >
> > diff --git a/COPYING b/COPYING
> > index eb25a4e..bace5d7 100644
> > --- a/COPYING
> > +++ b/COPYING
> > @@ -2,6 +2,7 @@ Copyright © 2008-2012 Kristian Høgsberg
> >  Copyright © 2010-2012 Intel Corporation
> >  Copyright © 2011 Benjamin Franzke
> >  Copyright © 2012 Collabora, Ltd.
> > +Copyright © 2018 Google LLC
> >
> >  Permission is hereby granted, free of charge, to any person obtaining a
> >  copy of this software and associated documentation files (the
> "Software"),
> > diff --git a/src/wayland-client-core.h b/src/wayland-client-core.h
> > index 03e781b..a3e4b8e 100644
> > --- a/src/wayland-client-core.h
> > +++ b/src/wayland-client-core.h
> > @@ -257,6 +257,81 @@ wl_display_cancel_read(struct wl_display *display);
> >  int
> >  wl_display_read_events(struct wl_display *display);
> >
> > +/** \enum wl_send_error_strategy
> > + *
> > + * This enum are the valid values that can be returned from a
> > + * wl_send_error_handler_func_t
> > + *
> > + * \sa wl_send_error_handler_func_t
> > + */
> > +enum wl_send_error_strategy {
> > +     /** Abort the client */
> > +     WL_SEND_ERROR_ABORT,
> > +     /** Put the display into a fatal error state */
> > +     WL_SEND_ERROR_FATAL,
> > +     /** Retry the send operation */
> > +     WL_SEND_ERROR_RETRY,
> > +};
> > +
> > +/**
> > + * \brief A function pointer type for customizing client send error
> handling
> > + *
> > + * A client send error strategy handler is a callback function which
> the client
> > + * can set to direct the core client library how to respond if an error
> is
> > + * encountered while sending a message.
> > + *
> > + * Normally messages are enqueued to an output buffer by the core client
> > + * library, and the user of the core client library is expected to call
> > + * wl_display_flush() occasionally, but as that is non-blocking,
> messages can
> > + * queue up in the output buffer.  If later calls to send messages
> happen to
> > + * fill up the output buffer, the core client library will make an
> internal
> > + * call to flush the output buffer.  But if the write call unexpectedly
> fails,
> > + * as there is no good way to return an error to the caller, the core
> client
> > + * library will log a message and call abort().
>
> The analysis of the current situation is correct.
>
> If you know you are sending lots of requests, you should be calling
> wl_display_flush() occasionally. It will return -1 if flushing fails to
> let you know you need to wait. The main problem I see with that is when
> to attempt to flush.
>

We actually do have a call to wl_display_flush() in our larger
implementation, called frequently as part of ensuring everything is
committed to the display. It does currently handle a return of -1, but by
waking up the input event processing thread in case the error is fatal, as
that thread handles disconnecting and reconnecting.

On the surface, making it detect an EAGAIN and wait sounds reasonable. But I
wonder how often in normal use it would return EAGAIN, and so force a wait
that isn't needed while there is still some space in the core client output
buffer. My reason for adding the callback and the retry option was
specifically to only make the client block if there really was no other way
to avoid an error state.

That said, I actually am willing to detect EAGAIN on wl_display_flush(),
and explicitly wait then. In trying out this patch, I only ended up
artificially reproducing an EAGAIN error (taking a few seconds to trigger)
when I magnified the request traffic by a factor of about 1000x. That is a
pretty unreasonable magnification even for a nested server, so while it
could happen, I don't think this is actually that important to my project.
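
For concreteness, a minimal sketch of the "detect EAGAIN on flush and wait"
approach might look like the following. It is plain POSIX so that it stands
alone: try_flush() and struct pending_out are hypothetical stand-ins for
wl_display_flush() and the library's output buffer, the fd stands in for what
wl_display_get_fd() returns, and the timeout is an arbitrary choice.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for the library's pending output buffer. */
struct pending_out {
	const char *data;
	size_t len;
};

/* Stand-in for wl_display_flush(): write the pending data to fd.
 * Returns 0 when everything was written, -1 with errno set otherwise. */
static int
try_flush(int fd, struct pending_out *out)
{
	while (out->len > 0) {
		ssize_t n = write(fd, out->data, out->len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1; /* EAGAIN or a real error */
		}
		out->data += n;
		out->len -= (size_t)n;
	}
	return 0;
}

/* Flush, and wait for the fd to become writable only on EAGAIN.
 * Returns 0 on success, -1 on a fatal error or timeout (errno set). */
static int
flush_or_wait(int fd, struct pending_out *out, int timeout_ms)
{
	while (try_flush(fd, out) < 0) {
		struct pollfd pfd = { .fd = fd, .events = POLLOUT };
		int ready;

		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1; /* fatal: caller should disconnect */

		do {
			ready = poll(&pfd, 1, timeout_ms);
		} while (ready < 0 && errno == EINTR);

		if (ready < 0)
			return -1;
		if (ready == 0) {
			errno = ETIMEDOUT;
			return -1;
		}
		/* Writable (or in error): loop and try the flush again. */
	}
	return 0;
}
```

The caller decides what a -1 means: with ETIMEDOUT it could come back on the
next main loop iteration, while any other errno would go down the
disconnect/reconnect path.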

In our case, the Wayland server we connect to is actually a small part of a
much larger Chrome UI process on ChromeOS. Despite being a user-space
process, it is not likely to be delayed in its processing of client
requests, and it seemed much more likely that the connection was dropping
for other reasons. Unfortunately I have no data on correlating our
anonymous crash reports (call stack only) with any data on why the Chrome
processes might have restarted (update? crash?). At least a clean reconnect
is much better than a wl_abort() for us, though still not a nice user
experience.

> I remember some discussion from the past, where we would make the
> soft-buffer in libwayland-client larger and introduce a check API which
> would return whether, say, more than half of the space is already
> consumed. That would work as a trigger for an extra flush at a point
> convenient to the client.
>
> I remember discussing something like that with Daniel Stone, but I
> can't find it. Might have been in IRC.
>
> Or maybe it was more like, double the soft-buffer size, attempt an
> automatic flush if it is more than half-full, and if that fails,
> continue and raise a flag that the user can query at a convenient time
> and do the dance for manual flush and poll for writable. The remaining
> soft-buffer should hopefully allow the manual flush before the buffer
> runs out.
>

I believe I saw the exchange at one point in an IRC archive while looking
for existing known problems, though I thought I remembered seeing it
discussed for the receive buffers.
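
As I read the second variant described above, it could be sketched roughly
as below. Everything here is hypothetical: struct soft_buf, the flush
callback type, and the two stand-in flush functions are made up for
illustration, and the 8192-byte size only assumes a doubling of a 4096-byte
buffer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SOFT_BUF_SIZE 8192 /* assumed: double a 4096-byte buffer */

struct soft_buf {
	char data[SOFT_BUF_SIZE];
	size_t used;
	bool flush_needed; /* raised when an automatic flush failed */
};

/* Hypothetical flush callback: returns 0 and empties the buffer on
 * success, -1 if the socket would block. */
typedef int (*flush_fn)(struct soft_buf *buf);

/* Stand-in flush implementations, for illustration only. */
static int flush_ok(struct soft_buf *buf) { buf->used = 0; return 0; }
static int flush_blocked(struct soft_buf *buf) { (void)buf; return -1; }

/* The check API: is more than half of the space consumed? */
static bool
soft_buf_half_full(const struct soft_buf *buf)
{
	return buf->used > SOFT_BUF_SIZE / 2;
}

/* Queue a message. Past the halfway mark, attempt an automatic flush;
 * if that fails, carry on and raise a flag the caller can query later.
 * Returns -1 only when the message no longer fits at all. */
static int
soft_buf_put(struct soft_buf *buf, const void *msg, size_t len,
	     flush_fn flush)
{
	if (len > SOFT_BUF_SIZE - buf->used)
		return -1;

	memcpy(buf->data + buf->used, msg, len);
	buf->used += len;

	if (soft_buf_half_full(buf))
		buf->flush_needed = (flush(buf) != 0);

	return 0;
}
```

The client would check flush_needed at a point where it is able to block,
and do the manual flush and poll-for-writable dance there.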

Thinking about it afresh, maybe that is reasonable as a general option for
clients. The one caveat I could think of has to do with MAX_FDS_OUT and it
being only 28. My project does use the linux-dmabuf-unstable-v1 protocol,
and having more than 14 surfaces in a nested server seems like it could
actually happen often. And on second thought, I could see my client filling
up all 28 in one frame, thanks to perhaps an ill-timed flood of
notifications in addition to normal UI surfaces.

Hmm, the patch I've uploaded does catch that, but if I remove the RETRY
option I think I will need to add another special-case
wl_display_flush() check when creating buffers. In my case I would actually
prefer to find out that the fd buffer is completely full before
waiting/flushing though.
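
A sketch of the kind of fd check I have in mind is below. The names are
hypothetical and the bookkeeping is shown on the client side only for
illustration; in the library the count would come from the connection's fd
buffer. The limit of 28 is the MAX_FDS_OUT value mentioned above. Note that
a single dmabuf-backed buffer may need more than one fd (one per plane), so
the check takes a count.

```c
#include <stdbool.h>

#define MAX_FDS_OUT 28 /* value quoted above */

/* Hypothetical bookkeeping: fds queued since the last successful flush. */
struct fd_queue {
	int pending;
};

/* True if queuing `needed` more fds would exceed the limit, meaning
 * the caller has to flush (and possibly wait) before the next request. */
static bool
fd_queue_would_overflow(const struct fd_queue *q, int needed)
{
	return needed > MAX_FDS_OUT - q->pending;
}

/* Record that `count` fds were queued by a request. */
static void
fd_queue_add(struct fd_queue *q, int count)
{
	q->pending += count;
}

/* Record that a flush sent everything that was queued. */
static void
fd_queue_flushed(struct fd_queue *q)
{
	q->pending = 0;
}
```

This reports "full" rather than "half full", which matches my preference of
only waiting when there really is no room left.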

> > + *
> > + * If instead a send error strategy function is set, the core client
> library
> > + * will call it to allow the client to choose a different strategy.
> > + *
> > + * The function takes two arguments.  The first is the display context
> for the
> > + * error.  The second is the errno for the failure that occurred.
> > + *
> > + * The handler is expected to return one of the wl_send_error_strategy
> values
> > + * to indicate how the core client library should proceed.
>
> This is the detail that bothers me: making the callback to determine
> policy. You have no idea in what context that callback will be made.
>

Sure, but I meant this to be a choice available to the author of the larger
client, assuming they had full knowledge of what the consequences of doing
anything non-trivial were.

> > + *
> > + * If the handler returns WL_SEND_ERROR_ABORT, the core client code
> will
> > + * perform the default action -- log an error and then call the C
> runtime
> > + * abort() function.
>
> This is the old behaviour, which we agree is not nice.
>
> > + *
> > + * If the handler returns WL_SEND_ERROR_FATAL, the core client code
> will
> > + * continue, but the display context will be put into a fatal error
> state, and
> > + * the display should no longer be used. Further calls to send messages
> > + * will function, but the message will not actually be enqueued.
>
> This is the behaviour I would prefer for unrecoverable errors. I
> actually wrote that down recently:
> https://gitlab.freedesktop.org/wayland/wayland/issues/12
>
>
I hadn't seen that bug before now, and it exactly matches the conditions I
observed. Hence my above response that I will go ahead and strip down my
change to something simpler.

>
> > + *
> > + * If the handler returns WL_SEND_ERROR_RETRY, the send operation is
> retried
> > + * immediately on return, and may fail again.  If it does, the handler
> will be
> > + * called again with the new errno. To avoid a busy loop, the handler
> should
> > + * wait on the display connection, or something equivalent.  If after
> returning
> > + * WL_SEND_ERROR_RETRY the send operation succeeds, the handler is
> called one
> > + * last time with an errno of zero. This allows the handler to perform
> any
> > + * necessary cleanup now that the send operation has succeeded. For
> this final
> > + * call, the return value is ignored.
> > + *
> > + * Note that if the handler blocks, the current thread that is making
> the call
> > + * to send a message will of course be blocked. This means the default
> behavior
> > + * of Wayland client calls being non-blocking will no longer be true.
> > + *
> > + * In a multi threaded environment, this error handler could be called
> once for
> > + * every thread that attempts to send a message on the display and
> encounters
> > + * the error with the connection. It is the responsibility of the
> handler
> > + * function to be itself thread safe.
>
> This third case is the red flag for me.
>
> How could someone use this efficiently in real life? Do you have
> example users for this?
>

Unfortunately that code is not open source, otherwise I would point you at
it.

My actual use case is a self-contained process that happens to use Wayland
for input and output, and the use of Wayland there is an internal
implementation detail. While my client is a nested server, it doesn't
expose a Wayland interface to its clients -- I had to support another
interface entirely.

To me there only seems to be one real general pattern for a callback
handler that might implement the retry strategy. It is just slightly more
complex than the retry_with_poll_handler() from the test I added.

To sketch it out in quick pseudo-code, it would be:

enum wl_send_error_strategy
typical_retry_handler(struct wl_display *display, int send_errno)
{
        if (send_errno == 0) { /* retry succeeded */
                /* clear retry start timestamp */
                return WL_SEND_ERROR_FATAL; /* ignored for this final call */
        }

        if (send_errno != EAGAIN)
                return WL_SEND_ERROR_FATAL;

        if ( /* retry timestamp is not yet set */ ) {
                /* set it */
        }

        if (now - timestamp > limit)
                return WL_SEND_ERROR_FATAL;

        /* poll, using the timestamp to set a timeout */

        return WL_SEND_ERROR_RETRY;
}

If your objection is that you would rather this handler be standardized, I
can move it into the patch itself, use the existing core client mutex, and
have the timestamp be part of the wl_display state. Multiple threads
would block on the send, but they would have done so anyway. Then there
could be a public API that selects between the default (abort), error, or
error+retry implemented by the above pseudocode.

But as you said earlier, I will really instead have the wl_client_flush()
do the poll().

>
> How does the callback know at what kind of point of execution you are
> in your application? Can you block there? Can you deal with
> disconnection? What if the trigger fires inside a library you use, like
> libEGL? Which holds mutexes of its own? What if it happens in a thread
> you didn't expect, e.g. from GStreamer?
>
> You'd probably need to maintain global flag variables across code
> blocks. If you maintain something across code blocks, would it not be
> easier instead to try a manual flush at the end of a block and see if
> that works?
>

I agree that the retry option would only be useful in fairly restricted
cases, and not all clients could actually use it.

For the client I am dealing with, it is a valid option -- we know where all
the send calls are made, and are aware of the consequences of holding the
mutexes we do hold, and we are willing to block rather than immediately fail
the connection if it might recover.

But again, I certainly cannot justify this part of the change by saying we
need it, because I'm fairly certain we do not (though perhaps I do need to
worry about fd's). For simplicity then I see no point really in arguing
further, and will strip out the retry option from my patch.

> If flushing is inefficient, we could look into adding ABI to get
> information about the buffer status, so you can make a better choice on
> whether to flush or not.
>
> If you attempt a manual flush and it fails, you will have more freedom
> to handle that situation than with the callback you propose. You can do
> everything you could with the callback and you could also choose to
> return from the request sending code block and come back later, e.g. on
> the next round of your main event loop or when the Wayland fd signals
> writable in poll() again.
>
> Interrupting what you are doing is much easier if you can pick the
> points where it is possible. It is hard to be able to handle a
> potential callback every time you call a request sending function.
>
> Especially if the client is a nested server, you probably would not
> want to block on the Wayland connection because that could prevent you
> from servicing your own clients.
>
>
Interestingly, while blocking there does prevent the display from being
updated, it doesn't (immediately) block our clients, though they might
starve for buffers fairly soon after, since none would be released. As a
still relatively uncommon error case, blocking briefly (and then
disconnecting/reconnecting if needed) is better than taking down everyone
due to the wl_abort().

> Do you think the alternative solutions could be workable for you?
>

As above, yes.

>
> I would be very happy to see someone working on those, this issue has
> existed since the beginning.
>
>
I'm happy to help.

>
> Thanks,
> pq
>
>
> > + */
> > +typedef enum wl_send_error_strategy
> > +     (*wl_send_error_handler_func_t)(struct wl_display *, int /* errno
> */);
> > +
> > +void
> > +wl_display_set_send_error_handler_client(struct wl_display *display,
> > +                                      wl_send_error_handler_func_t
> handler);
> > +
> >  void
> >  wl_log_set_handler_client(wl_log_func_t handler);
> >
> > diff --git a/src/wayland-client.c b/src/wayland-client.c
> > index efeb745..3a9db6f 100644
> > --- a/src/wayland-client.c
> > +++ b/src/wayland-client.c
> > @@ -113,6 +113,8 @@ struct wl_display {
> >       int reader_count;
> >       uint32_t read_serial;
> >       pthread_cond_t reader_cond;
> > +
> > +     wl_send_error_handler_func_t send_error_handler;
> >  };
> >
> >  /** \endcond */
> > @@ -696,6 +698,44 @@ wl_proxy_marshal_array_constructor(struct wl_proxy
> *proxy,
> >
>  proxy->version);
> >  }
> >
> > +static enum wl_send_error_strategy
> > +display_send_error_handler_invoke(struct wl_display *display, int
> send_errno)
> > +{
> > +     enum wl_send_error_strategy ret = WL_SEND_ERROR_ABORT;
> > +     wl_send_error_handler_func_t handler = display->send_error_handler;
> > +     if (handler) {
> > +             pthread_mutex_unlock(&display->mutex);
> > +             ret = handler(display, send_errno);
> > +             pthread_mutex_lock(&display->mutex);
> > +     }
> > +     return ret;
> > +}
> > +
> > +static void
> > +display_closure_send_error(struct wl_display *display,
> > +                        struct wl_closure *closure)
> > +{
> > +     while (true) {
> > +             int send_errno = errno;
> > +             switch (display_send_error_handler_invoke(display,
> send_errno)) {
> > +             case WL_SEND_ERROR_FATAL:
> > +                     display_fatal_error(display, send_errno);
> > +                     return;
> > +
> > +             case WL_SEND_ERROR_RETRY:
> > +                     if (!wl_closure_send(closure,
> display->connection)) {
> > +                             display_send_error_handler_invoke(display,
> 0);
> > +                             return;
> > +                     }
> > +                     break;
> > +
> > +             case WL_SEND_ERROR_ABORT:
> > +             default:
> > +                     wl_abort("Error sending request: %s\n",
> > +                              strerror(send_errno));
> > +             }
> > +     }
> > +}
> >
> >  /** Prepare a request to be sent to the compositor
> >   *
> > @@ -743,6 +783,9 @@ wl_proxy_marshal_array_constructor_versioned(struct
> wl_proxy *proxy,
> >                       goto err_unlock;
> >       }
> >
> > +     if (proxy->display->last_error)
> > +             goto err_unlock;
> > +
> >       closure = wl_closure_marshal(&proxy->object, opcode, args,
> message);
> >       if (closure == NULL)
> >               wl_abort("Error marshalling request: %s\n",
> strerror(errno));
> > @@ -751,7 +794,7 @@ wl_proxy_marshal_array_constructor_versioned(struct
> wl_proxy *proxy,
> >               wl_closure_print(closure, &proxy->object, true);
> >
> >       if (wl_closure_send(closure, proxy->display->connection))
> > -             wl_abort("Error sending request: %s\n", strerror(errno));
> > +             display_closure_send_error(proxy->display, closure);
> >
> >       wl_closure_destroy(closure);
> >
> > @@ -2000,6 +2043,28 @@ wl_display_flush(struct wl_display *display)
> >       return ret;
> >  }
> >
> > +/** Sets the error handler to invoke when a send fails internally
> > +*
> > +* \param display The display context object
> > +* \param handler The function to call when an error occurs.
> > +*
> > +* Sets the error handler to call to decide what should happen on a send
> error.
> > +*
> > +* If no handler is set, the client will use the WL_SEND_ERROR_ABORT
> strategy.
> > +* This means the core client code will call abort() if it encounters a
> failure
> > +* while sending a message to the server.
> > +*
> > +* Setting a handler allows a client implementation to alter this
> behavior.
> > +*
> > +* \memberof wl_display
> > +*/
> > +WL_EXPORT void
> > +wl_display_set_send_error_handler_client(struct wl_display *display,
> > +                                      wl_send_error_handler_func_t
> handler)
> > +{
> > +     display->send_error_handler = handler;
> > +}
> > +
> >  /** Set the user data associated with a proxy
> >   *
> >   * \param proxy The proxy object
>