[PATCH] client: Allow send error recovery without an abort

Lloyd Pique lpique at google.com
Tue Jun 19 02:54:23 UTC 2018


Let me take things back a step. I was a bit too hasty in suggesting
something that works for my client despite MAX_FDS_OUT being small. In
our client, buffer creation ends up being serialized, so only one thread
will be creating buffers with the linux-dmabuf protocol at a time.

That might not be true for other clients, so checking whether the fd buffer
is almost full won't work. In fact, I'm not so certain checking whether it
is half full would work either... having more than 14 threads trying to
create such buffers at the same time does not seem unreasonable, and if
more than one fd is involved in each request (multi-plane formats), the
number of threads needed to hit the wl_abort() anyway drops quickly. And
there may be some other protocol I'm not aware of that is worse.

Increasing the fd buffer would alleviate that. However, it would introduce
another problem.

We also see wl_abort()s when the call to wl_os_dupfd_cloexec() in
wl_closure_marshal() fails. Not very often, and we are willing to pass on
finding a fix for that for now, but increasing the number of fds being held
for the transfer is definitely going to make it worse.

Perhaps making the writes blocking is the only reasonable way after all?
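
To make that concrete: a hand-wavy sketch of the blocking variant, dropped
in where wl_proxy_marshal_array_constructor_versioned() currently calls
wl_abort() in wayland-client.c (untested, and it assumes the display's fd
is reachable there as proxy->display->fd):

	while (wl_closure_send(closure, proxy->display->connection)) {
		struct pollfd pfd = { .fd = proxy->display->fd,
				      .events = POLLOUT };

		/* Anything other than "the socket would block" stays
		 * fatal. */
		if (errno != EAGAIN)
			wl_abort("Error sending request: %s\n",
				 strerror(errno));

		/* Block this thread until the compositor drains the
		 * socket, then retry the send. */
		poll(&pfd, 1, -1);
	}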

What do you think?


On Mon, Jun 18, 2018 at 6:41 PM Lloyd Pique <lpique at google.com> wrote:

> Hi Pekka!
>
> Thank you for the feedback. In the end, I think we have a good basic
> agreement about the patch, and after reading your linked bug (
> https://gitlab.freedesktop.org/wayland/wayland/issues/12), which I was
> previously unaware of, it sounds like I will be just stripping out the
> callback+retry case from my patch.
>
> I will probably go ahead and add a function to check for the half-full
> output buffer, and another, which you didn't mention, to check whether
> MAX_FDS_OUT is about to be exceeded by the next call (that count is small
> enough to actually be a concern).
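>
> To sketch what I mean (hypothetical declarations; none of this exists in
> the library today):
>
> /* Returns non-zero if more than half of the display's output buffer is
>  * already in use. */
> int
> wl_display_output_buffer_nearly_full(struct wl_display *display);
>
> /* Returns non-zero if queuing a request carrying nfds more file
>  * descriptors would exceed MAX_FDS_OUT (currently 28). */
> int
> wl_display_would_exceed_fd_limit(struct wl_display *display, int nfds);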
>
> At this point I think the only real question I have is one about the
> intent behind the bug you filed. Should the ONLY behavior be to set a
> fatal error on the display context, or should there be a public API to
> select between the previous/default abort and the fatal error?
>
> I'll send out a revised patch after hearing back with that bit of guidance.
>
> In case it matters, I did also include my responses to your points inline
> below. I don't expect to convince you that this patch should be used as
> is, but perhaps it will give you or someone else some clarity and insight
> into why I suggested the change I did.
>
> Thanks!
>
> - Lloyd
>
> On Mon, Jun 18, 2018 at 4:58 AM Pekka Paalanen <ppaalanen at gmail.com>
> wrote:
>
>> On Tue,  5 Jun 2018 18:14:54 -0700
>> Lloyd Pique <lpique at google.com> wrote:
>>
>> > Introduce a new call wl_display_set_error_handler_client(), which allows
>> > a client to register a callback function which is invoked if there is an
>> > error (possibly transient) while sending messages to the server.
>> >
>> > This allows a Wayland client that is actually a nested server to try and
>> > recover more gracefully from send errors, allowing it to disconnect
>> > and reconnect to the server if necessary.
>> >
>> > The handler is called with the wl_display and the errno, and is expected
>> > to return one of WL_SEND_ERROR_ABORT, WL_SEND_ERROR_FATAL, or
>> > WL_SEND_ERROR_RETRY. The core-client code will then respectively abort()
>> > the process (the default if no handler is set), set the display context
>> > into an error state, or retry the send operation that failed.
>> >
>> > The existing display test is extended to exercise the new function.
>>
>> Hi Lloyd,
>>
>> technically this looks like a quality submission, thank you for that. I
>> am particularly happy to see all the new cases are added into the test
>> suite. I agree that calling wl_abort() is nasty and it would be good to
>> have something else.
>>
>> However, I'm not quite convinced of the introduced complexity here.
>> More comments inline.
>>
>> > Signed-off-by: Lloyd Pique <lpique at google.com>
>> > ---
>> >  COPYING                   |   1 +
>> >  src/wayland-client-core.h |  75 +++++++++++++++++
>> >  src/wayland-client.c      |  67 +++++++++++++++-
>> >  tests/display-test.c      | 165 +++++++++++++++++++++++++++++++++++++-
>> >  4 files changed, 306 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/COPYING b/COPYING
>> > index eb25a4e..bace5d7 100644
>> > --- a/COPYING
>> > +++ b/COPYING
>> > @@ -2,6 +2,7 @@ Copyright © 2008-2012 Kristian Høgsberg
>> >  Copyright © 2010-2012 Intel Corporation
>> >  Copyright © 2011 Benjamin Franzke
>> >  Copyright © 2012 Collabora, Ltd.
>> > +Copyright © 2018 Google LLC
>> >
>> >  Permission is hereby granted, free of charge, to any person obtaining a
>> >  copy of this software and associated documentation files (the "Software"),
>> > diff --git a/src/wayland-client-core.h b/src/wayland-client-core.h
>> > index 03e781b..a3e4b8e 100644
>> > --- a/src/wayland-client-core.h
>> > +++ b/src/wayland-client-core.h
>> > @@ -257,6 +257,81 @@ wl_display_cancel_read(struct wl_display *display);
>> >  int
>> >  wl_display_read_events(struct wl_display *display);
>> >
>> > +/** \enum wl_send_error_strategy
>> > + *
>> > + * This enum are the valid values that can be returned from a
>> > + * wl_send_error_handler_func_t
>> > + *
>> > + * \sa wl_send_error_handler_func_t
>> > + */
>> > +enum wl_send_error_strategy {
>> > +     /** Abort the client */
>> > +     WL_SEND_ERROR_ABORT,
>> > +     /** Put the display into a fatal error state */
>> > +     WL_SEND_ERROR_FATAL,
>> > +     /** Retry the send operation */
>> > +     WL_SEND_ERROR_RETRY,
>> > +};
>> > +
>> > +/**
>> > + * \brief A function pointer type for customizing client send error handling
>> > + *
>> > + * A client send error strategy handler is a callback function which the client
>> > + * can set to direct the core client library how to respond if an error is
>> > + * encountered while sending a message.
>> > + *
>> > + * Normally messages are enqueued to an output buffer by the core client
>> > + * library, and the user of the core client library is expected to call
>> > + * wl_display_flush() occasionally, but as that is non-blocking, messages can
>> > + * queue up in the output buffer.  If later calls to send messages happen to
>> > + * fill up the output buffer, the core client library will make an internal
>> > + * call to flush the output buffer.  But if the write call unexpectedly fails,
>> > + * as there is no good way to return an error to the caller, the core client
>> > + * library will log a message and call abort().
>>
>> The analysis of the current situation is correct.
>>
>> If you know you are sending lots of requests, you should be calling
>> wl_display_flush() occasionally. It will return -1 if flushing fails to
>> let you know you need to wait. The main problem I see with that is when
>> to attempt to flush.
>>
>
> We actually do have a call to wl_display_flush() in our larger
> implementation, called frequently as part of ensuring everything is
> committed to the display. It does currently handle a return of -1, but
> only by waking up the input event processing thread in case the error is
> fatal, since that thread handles disconnecting and reconnecting.
>
> On the surface, making it detect an EAGAIN and wait kind of sounds
> reasonable. But I wonder how often in normal use it would return EAGAIN
> and so force a wait that isn't actually needed, when there is still some
> space left in the core client output buffer. My reason for adding the
> callback and the retry option was specifically to make the client block
> only if there really was no other way to avoid an error state.
>
> That said, I actually am willing to detect EAGAIN on wl_display_flush(),
> and explicitly wait then. In trying out this patch, I only ended up
> artificially reproducing an EAGAIN error (taking a few seconds to trigger)
> when I magnified the request traffic by a factor of about 1000x. That is a
> pretty unreasonable magnification even for a nested server, so while it
> could happen, I don't think this is actually that important to my project.
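>
> Concretely, the wait I have in mind is little more than this (untested
> sketch):
>
> #include <errno.h>
> #include <poll.h>
> #include <wayland-client-core.h>
>
> /* Block until libwayland's output buffer is fully flushed. Returns 0 on
>  * success, or -1 on a real error, after which the display should be
>  * treated as failed. */
> static int
> flush_blocking(struct wl_display *display)
> {
>         struct pollfd pfd = { .fd = wl_display_get_fd(display),
>                               .events = POLLOUT };
>
>         /* wl_display_flush() returns -1 with errno == EAGAIN when it
>          * could only partially drain the buffer. */
>         while (wl_display_flush(display) < 0) {
>                 if (errno != EAGAIN)
>                         return -1;
>                 if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
>                         return -1;
>         }
>         return 0;
> }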
>
> In our case, the Wayland server we connect to is actually a small part of
> a much larger Chrome UI process on ChromeOS. Despite being a user-space
> process, it is not likely to be delayed in its processing of client
> requests, and it seemed much more likely that the connection was dropping
> for other reasons. Unfortunately I have no way of correlating our
> anonymous crash reports (call stack only) with any data on why the Chrome
> process might have restarted (update? crash?). At least a clean reconnect
> is much better than a wl_abort() for us, though still not a nice user
> experience.
>
>
>> I remember some discussion from the past, where we would make the
>> soft-buffer in libwayland-client larger and introduce a check API which
>> would return whether, say, more than half of the space is already
>> consumed. That would work as a trigger for an extra flush at a point
>> convenient to the client.
>
>
>> I remember discussing something like that with Daniel Stone, but I
>> can't find it. Might have been in IRC.
>>
>> Or maybe it was more like, double the soft-buffer size, attempt an
>> automatic flush if it is more than half-full, and if that fails,
>> continue and raise a flag that the user can query at a convenient time
>> and do the dance for manual flush and poll for writable. The remaining
>> soft-buffer should hopefully allow the manual flush before the buffer
>> runs out.
>>
>
> I believe I saw the exchange at one point in an IRC archive while looking
> for existing known problems, though I thought I remembered it being about
> the receive buffers.
>
> Thinking about it afresh, maybe that is reasonable as a general option for
> clients. The one caveat I can think of has to do with MAX_FDS_OUT, and it
> being only 28. My project does use the linux-dmabuf-unstable-v1 protocol,
> and having more than 14 surfaces in a nested server seems like it could
> actually happen often. And on second thought, I could see my client
> filling up all 28 in one frame, thanks to perhaps an ill-timed flood of
> notifications in addition to normal UI surfaces.
>
> Hmm, the patch I've uploaded does catch that, but if I remove the RETRY
> option I think I will need to add another special-case wl_display_flush()
> check when creating buffers, as in the sketch below. In my case I would
> actually prefer to find out that the fd buffer is completely full before
> waiting/flushing, though.
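>
> As a rough sketch of what I mean (hypothetical code: the two
> wl_display_* checks are the invented API from above, flush_blocking()
> is the sketch from earlier, and create_dmabuf_buffer() /
> handle_fatal_display_error() are stand-ins for my client's own helpers).
> Note the arithmetic: an NV12 buffer carries 2 fds, so just 14 buffer
> creations can exhaust MAX_FDS_OUT (28) within a single flush window.
>
> static void
> create_dmabuf_buffer(struct my_client *client, int nplanes)
> {
>         struct wl_display *display = client->display;
>
>         /* Flush -- blocking if necessary -- before this request could
>          * overflow the fd buffer or the output buffer mid-marshal. */
>         if (wl_display_would_exceed_fd_limit(display, nplanes) ||
>             wl_display_output_buffer_nearly_full(display)) {
>                 if (flush_blocking(display) < 0) {
>                         handle_fatal_display_error(client);
>                         return;
>                 }
>         }
>
>         /* ... zwp_linux_dmabuf_v1_create_params(), then one
>          * zwp_linux_buffer_params_v1_add() call per plane fd ... */
> }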
>
>
>> > + *
>> > + * If instead a send error strategy function is set, the core client library
>> > + * will call it to allow the client to choose a different strategy.
>> > + *
>> > + * The function takes two arguments.  The first is the display context for the
>> > + * error.  The second is the errno for the failure that occurred.
>> > + *
>> > + * The handler is expected to return one of the wl_send_error_strategy values
>> > + * to indicate how the core client library should proceed.
>>
>> This is the detail that bothers me: making the callback to determine
>> policy. You have no idea in what context that callback will be made.
>>
>
> Sure, but I meant this to be a choice available to the author of the
> larger client, assuming they have full knowledge of the consequences of
> doing anything non-trivial there.
>
>
>> > + *
>> > + * If the handler returns WL_CLIENT_ERROR_ABORT, the core client code will
>> > + * perform the default action -- log an error and then call the C runtime
>> > + * abort() function.
>>
>> This is the old behaviour, which we agree is not nice.
>>
>> > + *
>> > + * If the handler returns WL_CLIENT_ERROR_FATAL, the core client code will
>> > + * continue, but the display context will be put into a fatal error state, and
>> > + * the display should no longer be used. Further calls to send messages
>> > + * will function, but the message will not actually be enqueued.
>>
>> This is the behaviour I would prefer for unrecoverable errors. I
>> actually wrote that down recently:
>> https://gitlab.freedesktop.org/wayland/wayland/issues/12
>>
>>
> I hadn't seen that bug before now, and it exactly matches the conditions I
> observed. Hence my above response that I will go ahead and strip down my
> change to something simpler.
>
>
>>
>> > + *
>> > + * If the handler returns WL_CLIENT_ERROR_RETRY, the send operation is retried
>> > + * immediately on return, and may fail again.  If it does, the handler will be
>> > + * called again with the new errno. To avoid a busy loop, the handler should
>> > + * wait on the display connection, or something equivalent.  If after returning
>> > + * WL_CLIENT_ERROR_RETRY the send operation succeeds, the handler is called one
>> > + * last time with an errno of zero. This allows the handler to perform any
>> > + * necessary cleanup now that the send operation has succeeded. For this final
>> > + * call, the return value is ignored.
>> > + *
>> > + * Note that if the handler blocks, the current thread that is making the call
>> > + * to send a message will of course be blocked. This means the default behavior
>> > + * of Wayland client calls being non-blocking will no longer be true.
>> > + *
>> > + * In a multi-threaded environment, this error handler could be called once for
>> > + * every thread that attempts to send a message on the display and encounters
>> > + * the error with the connection. It is the responsibility of the handler
>> > + * function to itself be thread safe.
>>
>> This third case is the red flag for me.
>>
>> How could someone use this efficiently in real life? Do you have
>> example users for this?
>>
>
> Unfortunately that code is not open source, otherwise I would point you at
> it.
>
> My actual use case is a self-contained process that happens to use Wayland
> for input and output, where the use of Wayland is an internal
> implementation detail. While my client is a nested server, it doesn't
> expose a Wayland interface to its own clients -- I had to support another
> interface entirely.
>
> To me there seems to be only one real general pattern for a callback
> handler that implements the retry strategy. It is just slightly more
> complex than the retry_with_poll_handler() from the test I added.
>
> To sketch it out in quick pseudo-code, it would be:
>
>
> enum wl_send_error_strategy
> typical_retry_handler(struct wl_display *display, int send_errno)
> {
>         if (send_errno == 0) { /* the retry succeeded */
>                 /* clear the retry start timestamp */
>                 return WL_SEND_ERROR_RETRY; /* return value is ignored */
>         }
>
>         if (send_errno != EAGAIN)
>                 return WL_SEND_ERROR_FATAL;
>
>         if ( /* retry start timestamp is not yet set */ ) {
>                 /* set it */
>         }
>
>         if (now - timestamp > limit)
>                 return WL_SEND_ERROR_FATAL;
>
>         /* poll, using the timestamp to set a timeout */
>
>         return WL_SEND_ERROR_RETRY;
> }
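>
> Fleshed out, that handler might look like the following (untested
> sketch; assumes the enum and callback signature from my patch, uses a
> single global because my process only has the one wl_display, and gives
> up after five seconds):
>
> #include <errno.h>
> #include <poll.h>
> #include <string.h>
> #include <time.h>
>
> static struct timespec retry_start; /* zeroed when no retry is pending */
>
> static enum wl_send_error_strategy
> typical_retry_handler(struct wl_display *display, int send_errno)
> {
>         const long limit_ms = 5000;
>         struct pollfd pfd = { .fd = wl_display_get_fd(display),
>                               .events = POLLOUT };
>         struct timespec now;
>         long elapsed_ms;
>
>         if (send_errno == 0) { /* the retry succeeded */
>                 memset(&retry_start, 0, sizeof(retry_start));
>                 return WL_SEND_ERROR_RETRY; /* return value is ignored */
>         }
>
>         if (send_errno != EAGAIN)
>                 return WL_SEND_ERROR_FATAL;
>
>         clock_gettime(CLOCK_MONOTONIC, &now);
>         if (retry_start.tv_sec == 0 && retry_start.tv_nsec == 0)
>                 retry_start = now;
>
>         elapsed_ms = (now.tv_sec - retry_start.tv_sec) * 1000 +
>                      (now.tv_nsec - retry_start.tv_nsec) / 1000000;
>         if (elapsed_ms > limit_ms)
>                 return WL_SEND_ERROR_FATAL;
>
>         /* Wait for the socket to become writable, then retry. */
>         poll(&pfd, 1, (int)(limit_ms - elapsed_ms));
>         return WL_SEND_ERROR_RETRY;
> }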
>
>
> If your objection is that you would rather this handler be standardized, I
> can move it into the patch itself, use the existing core client mutex, and
> make the timestamp part of the wl_display state. Multiple threads would
> block on the send, but they would have done so anyway. Then there could be
> a public API that selects between the default (abort), error, or
> error+retry behavior implemented by the above pseudo-code.
>
> But as you said earlier, I will really instead have the wl_display_flush()
> caller do the poll().
>
>
>>
>> How does the callback know at what kind of point of execution you are
>> in your application? Can you block there? Can you deal with
>> disconnection? What if the trigger fires inside a library you use, like
>> libEGL? Which holds mutexes of its own? What if it happens in a thread
>> you didn't expect, e.g. from GStreamer?
>>
>> You'd probably need to maintain global flag variables across code
>> blocks. If you maintain something across code blocks, would it not be
>> easier instead to try a manual flush at the end of a block and see if
>> that works?
>>
>
> I agree that the retry option would only be useful in fairly restricted
> cases, and not all clients could actually use it.
>
> For the client I am dealing with, it is a valid option -- we know where
> all the send calls are made, we are aware of the consequences of holding
> the mutexes we do hold, and we are willing to block rather than
> immediately failing the connection if it might recover.
>
> But again, I certainly cannot justify this part of the change by saying we
> need it, because I'm fairly certain we do not (though perhaps I do need to
> worry about fds). For simplicity, then, I see no real point in arguing
> further, and will strip the retry option out of my patch.
>
>
>> If flushing is inefficient, we could look into adding ABI to get
>> information about the buffer status, so you can make a better choice on
>> whether to flush or not.
>>
>> If you attempt a manual flush and it fails, you will have more freedom
>> to handle that situation than with the callback you propose. You can do
>> everything you could with the callback and you could also choose to
>> return from the request sending code block and come back later, e.g. on
>> the next round of your main event loop or when then Wayland fd signals
>> writable in poll() again.
>
>
>> Interrupting what you are doing is much easier if you can pick the
>> points where it is possible. It is hard to be able to handle a
>> potential callback every time you call a request sending function.
>
>
>
>> Especially if the client is a nested server, you probably would not
>> want to block on the Wayland connection because that could prevent you
>> from servicing your own clients.
>>
>>
> Interestingly, while having that block does prevent the display from being
> updated, it doesn't (immediately) block our clients, though they might
> soon starve for buffers that are not being released. As this is still a
> relatively uncommon error case, blocking briefly (and then
> disconnecting/reconnecting if needed) is better than taking down everyone
> with the wl_abort().
>
>
>> Do you think the alternative solutions could be workable for you?
>>
>
> As above, yes.
>
>
>>
>> I would be very happy to see someone working on those, this issue has
>> existed since the beginning.
>>
>>
> I'm happy to help.
>
>
>>
>> Thanks,
>> pq
>>
>>
>> > + */
>> > +typedef enum wl_send_error_strategy
>> > +     (*wl_send_error_handler_func_t)(struct wl_display *, int /* errno */);
>> > +
>> > +void
>> > +wl_display_set_send_error_handler_client(struct wl_display *display,
>> > +                                      wl_send_error_handler_func_t handler);
>> > +
>> >  void
>> >  wl_log_set_handler_client(wl_log_func_t handler);
>> >
>> > diff --git a/src/wayland-client.c b/src/wayland-client.c
>> > index efeb745..3a9db6f 100644
>> > --- a/src/wayland-client.c
>> > +++ b/src/wayland-client.c
>> > @@ -113,6 +113,8 @@ struct wl_display {
>> >       int reader_count;
>> >       uint32_t read_serial;
>> >       pthread_cond_t reader_cond;
>> > +
>> > +     wl_send_error_handler_func_t send_error_handler;
>> >  };
>> >
>> >  /** \endcond */
>> > @@ -696,6 +698,44 @@ wl_proxy_marshal_array_constructor(struct wl_proxy *proxy,
>> >                                                   proxy->version);
>> >  }
>> >
>> > +static enum wl_send_error_strategy
>> > +display_send_error_handler_invoke(struct wl_display *display, int send_errno)
>> > +{
>> > +     enum wl_send_error_strategy ret = WL_SEND_ERROR_ABORT;
>> > +     wl_send_error_handler_func_t handler = display->send_error_handler;
>> > +     if (handler) {
>> > +             pthread_mutex_unlock(&display->mutex);
>> > +             ret = handler(display, send_errno);
>> > +             pthread_mutex_lock(&display->mutex);
>> > +     }
>> > +     return ret;
>> > +}
>> > +
>> > +static void
>> > +display_closure_send_error(struct wl_display *display,
>> > +                        struct wl_closure *closure)
>> > +{
>> > +     while (true) {
>> > +             int send_errno = errno;
>> > +             switch (display_send_error_handler_invoke(display, send_errno)) {
>> > +             case WL_SEND_ERROR_FATAL:
>> > +                     display_fatal_error(display, send_errno);
>> > +                     return;
>> > +
>> > +             case WL_SEND_ERROR_RETRY:
>> > +                     if (!wl_closure_send(closure, display->connection)) {
>> > +                             display_send_error_handler_invoke(display, 0);
>> > +                             return;
>> > +                     }
>> > +                     break;
>> > +
>> > +             case WL_SEND_ERROR_ABORT:
>> > +             default:
>> > +                     wl_abort("Error sending request: %s\n",
>> > +                              strerror(send_errno));
>> > +             }
>> > +     }
>> > +}
>> >
>> >  /** Prepare a request to be sent to the compositor
>> >   *
>> > @@ -743,6 +783,9 @@ wl_proxy_marshal_array_constructor_versioned(struct wl_proxy *proxy,
>> >                       goto err_unlock;
>> >       }
>> >
>> > +     if (proxy->display->last_error)
>> > +             goto err_unlock;
>> > +
>> >       closure = wl_closure_marshal(&proxy->object, opcode, args, message);
>> >       if (closure == NULL)
>> >               wl_abort("Error marshalling request: %s\n", strerror(errno));
>> > @@ -751,7 +794,7 @@ wl_proxy_marshal_array_constructor_versioned(struct wl_proxy *proxy,
>> >               wl_closure_print(closure, &proxy->object, true);
>> >
>> >       if (wl_closure_send(closure, proxy->display->connection))
>> > -             wl_abort("Error sending request: %s\n", strerror(errno));
>> > +             display_closure_send_error(proxy->display, closure);
>> >
>> >       wl_closure_destroy(closure);
>> >
>> > @@ -2000,6 +2043,28 @@ wl_display_flush(struct wl_display *display)
>> >       return ret;
>> >  }
>> >
>> > +/** Sets the error handler to invoke when a send fails internally
>> > +*
>> > +* \param display The display context object
>> > +* \param handler The function to call when an error occurs.
>> > +*
>> > +* Sets the error handler to call to decide what should happen on a send error.
>> > +*
>> > +* If no handler is set, the client will use the WL_SEND_ERROR_ABORT strategy.
>> > +* This means the core client code will call abort() if it encounters a failure
>> > +* while sending a message to the server.
>> > +*
>> > +* Setting a handler allows a client implementation to alter this behavior.
>> > +*
>> > +* \memberof wl_display
>> > +*/
>> > +WL_EXPORT void
>> > +wl_display_set_send_error_handler_client(struct wl_display *display,
>> > +                                      wl_send_error_handler_func_t handler)
>> > +{
>> > +     display->send_error_handler = handler;
>> > +}
>> > +
>> >  /** Set the user data associated with a proxy
>> >   *
>> >   * \param proxy The proxy object
>>
>