[PATCH] client: Allow send error recovery without an abort

Pekka Paalanen ppaalanen at gmail.com
Tue Jun 19 12:12:08 UTC 2018


On Mon, 18 Jun 2018 18:41:28 -0700
Lloyd Pique <lpique at google.com> wrote:

> Hi Pekka!
> 
> Thank you for the feedback. In the end, I think we have a good basic
> agreement about the patch, and after reading your linked bug (
> https://gitlab.freedesktop.org/wayland/wayland/issues/12), which I was
> previously unaware of, it sounds like I will be just stripping out the
> callback+retry case from my patch.
> 
> I will probably go ahead and add a function to check for the half-full
> output buffer, and one you didn't think of to check whether MAX_FDS_OUT is
> about to be exceeded by the next call (that count is small enough to
> actually be a concern).

Sounds good. I'll reply to your other email too.

> At this point I think the only real question I have is one of checking the
> intent behind the bug you filed. Should the design be that the ONLY behavior
> is to set a fatal error on the display context, or should there be a public
> API to select between the previous/default abort and the fatal error?

I think the aim should be to remove the abort(). I do not see any
reason to leave it in even as an option.

> 
> I'll send out a revised patch after hearing back with that bit of guidance.
> 
> In case it matters, I did also include my responses to your points inline
> below. I don't expect to convince you that this patch should be used as is,
> but perhaps it will give you or someone else some clarity and insight into
> why I suggested the change I did.

Nice, thank you for that. Some more below.


> On Mon, Jun 18, 2018 at 4:58 AM Pekka Paalanen <ppaalanen at gmail.com> wrote:
> 
> > On Tue,  5 Jun 2018 18:14:54 -0700
> > Lloyd Pique <lpique at google.com> wrote:
> >  
> > > Introduce a new call wl_display_set_error_handler_client(), which allows
> > > a client to register a callback function which is invoked if there is an
> > > error (possibly transient) while sending messages to the server.
> > >
> > > This allows a Wayland client that is actually a nested server to try and
> > > recover more gracefully from send errors, allowing it to disconnect
> > > and reconnect to the server if necessary.
> > >
> > > The handler is called with the wl_display and the errno, and is expected
> > > to return one of WL_SEND_ERROR_ABORT, WL_SEND_ERROR_FATAL, or
> > > WL_SEND_ERROR_RETRY. The core-client code will then respectively abort()
> > > the process (the default if no handler is set), set the display context
> > > into an error state, or retry the send operation that failed.
> > >
> > > The existing display test is extended to exercise the new function.  
> >
> > Hi Lloyd,
> >
> > technically this looks like a quality submission, thank you for that. I
> > am particularly happy to see all the new cases are added into the test
> > suite. I agree that calling wl_abort() is nasty and it would be good to
> > have something else.
> >
> > However, I'm not quite convinced of the introduced complexity here.
> > More comments inline.


> > If you know you are sending lots of requests, you should be calling
> > wl_display_flush() occasionally. It will return -1 if flushing fails to
> > let you know you need to wait. The main problem I see with that is when
> > to attempt to flush.
> >  
> 
> We actually do have a call to wl_display_flush() in our larger
> implementation, called frequently as part of ensuring everything is
> committed to the display. It does currently handle a return of -1, but by
> waking up the input event processing thread in case the error is fatal, as
> that thread handles disconnecting and reconnecting.
> 
> On the surface, making it detect an EAGAIN and wait sounds reasonable. But
> I wonder how often in normal use it would return EAGAIN, and so force a
> wait that isn't actually needed while there is still some space in the
> core client output buffer. My reason for adding the callback and the retry
> option was specifically to only make the client block if there really was
> no other way to avoid an error state.

The soft-buffer in libwayland-client is quite small, while the kernel
socket buffers are usually much larger in bytes. EAGAIN is triggered
only when the flush to the kernel fails, so I don't think that would
happen needlessly.
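
If you do end up waiting on EAGAIN, the usual pattern would be roughly
this (an untested sketch on my part, using only public
libwayland-client API; error handling to taste):

#include <errno.h>
#include <poll.h>
#include <wayland-client.h>

/* Flush, and if the kernel socket buffer is full, block until the
 * socket becomes writable again. Returns 0 on success, -1 on a real
 * error with errno set. */
static int
flush_blocking(struct wl_display *display)
{
	struct pollfd pfd = {
		.fd = wl_display_get_fd(display),
		.events = POLLOUT,
	};

	while (wl_display_flush(display) < 0) {
		if (errno != EAGAIN)
			return -1;

		if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
			return -1;
	}

	return 0;
}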


> That said, I actually am willing to detect EAGAIN on wl_display_flush(),
> and explicitly wait then. In trying out this patch, I only ended up
> artificially reproducing an EAGAIN error (taking a few seconds to trigger)
> when I magnified the request traffic by a factor of about 1000x. That is a
> pretty unreasonable magnification even for a nested server, so while it
> could happen, I don't think this is actually that important to my project.


> Thinking about it fresh, maybe that is reasonable as a general option for
> clients. The one caveat I could think of has to do with MAX_FDS_OUT and it
> being only 28. My project does use the linux-dmabuf-unstable-v1 protocol,
> and having more than 14 surfaces in a nested server seems like it could
> actually happen often. And on second thought, I could see my client filling
> up all 28 in one frame, thanks to perhaps an ill-timed flood of
> notifications in addition to normal UI surfaces.

If I'm reading the code right, MAX_FDS_OUT is only a limit for a single
sendmsg() call. The fds_out buffer is capable of holding 1k fds.

We have a 4kB message data buffer (what I've been calling the
soft-buffer, as opposed to in-kernel buffers). A minimal request to
carry an fd would be header (8B) plus fd (4B), so that's 12B. That
makes at most 341 fds until the soft-buffer is full. That should well
fit into the 1k fds that fds_out can hold.

One could have more fds in a single request. Let's say four. The
request size becomes 24B carrying 4 fds. That makes 680 fds for a full
soft-buffer. Should still be fine.
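
As a quick sanity check of that arithmetic, in a trivial throwaway
program (sizes as quoted above):

#include <stdio.h>

int main(void)
{
	const int soft_buffer = 4096;	/* bytes of message data */
	const int hdr = 8, fd_arg = 4;	/* request header, one fd argument */

	/* minimal 12B request carrying one fd */
	printf("%d\n", soft_buffer / (hdr + fd_arg));		/* 341 */
	/* 24B request carrying four fds */
	printf("%d\n", soft_buffer / (hdr + 4 * fd_arg) * 4);	/* 680 */
	return 0;
}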

However, wl_connection_flush() attempts to write out at most
MAX_FDS_OUT fds at a time, but all of the message data. OTOH, one
cannot even try to flush out fds without any message data; IIRC
sendmsg() doesn't support that. So it can run out of message data while
fds accumulate and finally fill fds_out up, causing irrecoverable
failure.

Do your observations support this theory?
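
For reference, fds travel as SCM_RIGHTS ancillary data attached to
ordinary message bytes, which is why they cannot be flushed on their
own. Roughly like this (a minimal sketch, not the actual wl_connection
code):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one byte of message data with one fd attached. On a stream
 * socket the ancillary payload cannot go out without at least one
 * byte of ordinary data, so fds always ride along with message
 * bytes. */
static ssize_t
send_one_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, MSG_NOSIGNAL);
}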

It seems we should somehow flush less message data so that we can call
sendmsg() enough times to handle all the fds and avoid them piling up.
But if we do so, we must be very careful with any potential assumptions
on the receiving side about the ordering between message data and
related fds.

> 
> Hmm, the patch I've uploaded does catch that, but if I remove the RETRY
> option I think I will need to add another special-case
> wl_display_flush() check when creating buffers. In my case I would actually
> prefer to find out that the fd buffer is completely full before
> waiting/flushing though.
> 
> 
> > > + *
> > > + * If instead a send error strategy function is set, the core client library
> > > + * will call it to allow the client to choose a different strategy.
> > > + *
> > > + * The function takes two arguments.  The first is the display context for the
> > > + * error.  The second is the errno for the failure that occurred.
> > > + *
> > > + * The handler is expected to return one of the wl_send_error_strategy values
> > > + * to indicate how the core client library should proceed.
> >
> > This is the detail that bothers me: making the callback to determine
> > policy. You have no idea in what context that callback will be made.
> >  
> 
> Sure, but I meant this to be a choice available to the author of the larger
> client, assuming they had full knowledge of what the consequences of doing
> anything non-trivial were.

Even the author of the client would usually not know. It would make a
very hard API to use correctly.


> To me there only seems to be one real general pattern for the callback
> handler that might implement the retry strategy. It is just slightly more
> complex than the retry_with_poll_handler() from the test I added.
> 
> To sketch it out in quick pseudo-code, it would be:
> 
> 
> #include <errno.h>
> #include <poll.h>
> #include <time.h>
> #include <wayland-client.h>
> 
> #define RETRY_LIMIT_SEC 5 /* client-chosen retry timeout */
> 
> int typical_retry_handler(struct wl_display *display, int send_errno)
> {
>         static struct timespec start; /* zeroed => no retry in progress */
>         struct timespec now;
>         struct pollfd pfd;
> 
>         if (send_errno == 0) { /* retry succeeded */
>                 start = (struct timespec){ 0 }; /* clear retry start timestamp */
>                 return WL_SEND_ERROR_RETRY; /* value ignored on success */
>         }
> 
>         if (send_errno != EAGAIN)
>                 return WL_SEND_ERROR_FATAL;
> 
>         clock_gettime(CLOCK_MONOTONIC, &now);
>         if (start.tv_sec == 0 && start.tv_nsec == 0)
>                 start = now; /* retry timestamp was not yet set */
> 
>         if (now.tv_sec - start.tv_sec > RETRY_LIMIT_SEC)
>                 return WL_SEND_ERROR_FATAL;
> 
>         /* poll, using the timestamp to bound the timeout */
>         pfd.fd = wl_display_get_fd(display);
>         pfd.events = POLLOUT;
>         poll(&pfd, 1, (int)(RETRY_LIMIT_SEC - (now.tv_sec - start.tv_sec)) * 1000);
> 
>         return WL_SEND_ERROR_RETRY;
> }
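> 
> (With the patch as posted, this handler would be installed with
> wl_display_set_error_handler_client(display, typical_retry_handler).)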
> 
> 
> If your objection is that you would rather this handler be standardized, I
> can move it into the patch itself, use the existing core client mutex, and
> have the timestamp be part of the wl_display state. Multiple threads would
> block on the send, but they would have done so anyway. Then there could be
> a public API that selected between the default (abort), error, or
> error+retry implemented by the above pseudocode.

My objection is to the automatic blocking in general. We have a pretty
nicely asynchronous protocol and we try to teach programmers to write
their apps in a non-blocking way, so blocking inside libwayland-client
is not nice. We have very few functions that are documented as
potentially blocking. Extending that to all request sending functions
is not acceptable in my mind.

If blocking is a suitable solution for a flush failure, I'd rather give
the user the choice of where and when to block. Then there is no need
to set even a timeout policy in libwayland-client.

> 
> But as you said earlier, I will instead have the wl_display_flush() call
> do the poll().


Thanks,
pq