[PATCH 3/4] tests: add display test that highlights another bug in read_events

Tue Sep 9 06:48:13 PDT 2014

On Tue, 9 Sep 2014 13:10:48 +0200
Marek Chalupa <mchqwerty at gmail.com> wrote:

> On 4 September 2014 15:17, Pekka Paalanen <ppaalanen at gmail.com> wrote:
> 
> > On Fri, 29 Aug 2014 11:21:30 +0200
> > Marek Chalupa <mchqwerty at gmail.com> wrote:
> >
> > > When wl_display_read_events() returns with errno == EAGAIN, we
> > > naturally try to call it again. But this next call results in
> > > deadlock.
> > >
> > > Signed-off-by: Marek Chalupa <mchqwerty at gmail.com>
> > > ---
> > >  tests/display-test.c | 88
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 88 insertions(+)
> > >
> > > diff --git a/tests/display-test.c b/tests/display-test.c
> > > index 1289866..e1cf830 100644
> > > --- a/tests/display-test.c
> > > +++ b/tests/display-test.c
> > > @@ -32,6 +32,7 @@
> > >  #include <sys/types.h>
> > >  #include <sys/stat.h>
> > >  #include <pthread.h>
> > > +#include <poll.h>
> > >
> > >  #include "wayland-private.h"
> > >  #include "wayland-server.h"
> > > @@ -480,6 +481,93 @@ TEST(threading_cancel_read_tst)
> > >       display_destroy(d);
> > >  }
> > >
> > > +/* set this to 1 when it is right that
> > > + * the thread is woken up */
> > > +static int threads_can_be_woken_up = 0;
> > > +
> > > +static void *
> > > +thread_eagain_func(void *data) {
> > > +     struct client *c = data;
> > > +
> > > +     register_reading(c->wl_display);
> > > +
> > > +     c->display_stopped = 1;
> > > +     assert(wl_display_read_events(c->wl_display) == 0
> > > +             && errno != EAGAIN);
> > > +     /* if thread was sleeping in read_events, this assert
> > > +      * must pass now */
> > > +     assert(threads_can_be_woken_up && "Thread woke up too
> > > early"); +
> > > +     pthread_exit(NULL);
> > > +}
> > > +
> > > +static void
> > > +threading_read_eagain(void)
> > > +{
> > > +     struct client *c = client_connect();
> > > +     pthread_t th1, th2, th3;
> > > +     struct pollfd fds;
> > > +     int i;
> > > +
> > > +     register_reading(c->wl_display);
> > > +
> > > +     th1 = create_thread(c, thread_eagain_func);
> > > +     th2 = create_thread(c, thread_eagain_func);
> > > +     th3 = create_thread(c, thread_eagain_func);
> > > +
> > > +     /* All the threads are sleeping, waiting until read or
> > > cancel
> > > +      * is called. Since we have no data on socket waiting,
> > > +      * the wl_connection_read should end up with error and set
> > > errno
> > > +      * to EAGAIN => wl_display_read_events() will return 0 and
> > > set
> > errno
> > > +      * to EAGAIN. It should not wake up the sleeping threads,
> > > so that
> > > +      * this thread can try to read again */
> > > +     alarm(3);
> > > +     assert(wl_display_read_events(c->wl_display) == 0);
> > > +     assert(errno == EAGAIN);
> >
> > I think there is a problem here. wl_display_read_events() returning
> > 0 means success, and in that case errno may not be set if it was a
> > real success. There is no EAGAIN to be handled, it's already done
> > inside read_events().
> >
> > In this test it won't be a real success, but you cannot code
> > clients to check for errno after getting 0 from
> > wl_display_read_events().
> >
> 
> > > +
> > > +     /* the same should happen next times - but there was a bug
> > > when
> > > +      * the main thread got blocked! */
> > > +     for (i = 0; i < 5; ++i) {
> > > +             assert(wl_display_read_events(c->wl_display) == 0);
> >
> > Therefore you are here calling wl_display_read_events() again after
> > it succeeded! And that IMO is a developer error.
> >
> 
> yes
> 
> 
> > Besides, I cannot see anywhere where it says that it is ok to call
> > wl_display_read_events() again even if it failed, without calling
> > wl_display_prepare_read() first.
> >
> >
> Since original piece of code returned 0 on EAGAIN, then I suppose the
> author considered this as a success, so yes, wl_display_read_events()
> is not probably meant
> to be called again. But consider this:
> if the thread that got EAGAIN (the only thread that is not sleeping)
> calls wl_display_prepare_read()
> + wl_display_read_events() again, it works. Problem is that it does
> not have to call it
> right after the failed wl_display_read_events and the other threads
> are still sleeping.
> Moreover, if there are not events on the socket for some time, this
> one thread is looping in cycle
> while the other threads are sleeping.
> 
> What I meant to do was to give the programmer the choice whether to
> loop or let
> all threads try it again (cancel read). But as you said before, 0 is
> wrong return value for that.
> See end of the e-mail for conclusions.
> 
> 
> > But I see something else: wl_display_read_events() does not
> > decrement reader_count if it returns due to last_error. I think
> > that needs a fix to guarantee the reader_count is decremented, so
> > that the side-effects are consistent.
> >
> 
> Yes, we should fix this. There are cases that can cause a problem.
> 
> 
> > > +             assert(errno == EAGAIN);
> > > +     }
> > > +
> > > +     /* generate something to read and wait until the event
> > > comes */
> > > +     wl_proxy_marshal((struct wl_proxy *) c->tc,
> > > +                      0 /* STOP DISPLAY */, 1 /* number */);
> > > +     assert(wl_display_flush(c->wl_display) > 0);
> > > +
> > > +     fds.fd = wl_display_get_fd(c->wl_display);
> > > +     fds.events = POLLIN;
> > > +     assert(poll(&fds, 1, 3) == 1);
> >
> > Is timeout of 3 milliseconds intended?
> >
> 
> Nope, my bad.
> 
> 
> >
> > > +
> > > +     /* threads can be woken up now */
> > > +     threads_can_be_woken_up = 1;
> > > +
> > > +     errno = 0;
> > > +     assert(wl_display_read_events(c->wl_display) == 0);
> > > +     assert(errno != EAGAIN);
> > > +     assert(wl_display_dispatch_pending(c->wl_display) == 1);
> > > +
> > > +     pthread_join(th1, NULL);
> > > +     pthread_join(th2, NULL);
> > > +     pthread_join(th3, NULL);
> > > +
> > > +     client_disconnect(c);
> > > +}
> > > +
> > > +TEST(threading_read_eagain_tst)
> > > +{
> > > +     struct display *d = display_create();
> > > +
> > > +     client_create(d, threading_read_eagain);
> > > +
> > > +     display_run(d);
> > > +     display_resume(d);
> > > +
> > > +     display_destroy(d);
> > > +}
> > > +
> > >  static void *
> > >  thread_prepare_and_read2(void *data)
> > >  {
> >
> >
> > Thanks,
> > pq
> >
> 
> First, thanks for review.
> 
> Summing it up, in my point of view we have these options:
> 
> * Let it as it is:
>      this can lead to starving of other threads for some time.
> Moreover, if the thread that
>      got EAGAIN decides to exit (or simply not to read again) instead
> of another trial to read, the other threads will block.
> * return -1 and set EAGAIN
>     programmer can either try to read again, or cancel read and let
> all threads do another cycle. Good thing is
>     that programmer knows that there was an error and can cancel read
> in the case that the thread is exiting
>     instead of another cycle.
>  * wake up threads on EAGAIN
>     this solution will force all threads to do another cycle (or
> whatever they want) and should not have any side effects. It can lead
> to big overhead if there are no events on the socket for some time
> (waking up threads again and again in cycle)
> 
> + fix the bug with last_error, of course
> 
> Is there something that I'm missing?

Hi,

just a quick comment, since I didn't get to think it all through yet.

I believe wl_display_read_events() encountering EAGAIN is not an error.
The socket fd is non-blocking (hmm, is it?), so EAGAIN is a normal
condition when there is nothing to read. The program is using poll() to
attempt reading only when there is/was something, and otherwise sleep
in poll(). There should not be any busy-spinning.

A good question is, should this work for blocking fds and without
poll()? I'm not even sure libwayland-client gives out blocking fds, but
OTOH I vaguely recall some documentation about blocking.

window.c main event loop in display_run() uses epoll. simple-shm just
calls wl_display_dispatch() in a loop doing nothing else. Maybe these
were the only two use cases supported?

Thanks,
pq