[PATCH 3/4] tests: add display test that highlights another bug in read_events

Wed Sep 10 02:17:44 PDT 2014

On Tue, 9 Sep 2014 13:10:48 +0200
Marek Chalupa <mchqwerty at gmail.com> wrote:

> On 4 September 2014 15:17, Pekka Paalanen <ppaalanen at gmail.com> wrote:
> 
> > On Fri, 29 Aug 2014 11:21:30 +0200
> > Marek Chalupa <mchqwerty at gmail.com> wrote:
> >
> > > When wl_display_read_events() returns with errno == EAGAIN, we
> > > naturally try to call it again. But this next call results in
> > > deadlock.
> > >
> > > Signed-off-by: Marek Chalupa <mchqwerty at gmail.com>
> > > ---
> > >  tests/display-test.c | 88
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 88 insertions(+)
> > >
> > > diff --git a/tests/display-test.c b/tests/display-test.c
> > > index 1289866..e1cf830 100644
> > > --- a/tests/display-test.c
> > > +++ b/tests/display-test.c
> > > @@ -32,6 +32,7 @@
> > >  #include <sys/types.h>
> > >  #include <sys/stat.h>
> > >  #include <pthread.h>
> > > +#include <poll.h>
> > >
> > >  #include "wayland-private.h"
> > >  #include "wayland-server.h"
> > > @@ -480,6 +481,93 @@ TEST(threading_cancel_read_tst)
> > >       display_destroy(d);
> > >  }
> > >
> > > +/* set this to 1 when it is right that
> > > + * the thread is woken up */
> > > +static int threads_can_be_woken_up = 0;
> > > +
> > > +static void *
> > > +thread_eagain_func(void *data) {
> > > +     struct client *c = data;
> > > +
> > > +     register_reading(c->wl_display);
> > > +
> > > +     c->display_stopped = 1;
> > > +     assert(wl_display_read_events(c->wl_display) == 0
> > > +             && errno != EAGAIN);
> > > +     /* if thread was sleeping in read_events, this assert
> > > +      * must pass now */
> > > +     assert(threads_can_be_woken_up && "Thread woke up too
> > > early"); +
> > > +     pthread_exit(NULL);
> > > +}
> > > +
> > > +static void
> > > +threading_read_eagain(void)
> > > +{
> > > +     struct client *c = client_connect();
> > > +     pthread_t th1, th2, th3;
> > > +     struct pollfd fds;
> > > +     int i;
> > > +
> > > +     register_reading(c->wl_display);
> > > +
> > > +     th1 = create_thread(c, thread_eagain_func);
> > > +     th2 = create_thread(c, thread_eagain_func);
> > > +     th3 = create_thread(c, thread_eagain_func);
> > > +
> > > +     /* All the threads are sleeping, waiting until read or
> > > cancel
> > > +      * is called. Since we have no data on socket waiting,
> > > +      * the wl_connection_read should end up with error and set
> > > errno
> > > +      * to EAGAIN => wl_display_read_events() will return 0 and
> > > set
> > errno
> > > +      * to EAGAIN. It should not wake up the sleeping threads,
> > > so that
> > > +      * this thread can try to read again */
> > > +     alarm(3);
> > > +     assert(wl_display_read_events(c->wl_display) == 0);
> > > +     assert(errno == EAGAIN);
> >
> > I think there is a problem here. wl_display_read_events() returning
> > 0 means success, and in that case errno may not be set if it was a
> > real success. There is no EAGAIN to be handled, it's already done
> > inside read_events().
> >
> > In this test it won't be a real success, but you cannot code
> > clients to check for errno after getting 0 from
> > wl_display_read_events().
> >
> 
> > > +
> > > +     /* the same should happen next times - but there was a bug
> > > when
> > > +      * the main thread got blocked! */
> > > +     for (i = 0; i < 5; ++i) {
> > > +             assert(wl_display_read_events(c->wl_display) == 0);
> >
> > Therefore you are here calling wl_display_read_events() again after
> > it succeeded! And that IMO is a developer error.
> >
> 
> yes
> 
> 
> > Besides, I cannot see anywhere where it says that it is ok to call
> > wl_display_read_events() again even if it failed, without calling
> > wl_display_prepare_read() first.
> >
> >
> Since original piece of code returned 0 on EAGAIN, then I suppose the
> author considered this as a success, so yes, wl_display_read_events()
> is not probably meant
> to be called again. But consider this:
> if the thread that got EAGAIN (the only thread that is not sleeping)
> calls wl_display_prepare_read()
> + wl_display_read_events() again, it works. Problem is that it does
> not have to call it
> right after the failed wl_display_read_events and the other threads
> are still sleeping.

EAGAIN is not a failure here. It is only a spurious wakeup, and all
threads should go back sleeping on whatever they need to sleep on,
usually poll().

> Moreover, if there are not events on the socket for some time, this
> one thread is looping in cycle
> while the other threads are sleeping.

It's not a busy-loop, because the thread would be sleeping in poll() or
equivalent.

But yes, I see that the other threads sleeping inside their
read_events() are not woken up, so they do not do a cycle. I'm not sure
if that is a problem or not. I suppose waking everyone up would not
hurt, but not waking up would risk deadlocking if the thread with
EAGAIN never read again.

> What I meant to do was to give the programmer the choice whether to
> loop or let
> all threads try it again (cancel read). But as you said before, 0 is
> wrong return value for that.
> See end of the e-mail for conclusions.

Yeah, it seems to be wrong to require cancel_read() after calling
read_events(), no matter whether it succeeded or not.

> Summing it up, in my point of view we have these options:
> 
> * Let it as it is:
>      this can lead to starving of other threads for some time.
> Moreover, if the thread that
>      got EAGAIN decides to exit (or simply not to read again) instead
> of another trial to read, the other threads will block.

Yup, I agree with the analysis here.

> * return -1 and set EAGAIN
>     programmer can either try to read again, or cancel read and let
> all threads do another cycle. Good thing is
>     that programmer knows that there was an error and can cancel read
> in the case that the thread is exiting
>     instead of another cycle.

EAGAIN is not an error, which would make this solution an ABI break.

>  * wake up threads on EAGAIN
>     this solution will force all threads to do another cycle (or
> whatever they want) and should not have any side effects. It can lead
> to big overhead if there are no events on the socket for some time
> (waking up threads again and again in cycle)

I think we should do this. I don't really see the overhead at all,
since every thread should have poll() in its loop, so they will not be
busy-looping.

After all, the sleep inside read_events() is only for synchronizing
reading threads, not for waiting new data to appear.

EAGAIN in read_events() is a success, even though there was nothing
received. On success (and on failure!) all threads sleeping inside
read_events() should be woken up. IOW, whatever thread hits the
reader_count==0 case in read_events() should *always* end up calling
display_wakeup_threads() before returning.

Does this make sense?

In fact, I think wl_display_read_event() failing on last_error should
be equivalent to calling wl_display_cancel_read(). It would be the same
if the app manually first checked for error and then canceled.

Thanks,
pq