Wayland fd watching

Pekka Paalanen ppaalanen at gmail.com
Thu Jan 13 09:42:54 UTC 2022


On Thu, 13 Jan 2022 10:05:56 +0200
Maksim Sisov <msisov at igalia.com> wrote:

> + wayland-devel ML.
> 
> Hi Pekka, 
> 
> Thanks for your answers! Please see my questions inlined. 
> 
> On 2022-01-12 16:16, Pekka Paalanen wrote:
> 
> > On Mon, 10 Jan 2022 08:53:50 +0200
> > Maksim Sisov <msisov at igalia.com> wrote:
> >   
> >> Hi Pekka, 
> >> 
> >> Thanks for answering all my previous questions before.  
> > 
> > Hi Maksim,
> > 
> > looks like that was a year ago. I had forgotten. :-)   
> 
> Time flies!
> 
> >   
> >> I came up with a new question and wondered what was your opinion. 
> >> 
> >> In Chromium, we have two modes - one is normal when a GPU service runs
> >> in a separate sandboxed process and uses surfaceless path with libgbm
> >> (we pass dmabuf to the browser process where Wayland connection is...),
> >> and another one is --in-process-gpu when the GPU service runs in the
> >> same browser process, but in a different thread. That also uses
> >> surfaceless path by default. 
> >> 
> >> However, when either libgbm is not available or the system in question
> >> doesn't support DRM render nodes, the browser may fall back to Wayland
> >> EGL. It is only used if the browser runs with the --in-process-gpu
> >> flag, as we need to access a wl_surface somehow. That means the GPU
> >> service runs on a separate thread that uses Wayland EGL, while the
> >> main UI thread actually does all the UI work, manages the
> >> connection, etc. 
> >> 
> >> A long time ago, we figured out there was a problem with
> >> --in-process-gpu + Wayland EGL. Given that both Chromium as a Wayland
> >> client and Wayland EGL, as another client, each prepare/read/dispatch
> >> events (each with its own event queue, of course), we started to
> >> experience deadlocks.  
> > 
> > Surely they are the same client (connection)?  
> 
> Yes, they are.
> 
> >   
> >> The deadlock happened when Chromium was closing a native window: our
> >> event loop had already prepared to read from the display, but then
> >> didn't read, because the UI thread went on to tear down the GPU
> >> thread. The GPU thread was calling eglSwapBuffers, which internally
> >> also called prepare and then read; its read didn't complete but
> >> blocked, because another thread had already prepared to read without
> >> reading. Meanwhile, the UI thread was in a blocking wait for the GPU
> >> thread to finish its work before tearing it down. 
> >> 
> >> As you can see from my description, the UI thread prepared to read the
> >> Wayland display, then switched to tearing down the GPU thread, which
> >> was busy processing our GL commands, and it had to wait until that was
> >> done. The GPU thread used Wayland EGL, which also prepared to read,
> >> and its read call blocked because other threads had prepared to read
> >> as well.  
> > 
> > I see. Yes, there is the assumption that if a thread prepares to read,
> > then it will either read or cancel the read in finite time (ASAP).
> >   
> >> To overcome this problem, it was decided to use a separate polling
> >> thread. What it does now is prepare/poll/read (we use libevent or
> >> glib, depending on how Chromium is configured), while the UI thread
> >> dispatches events whenever our "polling" thread asks it to. This way,
> >> the polling thread can always call wl_display_read_events() and the
> >> deadlock never happens.  
> > 
> > Right, I can see that working.
> >   
> >> I wonder if this is the right way to do so. I also wonder if doing this
> >> stuff on a separate thread gives any performance benefits?  
> > 
> > It works, but it's not a "proposed" design in my opinion. Doing
> > all of prepare/read/dispatch in each thread is the "intended" usage.
> > 
> > I think the core of the problem is the thread that prepares to read and
> > then doesn't read but goes to do something else. That is basically a
> > violation of the API contract of libwayland-client. Unfortunately in
> > this case it leads to a solid deadlock. The documentation is very clear
> > about it:
> > 
> > * This function (or wl_display_prepare_read()) must be called before reading
> > * from the file descriptor using wl_display_read_events(). Calling
> > * wl_display_prepare_read_queue() announces the calling thread's intention to
> > * read and ensures that until the thread is ready to read and calls
> > * wl_display_read_events(), no other thread will read from the file descriptor.
> > 
> > You could also call wl_display_cancel_read() instead:
> > 
> > * After a thread successfully called wl_display_prepare_read() it must
> > * either call wl_display_read_events() or wl_display_cancel_read().
> > * If the threads do not follow this rule it will lead to deadlock.
> > 
> > But if you cancel, then you must not read without preparing again first.
> > 
> > The separate polling thread does not sound bad though. It is a correct
> > design. Now that I think of it, wl_display_read_events() has a
> > pthread_cond_wait() in it that blocks all other reading threads until
> > the last reading thread actually calls read (or cancel). The separate
> > polling thread, like you described it, might avoid some of that
> > blocking.
> > 
> > Perhaps the polling thread design is better when it is possible. You
> > can use it in your own code, but it's not really possible for things
> > like EGL implementations since EGL does not define an API for
> > cross-thread wake-ups.  
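
For reference, the "intended" per-thread usage described above can be
sketched roughly as follows. This is a minimal sketch with error
handling omitted; it will not run without a live Wayland connection,
and the function name is mine, not Chromium's:

```c
/* Sketch of the per-thread prepare/read/dispatch dance from the
 * wl_display_prepare_read() documentation. Every thread that
 * successfully prepares MUST end up calling either
 * wl_display_read_events() or wl_display_cancel_read(). */
#include <poll.h>
#include <wayland-client.h>

void
dispatch_iteration(struct wl_display *display)
{
	struct pollfd pfd = {
		.fd = wl_display_get_fd(display),
		.events = POLLIN,
	};

	/* prepare_read() fails while events are still queued, so
	 * dispatch them first. */
	while (wl_display_prepare_read(display) != 0)
		wl_display_dispatch_pending(display);

	/* Flush requests so the compositor can produce the replies
	 * we are about to wait for. */
	wl_display_flush(display);

	if (poll(&pfd, 1, -1) > 0)
		wl_display_read_events(display);
	else
		wl_display_cancel_read(display);

	wl_display_dispatch_pending(display);
}
```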
> 
> I see. The only problem here is that we don't use poll, but rather
> libevent, which is wrapped by Chromium's MessagePumpLibevent. It
> constantly notifies us that an fd is ready to be read.

You say "constantly", do you mean it never waits?

> Here is a diagram that shows the flow without a dedicated thread -
> https://drive.google.com/file/d/1Jq8aX4cWszwL5fleE8ayGOeond70xChS/view?usp=sharing
> .
> 
> As you can see, it returns from OnCanRead and waits until libevent
> notifies us the next time that we can actually read events (if we
> prepared to read previously). And if the main thread blocks for the
> above-mentioned reason (libevent didn't get a chance to notify us),
> the events will never be read.

Are you describing the old approach or the new approach with the
separate polling thread?

Doing prepare_read from OnCanRead handler seems a bit backwards. The
prepare_read dance is to make sure that when you are going to wait for
new events (block), you have processed all incoming events so far and
flushed out all requests that might have resulted from those, so that
if you are waiting for replies to those requests, they will actually
arrive. So the prepare_read/dispatch/flush dance should be done from a
"the event loop is going to sleep now" hook, not from "the fd has bytes
to read" hook.
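
A rough sketch of splitting the dance across those two hooks might look
like this. The hook names are hypothetical stand-ins for whatever the
event loop library (libevent, glib, ...) actually provides; this is not
Chromium's code:

```c
/* Hedged sketch: prepare/flush from the "going to sleep" hook,
 * read/dispatch from the "fd is readable" hook. */
#include <stdbool.h>
#include <wayland-client.h>

static bool prepared = false;

/* Hypothetical "the event loop is going to sleep now" hook. */
void
on_about_to_wait(struct wl_display *display)
{
	/* Drain queued events, then announce the intent to read and
	 * flush any requests those events generated. */
	while (wl_display_prepare_read(display) != 0)
		wl_display_dispatch_pending(display);
	wl_display_flush(display);
	prepared = true;
}

/* Hypothetical "the fd has bytes to read" hook. */
void
on_fd_readable(struct wl_display *display)
{
	if (!prepared)
		return;
	prepared = false;
	wl_display_read_events(display);
	wl_display_dispatch_pending(display);
}
```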

Hmm, yeah... I suppose when your event loop decides to do something
other than actually read the Wayland socket after it was prepared for
reading, and that something else is a blocking operation, you would
have to cancel the read first just in case, and then prepare_read
again to be able to read afterwards.
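
In other words, something along these lines. The blocking call is a
hypothetical placeholder for whatever the thread goes off to wait on
(e.g. GPU thread teardown):

```c
/* Hedged sketch: cancel a prepared read before blocking on unrelated
 * work, so other threads are not locked out of reading the socket. */
#include <wayland-client.h>

extern void wait_for_other_work(void); /* hypothetical blocking call */

void
block_on_other_work(struct wl_display *display)
{
	/* This thread called wl_display_prepare_read() earlier but is
	 * about to block on something else: cancel the read first. */
	wl_display_cancel_read(display);

	wait_for_other_work();

	/* Must prepare again before any wl_display_read_events(). */
	while (wl_display_prepare_read(display) != 0)
		wl_display_dispatch_pending(display);
}
```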

Maybe others here have better ideas, or know libevent or glib better?
I recall people having written integrations with those event loop
libraries, but I can't remember who and where.


Thanks,
pq