Shared atomic state causing Weston repaint failure

Daniel Vetter daniel at ffwll.ch
Thu Jul 5 15:05:53 UTC 2018


On Thu, Jul 05, 2018 at 02:32:42PM +0100, Jakob Bornecrantz wrote:
> Hello DanielĀ² et al,
> 
> I apologies in advance if the things I bring up are a bit orthogonal
> to the current discussion. And I'm writing this from the view of VR,
> but I have not written a VR compositor so take my comments with a bit
> of salt. And also changing modes while using a VR headset is probably
> going to be ultra rare so not sure how much effort it is worth.

We actually just had a lenghty discussion on irc exactly about how this
problem is even more fun with drm leases, where the lessor and lessee
cannot collaborate really on the resource sharing.

> So from a VR compositor getting blocked like this is a no-go as the
> user would quickly throw EPUKE. The situation is compounded by the
> fact that the VR compositor has no idea what the display compositor is
> doing with regards to setting modes so can not do any mitigating on
> its side (like displaying a black screen).
> 
> Some solutions that springs to mind (some I admit are probably not possible).
> 
> - Make sure we don't get into this situation by locking the resources
> of the VR crtc group or allocating enough bandwidth for the display
> compositor crtcs up front.
 
I don't think up-front reserving of resources is feasible, it would be
really fragile driver code. But we can detect when it happens, and allow
the VR compositor to fall back to black (which I think mostly avoids the
EPUKE) and negotiate a better lease with the compositor. Or at least do
the modeset while the user isn't looking through the HMD.

> - Add priority and preemption to atomic so that VR compositor can
> never be blocked.

This is a bit tricky, and worst case it means we just end up with deep
queues, which also results in EPUKE in the VR use case.

One thing we could do is enforce that lessees cannot affected a CRTC not
in their lease, so preventing them from creating havoc in other leases (or
the main compositor).

The main compositor is already allowed to revoke the lease whenever it
sees fit, but I guess we should give it an option to not wreak with leases
it handed out (except when it wants to revoke them anyway).

We're also agreeing that we need the kernel to tell both sides affected by
this what's going on, so they can take appropriate actions (like just show
a black frame) when they get pulled over the table by the other part.

> - Add X/Wayland protocol for the compositor to tell the VR compositor
> that a modeset might effect it, so it can display a black-screen
> during that time.
> 
> - Make it possible for the VR compositor to tell the kernel what it
> should do this case, like show black if I happen to get block before I
> can queue a new pageflip.

I don't think we need protocol beyond what we have right now for leases
(or at least not much). Consensus from IRC seems to be that as long as the
uapi work tells both sides what's going on it should be good enough. The
compositor can then choose to not pull the VR lessee over the table (or
outright revoke the lease if it really needs to do the modeset, and then
start over with everything).
-Daniel

> 
> Cheers, Jakob.
> On Wed, Jul 4, 2018 at 8:58 PM Daniel Vetter <daniel at ffwll.ch> wrote:
> >
> > On Wed, Jul 4, 2018 at 5:44 PM, Daniel Stone <daniel at fooishbar.org> wrote:
> > > Hi,
> > > The atomic API being super-explicit about how userspace sequences its
> > > calls is great and all, but having shared global state implicitly
> > > dragged in is kind of ruining my day.
> > >
> > > Currently on Intel, Weston sometimes fails on hotplug, because a
> > > commit which only enables CRTC B (not touching CRTC A or any other
> > > CRTC!), causes all commits to CRTC A to fail until CRTC B's modeset
> > > commit has fully retired:
> > >     https://gitlab.freedesktop.org/wayland/weston/issues/24
> > >
> > > The reason is that committing CRTC B resizes the DDB allocation for
> > > CRTC A as well, pulling CRTC A's CRTC state into the commit. This
> > > makes sense, but on the other hand it's totally opaque to userspace,
> > > and impossible for us to reason about when making commits.
> > >
> > > I suggested some options in that GitLab commit, none of which I like:
> > >   * if any other CRTCs are pulled into a commit state, always execute
> > > a blocking commit in the kernel
> > >   * if we're passing ALLOW_MODESET in userspace, only ever do blocking commits
> > >   * whenever we get -EBUSY in userspace, assume we've been screwed by
> > > the kernel and defer until other outputs have completed
> > >   * whenever we want to reconfigure any output in userspace, wait
> > > until all outputs are completely quiescent and do a single atomic
> > > commit covering all outputs
> > >
> > > The first one seems completely non-obvious from the kernel, but on the
> > > other hand the current -EBUSY failing behaviour is also non-obvious.
> > >
> > > The second is maybe the most reasonable, but on the other hand just
> > > working around a painful leaky abstraction: we also can't know upfront
> > > from userspace if this is actually going to be required, or if we're
> > > just killing responsiveness blocking for no reason.
> > >
> > > The third is the thing I least want to do, because it might well paper
> > > over legitimate bugs in userspace, and complicates our state tracking
> > > for no reason.
> > >
> > > The fourth is probably the most legitimate, but, well ... someone has
> > > to type up all the code to make our output-configuration API
> > > completely asynchronous.
> > >
> > > I suspect we're the first ones to be hitting this, because Weston has
> > > a truly independent per-CRTC repaint loop, we're one of the few atomic
> > > users, and also because Pekka did some seriously brutal hotplug
> > > testing whilst reworking Weston's output configuration API. Also
> > > because our approach to failed output repaints is to just freeze the
> > > output until it next cycles off and on, which is much more apparent
> > > than just silently dropping a frame here and there. ;)
> > >
> > > Any bright ideas on what could practically be done here?
> > >
> > > Cheers,
> > > Daniel
> >
> > We had an entirely inconclusive chat on irc about this topic. Random notes:
> >
> > - This can fail both on the commit doing a modeset (when it adds some
> > random other CRTC which still has a pending flip), and for normal
> > flips (if they run into a CRTC which has been affected by a modeset).
> >
> > - Just waiting for all nonblocking modesets to complete is not enough,
> > because you don't get an event for these affected CRTC. And they might
> > take longer. So you can still get an EBUSY even when it looks like
> > everything is done.
> >
> > - If you don't combine nonblocking with ALLOW_MODESET there's no
> > issue, because everyone just blocks (and maybe misses a few frames)
> > until the modeset is done.
> >
> > - We want to tell apart an EBUSY due to the redraw loop having lost
> > sync from an EBUSY due to these modesets that affect more than just
> > the passed-in CRTCs. It's a really nice debug feature for compositor
> > developers.
> >
> > -> One way or another we need new/clarified uapi, atomic as
> > implemented doesn't really cut it.
> >
> > A few of the options we've discussed:
> >
> > - Eating all the EBUSY for ALLOW_MODESET. A bit a bigger hammer.
> > Slightly smaller hammer is just eating all the EBUSY for crtc not in
> > the original request. It would still be nice to tell userspace about
> > the additional CRTCs its atomic comit affects.
> >
> > - Some new flags to send out events (what about fence_ptr?) for these
> > affected crtcs, plus reporting them out in a new affected_crtcs_mask
> > in the atomic ioctl struct.
> >
> > - One tricky issue is that drivers might pull in CRTCs which are
> > disabled and stay disabled. Events aren't well-defined on these at
> > all, but they might still provoke an EBUSY!
> >
> > - Compositors would like to know how many frames they've lost. One
> > idea was to split up the retry error codes: EINTR would mean retrying
> > immedetialy due to a signal. EAGAIN would mean the kernel had to
> > insert something to make a modeset possible, retry only after you
> > waited for one vblank. With current userspace this would only result
> > in cpu wasted, but updated userspace could be more intelligent about
> > things. We could also do a special flag or whatever.
> >
> > We didn't zero in on anything specific since it's not really clear
> > what compositors want here at all.
> >
> > Cheers, Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > +41 (0) 79 365 57 48 - http://blog.ffwll.ch
> > _______________________________________________
> > dri-devel mailing list
> > dri-devel at lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


More information about the dri-devel mailing list