[RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

Wed Aug 4 12:11:25 UTC 2021

On Tue, Aug 3, 2021 at 9:34 AM Michel Dänzer <michel at daenzer.net> wrote:
> On 2021-08-03 8:11 a.m., Kasireddy, Vivek wrote:
> >>> The goal:
> >>> - Maintain full framerate even when the Guest scanout FB is flipped onto a hardware
> >> plane
> >>> on the Host -- regardless of either compositor's scheduling policy -- without making any
> >>> copies and ensuring that both Host and Guest are not accessing the buffer at the same
> >> time.
> >>>
> >>> The problem:
> >>> - If the Host compositor flips the client's buffer (in this case Guest compositor's buffer)
> >>> onto a hardware plane, then it can send a wl_buffer.release event for the previous buffer
> >>> only after it gets a pageflip completion. And, if the Guest compositor takes 10-12 ms to
> >>> submit a new buffer and given the fact that the Host compositor waits only for 9 ms, the
> >>> Guest compositor will miss the Host's repaint cycle resulting in halved frame-rate.
> >>>
> >>> The solution:
> >>> - To ensure full framerate, the Guest compositor has to start it's repaint cycle (including
> >>> the 9 ms wait) when the Host compositor sends the frame callback event to its clients.
> >>> In order for this to happen, the dma-fence that the Guest KMS waits on -- before sending
> >>> pageflip completion -- cannot be tied to a wl_buffer.release event. This means that, the
> >>> Guest compositor has to be forced to use a new buffer for its next repaint cycle when it
> >>> gets a pageflip completion.
> >>
> >> Is that really the only solution?
> > [Kasireddy, Vivek] There are a few others I mentioned here:
> > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > But I think none of them are as compelling as this one.
> >
> >>
> >> If we fix the event timestamps so that both guest and host use the same
> >> timestamp, but then the guest starts 5ms (or something like that) earlier,
> >> then things should work too? I.e.
> >> - host compositor starts at (previous_frametime + 9ms)
> >> - guest compositor starts at (previous_frametime + 4ms)
> >>
> >> Ofc this only works if the frametimes we hand out to both match _exactly_
> >> and are as high-precision as the ones on the host side. Which for many gpu
> >> drivers at least is the case, and all the ones you care about for sure :-)
> >>
> >> But if the frametimes the guest receives are the no_vblank fake ones, then
> >> they'll be all over the place and this carefully tuned low-latency redraw
> >> loop falls apart. Aside fromm the fact that without tuning the guests to
> >> be earlier than the hosts, you're guaranteed to miss every frame (except
> >> when the timing wobbliness in the guest is big enough by chance to make
> >> the deadline on the oddball frame).
> > [Kasireddy, Vivek] The Guest and Host use different event timestamps as we don't
> > share these between the Guest and the Host. It does not seem to be causing any other
> > problems so far but we did try the experiment you mentioned (i.e., adjusting the delays)
> > and it works. However, this patch series is meant to fix the issue without having to tweak
> > anything (delays) because we can't do this for every compositor out there.
>
> Maybe there could be a mechanism which allows the compositor in the guest to automatically adjust its repaint cycle as needed.
>
> This might even be possible without requiring changes in each compositor, by adjusting the vertical blank periods in the guest to be aligned with the host compositor repaint cycles. Not sure about that though.
>
> Even if not, both this series or making it possible to queue multiple flips require corresponding changes in each compositor as well to have any effect.

Yeah from all the discussions and tests done it sounds even with a
deeper queue we have big coordination issues between the guest and
host compositor (like the example that the guest is now rendering at
90fps instead of 60fps like the host).

Hence my gut feeling reaction that first we need to get these two
compositors aligned in their timings, which propobably needs
consistent vblank periods/timestamps across them (plus/minux
guest/host clocksource fun ofc). Without this any of the next steps
will simply not work because there's too much jitter by the time the
guest compositor gets the flip completion events.

Once we have solid events I think we should look into statically
tuning guest/host compositor deadlines (like you've suggested in a
bunch of places) to consisently make that deadline and hit 60 fps.
With that we can then look into tuning this automatically and what to
do when e.g. switching between copying and zero-copy on the host side
(which might be needed in some cases) and how to handle all that.

Only when that all shows that we just can't hit 60fps consistently and
really need 3 buffers in flight should we look at deeper kms queues.
And then we really need to implement them properly and not with a
mismatch between drm_event an out-fence signalling. These quick hacks
are good for experiments, but there's a pile of other things we need
to do first. At least that's how I understand the problem here right
now.

Cheers, Daniel

>
>
> --
> Earthling Michel Dänzer               |               https://redhat.com
> Libre software enthusiast             |             Mesa and X developer

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch