[RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

Kasireddy, Vivek vivek.kasireddy at intel.com
Mon Aug 2 06:51:33 UTC 2021


Hi Daniel,

> 
> On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
> > By separating the OUT_FENCE signalling from pageflip completion allows
> > a Guest compositor to start a new repaint cycle with a new buffer
> > instead of waiting for the old buffer to be free.
> >
> > This work is based on the idea/suggestion from Simon and Pekka.
> >
> > This capability can be a solution for this issue:
> > https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> >
> > Corresponding Weston MR:
> > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
> 
> Uh I kinda wanted to discuss this a bit more before we jump into typing
> code, but well I guess not that much work yet.
[Kasireddy, Vivek] Right, it wasn't a lot of work :)

> 
> So maybe I'm not understanding the problem, but I think the fundamental
> underlying issue is that with KMS you can have at most 2 buffers
> in-flight, due to our queue depth limit of 1 pending flip.
[Kasireddy, Vivek] Let me summarize the problem again from the perspective of
both the Host (Weston) and Guest (Weston) compositors assuming a refresh-rate
of 60 -- which implies the Vblank/Vsync is generated every ~16.66 ms.
Host compositor:
- After a pageflip completion event, it starts its next repaint cycle by waiting for 9 ms
and then submits the atomic commit and at the tail end of its cycle sends a frame callback
event to all its clients (who registered and submitted frames) indicating to them to 
start their next redraw  -- giving them at-least ~16 ms to submit a new frame to be
included in its next repaint. Why a configurable 9 ms delay is needed is explained
in Pekka's blog post here:
https://ppaalanen.blogspot.com/2015/02/weston-repaint-scheduling.html

- It'll send a wl_buffer.release event for a client submitted previous buffer only
when the client has submitted a new buffer and:
a) When it hasn't started its repaint cycle yet OR
b) When it clears its old state after it gets a pageflip completion event -- if it had
flipped the client's buffer onto a hardware plane.

Guest compositor:
- After a pageflip completion is sent by Guest KMS, it takes about 10-12 ms for the 
Guest compositor to submit a new atomic commit. This time of 10-12 ms includes the
9 ms wait -- just like the Host compositor -- for its clients to submit new buffers.
- When it gets a pageflip completion, it assumes that the previously submitted buffer
is free for re-use and uses it again -- resulting in the usage of only 2 out of a maximum
of 4 backbuffers included as part of the Mesa GBM surface implementation.

Guest KMS/Virtio-gpu/Qemu Wayland UI:
- Because no_vblank=true for Guest KMS and since the vblank event (which also serves
as the pageflip completion event for user-space) is sent right away after atomic commit,
as Gerd said, we use an internal dma-fence to block/wait the Guest KMS until we know for
sure that the Host is completely done using the buffer. To ensure this, we signal the dma-fence
only after the Host compositor sends a wl_buffer.release event or an equivalent signal.

The goal:
- Maintain full framerate even when the Guest scanout FB is flipped onto a hardware plane
on the Host -- regardless of either compositor's scheduling policy -- without making any
copies and ensuring that both Host and Guest are not accessing the buffer at the same time.

The problem:
- If the Host compositor flips the client's buffer (in this case Guest compositor's buffer) 
onto a hardware plane, then it can send a wl_buffer.release event for the previous buffer
only after it gets a pageflip completion. And, if the Guest compositor takes 10-12 ms to
submit a new buffer and given the fact that the Host compositor waits only for 9 ms, the
Guest compositor will miss the Host's repaint cycle resulting in halved frame-rate.

The solution:
- To ensure full framerate, the Guest compositor has to start it's repaint cycle (including
the 9 ms wait) when the Host compositor sends the frame callback event to its clients.
In order for this to happen, the dma-fence that the Guest KMS waits on -- before sending
pageflip completion -- cannot be tied to a wl_buffer.release event. This means that, the
Guest compositor has to be forced to use a new buffer for its next repaint cycle when it
gets a pageflip completion.
- The Weston MR I linked above does this by getting an out_fence fd and taking a reference
on all the FBs included in the atomic commit forcing the compositor to use new FBs for its
next repaint cycle. It releases the references when the out_fence is signalled later when
the Host compositor sends a wl_buffer.release event.

> 
> Unfortunately that means for virtual hw where it takes a few more
> steps/vblanks until the framebuffer actually shows up on screen and is
> scanned out, we suffer deeply. The usual fix for that is to drop the
> latency and increase throughput, and have more buffers in-flight. Which
> this patch tries to do.
> 
> Now I think where we go wrong here is that we're trying to hack this up by
> defining different semantics for the out-fence and for the drm-event. Imo
> that's wrong, they're both meant to show eactly the same thing:
> - when is the new frame actually visible to the user (as in, eyeballs in a
>   human head, preferrably, not the time when we've handed the buffer off
>   to the virtual hw)
> - when is the previous buffer no longer being used by the scanout hw
[Kasireddy, Vivek] Right, they both mean the same thing but I think using both
at the same time would be redundant in the case of Weston. That's why I am trying
to repurpose the usage of out_fence in this case by introducing a new capability
that may not be relevant for bare-metal KMS drivers but would be useful for
virtual KMS drivers.

> 
> We do cheat a bit right now in so far that we assume they're both the
> same, as in, panel-side latency is currently the compositor's problem to
> figure out.
> 
> So for virtual hw I think the timestamp and even completion really need to
> happen only when the buffer has been pushed through the entire
> virtualization chain, i.e. ideally we get the timestamp from the kms
> driver from the host side. Currently that's not done, so this is most
> likely quite broken already (virtio relies on the no-vblank auto event
> sending, which definitely doesn't wait for anything, or I'm completely
> missing something).
[Kasireddy, Vivek] You are right; virtio_gpu does use the no_vblank auto event but
as I mentioned above we do use an internal dma-fence to wait until the submitted
buffer is no longer used by the Host. In other words, we wait (in update_planes hook)
until we get an appropriate signal from the Host to proceed to make sure that we are
not rendering faster than what the Host can display.

However, as you suggest below, we could set no_vblank=false and send the vblank/
pageflip completion event from the virtio-gpu driver instead of having the DRM
core send it. This can prevent the DRM core from signalling the out_fence as well
which is my intended objective and what my first patch tries to do. I'd still need the
new capability though to include the patch in Weston that deals with out_fence --
unless Weston upstream can accept the patch after reviewing it without this newly
added capability which would be redundant but it does solve my problem. Would
this be acceptable?

> 
> I think instead of hacking up some ill-defined 1.5 queue depth support,
> what we should do is support queue depth > 1 properly. So:
> 
> - Change atomic to support queue depth > 1, this needs to be a per-driver
>   thing due to a bunch of issues in driver code. Essentially drivers must
>   never look at obj->state pointers, and only ever look up state through
>   the passed-in drm_atomic_state * update container.
> 
> - Aside: virtio should loose all it's empty hooks, there's no point in
>   that.
> 
> - We fix virtio to send out the completion event at the end of this entire
>   pipeline, i.e. virtio code needs to take care of sending out the
>   crtc_state->event correctly.
> 
> - We probably also want some kind of (maybe per-crtc) recommended queue
>   depth property so compositors know how many buffers to keep in flight.
>   Not sure about that.
> 
> It's a bit more work, but also a lot less hacking around infrastructure in
> dubious ways.
> 
> Thoughts?
[Kasireddy, Vivek] IIUC, you are suggesting that we should make it possible to
submit a new atomic commit even though the completion event for the previous
one has not come in yet. This may potentially solve my problem but it sounds very
disruptive and not very useful for bare-metal cases. It also means that the compositors,
DRM core and the drivers need to keep track of multiple states -- as opposed to new and
old -- for all objects such as crtcs, planes, etc and account for multiple completion events.
I guess it is doable but as you suggest it seems like a lot of work with many pitfalls ahead.

Thanks,
Vivek
> 
> Cheers, Daniel
> 
> >
> > Cc: Daniel Vetter <daniel at ffwll.ch>
> > Cc: Gerd Hoffmann <kraxel at redhat.com>
> > Cc: Pekka Paalanen <ppaalanen at gmail.com>
> > Cc: Simon Ser <contact at emersion.fr>
> > Cc: Michel Dänzer <michel at daenzer.net>
> > Cc: Tina Zhang <tina.zhang at intel.com>
> > Cc: Dongwon Kim <dongwon.kim at intel.com>
> >
> > Vivek Kasireddy (4):
> >   drm: Add a capability flag to support deferred out_fence signalling
> >   virtio-gpu uapi: Add VIRTIO_GPU_F_OUT_FENCE feature
> >   drm/virtio: Add VIRTIO_GPU_CMD_RESOURCE_OUT_FENCE cmd
> >   drm/virtio: Probe and implement VIRTIO_GPU_F_OUT_FENCE feature
> >
> >  drivers/gpu/drm/drm_file.c               | 11 +++---
> >  drivers/gpu/drm/drm_ioctl.c              |  3 ++
> >  drivers/gpu/drm/virtio/virtgpu_debugfs.c |  1 +
> >  drivers/gpu/drm/virtio/virtgpu_drv.c     |  1 +
> >  drivers/gpu/drm/virtio/virtgpu_drv.h     |  6 ++++
> >  drivers/gpu/drm/virtio/virtgpu_fence.c   |  9 +++++
> >  drivers/gpu/drm/virtio/virtgpu_kms.c     | 10 ++++--
> >  drivers/gpu/drm/virtio/virtgpu_plane.c   | 44 +++++++++++++++++++++++-
> >  drivers/gpu/drm/virtio/virtgpu_vq.c      | 17 +++++++++
> >  include/drm/drm_mode_config.h            |  9 +++++
> >  include/uapi/drm/drm.h                   |  1 +
> >  include/uapi/linux/virtio_gpu.h          | 12 +++++++
> >  12 files changed, 117 insertions(+), 7 deletions(-)
> >
> > --
> > 2.30.2
> >
> 
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


More information about the dri-devel mailing list