How to design a DRM KMS driver exposing 2D compositing?

Mon Aug 11 10:27:45 PDT 2014

On Mon, Aug 11, 2014 at 10:16:24AM -0700, Eric Anholt wrote:
> Daniel Vetter <daniel at ffwll.ch> writes:
> 
> > On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> >> Hi,
> >> 
> >> there is some hardware than can do 2D compositing with an arbitrary
> >> number of planes. I'm not sure what the absolute maximum number of
> >> planes is, but for the discussion, let's say it is 100.
> >> 
> >> There are many complicated, dynamic constraints on how many, what size,
> >> etc. planes can be used at once. A driver would be able to check those
> >> before kicking the 2D compositing engine.
> >> 
> >> The 2D compositing engine in the best case (only few planes used) is
> >> able to composite on the fly in scanout, just like the usual overlay
> >> hardware blocks in CRTCs. When the composition complexity goes up, the
> >> driver can fall back to compositing into a buffer rather than on the
> >> fly in scanout. This fallback needs to be completely transparent to the
> >> user space, implying only additional latency if anything.
> >> 
> >> These 2D compositing features should be exposed to user space through a
> >> standard kernel ABI, hopefully an existing ABI in the very near future
> >> like the KMS atomic.
> >
> > I presume we're talking about the video core from raspi? Or at least
> > something similar?
> 
> Pekka wasn't sure if things were confidential here, but I can say it:
> Yeah, it's the RPi.
> 
> While I haven't written code using the compositor interface (I just did
> enough to shim in a single plane for bringup, and I'm hoping Pekka and
> company can handle the rest for me :) ), my understanding is that the
> way you make use of it is that you've got your previous frame loaded up
> in the HVS (the plane compositor hardware), then when you're asked to
> put up a new frame that's going to be too hard, you take some
> complicated chunk of your scene and ask the HVS to use any spare
> bandwidth it has while it's still scanning out the previous frame in
> order to composite that piece of new scene into memory.  Then, when it's
> done with the offline composite, you ask the HVS to do the next scanout
> frame using the original scene with the pre-composited temporary buffer.
> 
> I'm pretty comfortable with the idea of having some large number of
> planes preallocated, and deciding that "nobody could possibly need more
> than 16" (or whatever).
> 
> My initial reaction to "we should just punt when we run out of bandwidth
> and have a special driver interface for offline composite" was "that's
> awful, when the kernel could just get the job done immediately, and
> easily, and it would know exactly what it needed to composite to get
> things to fit (unlike userspace)".  I'm trying to come up with what
> benefit there would be to having a separate interface for offline
> composite.  I've got 3 things:
> 
> - Avoids having a potentially long, interruptible wait in the modeset
>   path while the offline composite happens.  But I think we have other
>   interruptible waits in that path alreaady.
> 
> - Userspace could potentially do something else besides use the HVS to
>   get the fallback done.  Video would have to use the HVS, to get the
>   same scaling filters applied as the previous frame where things *did*
>   fit, but I guess you could composite some 1:1 RGBA overlays in GL,
>   which would have more BW available to it than what you're borrowing
>   from the previous frame's HVS capacity.
> 
> - Userspace could potentially use the offline composite interface for
>   things besides just the running-out-of-bandwidth case.  Like, it was
>   doing a nicely-filtered downscale of an overlaid video, then the user
>   hit pause and walked away: you could have a timeout that noticed that
>   the complicated scene hadn't changed in a while, and you'd drop from
>   overlays to a HVS-composited single plane to reduce power.
> 
> The third one is the one I've actually found kind of compelling, and
> might be switching me from wanting no userspace visibility into the
> fallback.  But I don't have a good feel for how much complexity there is
> to our descriptions of planes, and how much poorly-tested interface we'd
> be adding to support this usecase.

Compositor should already do a rough bw guesstimate and if stuff doesn't
change any more bake the entire scene into a single framebuffer. The exact
same issue happens on more usual hw with video overlays, too.

Ofc if it turns out that scanning out your yuv planes is less bw then the
overlay shouldn't be stopped ofc. But imo there's nothing special here for
the rpi.

> (Because, honestly, I don't expect the fallbacks to be hit much -- my
> understanding of the bandwidth equation is that you're mostly counting
> the number of pixels that have to be read, and clipped-out pixels
> because somebody's overlaid on top of you don't count unless they're in
> the same burst read.  So unless people are going nuts with blending in
> overlays, or downscaled video, it's probably not a problem, and
> something that gets your pixels on the screen at all is sufficient)

Yeah I guess we need to check reality here. If the "we've run out of bw"
case just never happens then it's pointless to write special code for it.
And we can always add a limit later for the case where GL is usually
better and tell userspace that we can't do this many planes. Exact same
thing with running out of memory bw can happen anywhere else, too.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch