How to design a DRM KMS driver exposing 2D compositing?

Tue Aug 12 01:48:13 PDT 2014

On Mon, 11 Aug 2014 19:27:45 +0200
Daniel Vetter <daniel at ffwll.ch> wrote:

> On Mon, Aug 11, 2014 at 10:16:24AM -0700, Eric Anholt wrote:
> > Daniel Vetter <daniel at ffwll.ch> writes:
> > 
> > > On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> > >> Hi,
> > >> 
> > >> there is some hardware that can do 2D compositing with an arbitrary
> > >> number of planes. I'm not sure what the absolute maximum number of
> > >> planes is, but for the discussion, let's say it is 100.
> > >> 
> > >> There are many complicated, dynamic constraints on how many, what size,
> > >> etc. planes can be used at once. A driver would be able to check those
> > >> before kicking the 2D compositing engine.
> > >> 
> > >> The 2D compositing engine in the best case (only few planes used) is
> > >> able to composite on the fly in scanout, just like the usual overlay
> > >> hardware blocks in CRTCs. When the composition complexity goes up, the
> > >> driver can fall back to compositing into a buffer rather than on the
> > >> fly in scanout. This fallback needs to be completely transparent to the
> > >> user space, implying only additional latency if anything.
> > >> 
> > >> These 2D compositing features should be exposed to user space through a
> > >> standard kernel ABI, hopefully an existing ABI in the very near future
> > >> like the KMS atomic.
> > >
> > > I presume we're talking about the video core from raspi? Or at least
> > > something similar?
> > 
> > Pekka wasn't sure if things were confidential here, but I can say it:
> > Yeah, it's the RPi.
> > 
> > While I haven't written code using the compositor interface (I just did
> > enough to shim in a single plane for bringup, and I'm hoping Pekka and
> > company can handle the rest for me :) ), my understanding is that the
> > way you make use of it is that you've got your previous frame loaded up
> > in the HVS (the plane compositor hardware), then when you're asked to
> > put up a new frame that's going to be too hard, you take some
> > complicated chunk of your scene and ask the HVS to use any spare
> > bandwidth it has while it's still scanning out the previous frame in
> > order to composite that piece of new scene into memory.  Then, when it's
> > done with the offline composite, you ask the HVS to do the next scanout
> > frame using the original scene with the pre-composited temporary buffer.
> > 
> > I'm pretty comfortable with the idea of having some large number of
> > planes preallocated, and deciding that "nobody could possibly need more
> > than 16" (or whatever).
> > 
> > My initial reaction to "we should just punt when we run out of bandwidth
> > and have a special driver interface for offline composite" was "that's
> > awful, when the kernel could just get the job done immediately, and
> > easily, and it would know exactly what it needed to composite to get
> > things to fit (unlike userspace)".  I'm trying to come up with what
> > benefit there would be to having a separate interface for offline
> > composite.  I've got 3 things:
> > 
> > - Avoids having a potentially long, interruptible wait in the modeset
> >   path while the offline composite happens.  But I think we have other
> >   interruptible waits in that path already.
> > 
> > - Userspace could potentially do something else besides use the HVS to
> >   get the fallback done.  Video would have to use the HVS, to get the
> >   same scaling filters applied as the previous frame where things *did*
> >   fit, but I guess you could composite some 1:1 RGBA overlays in GL,
> >   which would have more BW available to it than what you're borrowing
> >   from the previous frame's HVS capacity.
> > 
> > - Userspace could potentially use the offline composite interface for
> >   things besides just the running-out-of-bandwidth case.  Like, it was
> >   doing a nicely-filtered downscale of an overlaid video, then the user
> >   hit pause and walked away: you could have a timeout that noticed that
> >   the complicated scene hadn't changed in a while, and you'd drop from
> >   overlays to a HVS-composited single plane to reduce power.
> > 
> > The third one is the one I've actually found kind of compelling, and
> > might be switching me from wanting no userspace visibility into the
> > fallback.  But I don't have a good feel for how much complexity there is
> > to our descriptions of planes, and how much poorly-tested interface we'd
> > be adding to support this usecase.
> 
> Compositor should already do a rough bw guesstimate and if stuff doesn't
> change any more bake the entire scene into a single framebuffer. The exact
> same issue happens on more usual hw with video overlays, too.
> 
> Ofc if it turns out that scanning out your yuv planes is less bw then the
> overlay shouldn't be stopped ofc. But imo there's nothing special here for
> the rpi.
>  
> > (Because, honestly, I don't expect the fallbacks to be hit much -- my
> > understanding of the bandwidth equation is that you're mostly counting
> > the number of pixels that have to be read, and clipped-out pixels
> > because somebody's overlaid on top of you don't count unless they're in
> > the same burst read.  So unless people are going nuts with blending in
> > overlays, or downscaled video, it's probably not a problem, and
> > something that gets your pixels on the screen at all is sufficient)
> 
> Yeah I guess we need to check reality here. If the "we've run out of bw"
> case just never happens then it's pointless to write special code for it.
> And we can always add a limit later for the case where GL is usually
> better and tell userspace that we can't do this many planes. Exact same
> thing with running out of memory bw can happen anywhere else, too.

I had a chat with Eric last night, and our different views about the
on-line/real-time performance limits of the HVS seem to be due to alpha
blending.

Eric has not been using alpha blending much or at all, while my
experiments with Weston and DispmanX pretty much always need alpha
blending (e.g. because DispmanX cannot say that only a sub-region of a
buffer needs blending). Eric says alpha blending kills the performance.

This makes me think that maybe I should expose only one or two (cursor?)
planes with alpha blending formats, and all other planes with only
opaque formats. That would naturally limit the compositor's use of
planes to cases where it probably matters most: cursors, and opaque
video and OpenGL surfaces.
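
As a rough sketch of that split (not actual driver code: the plane kinds,
the table names and plane_accepts() are made up for illustration, and the
format codes only mirror the fourcc encoding used in drm_fourcc.h), the
per-plane format lists could look something like this:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same encoding as fourcc_code() in drm_fourcc.h, redefined here so the
 * sketch stands alone. */
#define FOURCC(a, b, c, d) \
	((uint32_t)(a) | ((uint32_t)(b) << 8) | \
	 ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

#define FMT_XRGB8888 FOURCC('X', 'R', '2', '4') /* opaque RGB */
#define FMT_ARGB8888 FOURCC('A', 'R', '2', '4') /* RGB with alpha */
#define FMT_YUV420   FOURCC('Y', 'U', '1', '2') /* planar video */

/* Hypothetical plane classes: one or two cursor planes, many overlays. */
enum plane_kind { PLANE_CURSOR, PLANE_OVERLAY };

/* Only the cursor plane advertises a format with an alpha channel. */
static const uint32_t cursor_formats[] = { FMT_ARGB8888, FMT_XRGB8888 };

/* All other planes advertise opaque formats only. */
static const uint32_t overlay_formats[] = { FMT_XRGB8888, FMT_YUV420 };

/* Return true if a plane of the given kind advertises the format. */
bool plane_accepts(enum plane_kind kind, uint32_t format)
{
	const uint32_t *formats;
	size_t count, i;

	if (kind == PLANE_CURSOR) {
		formats = cursor_formats;
		count = sizeof(cursor_formats) / sizeof(cursor_formats[0]);
	} else {
		formats = overlay_formats;
		count = sizeof(overlay_formats) / sizeof(overlay_formats[0]);
	}

	for (i = 0; i < count; i++) {
		if (formats[i] == format)
			return true;
	}
	return false;
}
```

A compositor probing the planes would then find that an ARGB8888 surface
fits only the cursor plane, and would send every other blended surface to
its renderer.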

Then all the alpha-blended stuff will hit the fallback... which is...
Pixman at the moment (thinking about Weston here) until Eric gets
the GLESv2 flying. :-/

That means that doing a driver-specific kernel/user ABI for using the
HVS seems required. Write a driver-specific libdrm API for it, use it
directly in Weston, see what falls out later.

Thanks,
pq