How to design a DRM KMS driver exposing 2D compositing?

Tue Aug 12 09:10:47 PDT 2014

Pekka Paalanen <ppaalanen at gmail.com> writes:

> On Mon, 11 Aug 2014 19:27:45 +0200
> Daniel Vetter <daniel at ffwll.ch> wrote:
>
>> On Mon, Aug 11, 2014 at 10:16:24AM -0700, Eric Anholt wrote:
>> > Daniel Vetter <daniel at ffwll.ch> writes:
>> > 
>> > > On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
>> > >> Hi,
>> > >> 
>> > >> there is some hardware that can do 2D compositing with an arbitrary
>> > >> number of planes. I'm not sure what the absolute maximum number of
>> > >> planes is, but for the discussion, let's say it is 100.
>> > >> 
>> > >> There are many complicated, dynamic constraints on how many, what size,
>> > >> etc. planes can be used at once. A driver would be able to check those
>> > >> before kicking the 2D compositing engine.
>> > >> 
>> > >> The 2D compositing engine in the best case (only a few planes used) is
>> > >> able to composite on the fly in scanout, just like the usual overlay
>> > >> hardware blocks in CRTCs. When the composition complexity goes up, the
>> > >> driver can fall back to compositing into a buffer rather than on the
>> > >> fly in scanout. This fallback needs to be completely transparent to the
>> > >> user space, implying only additional latency if anything.
>> > >> 
>> > >> These 2D compositing features should be exposed to user space through a
>> > >> standard kernel ABI, hopefully an existing ABI in the very near future
>> > >> like the KMS atomic.
>> > >
>> > > I presume we're talking about the video core from raspi? Or at least
>> > > something similar?
>> > 
>> > Pekka wasn't sure if things were confidential here, but I can say it:
>> > Yeah, it's the RPi.
>> > 
>> > While I haven't written code using the compositor interface (I just did
>> > enough to shim in a single plane for bringup, and I'm hoping Pekka and
>> > company can handle the rest for me :) ), my understanding is that the
>> > way you make use of it is that you've got your previous frame loaded up
>> > in the HVS (the plane compositor hardware), then when you're asked to
>> > put up a new frame that's going to be too hard, you take some
>> > complicated chunk of your scene and ask the HVS to use any spare
>> > bandwidth it has while it's still scanning out the previous frame in
>> > order to composite that piece of new scene into memory.  Then, when it's
>> > done with the offline composite, you ask the HVS to do the next scanout
>> > frame using the original scene with the pre-composited temporary buffer.
>> > 
>> > I'm pretty comfortable with the idea of having some large number of
>> > planes preallocated, and deciding that "nobody could possibly need more
>> > than 16" (or whatever).
>> > 
>> > My initial reaction to "we should just punt when we run out of bandwidth
>> > and have a special driver interface for offline composite" was "that's
>> > awful, when the kernel could just get the job done immediately, and
>> > easily, and it would know exactly what it needed to composite to get
>> > things to fit (unlike userspace)".  I'm trying to come up with what
>> > benefit there would be to having a separate interface for offline
>> > composite.  I've got 3 things:
>> > 
>> > - Avoids having a potentially long, interruptible wait in the modeset
>> >   path while the offline composite happens.  But I think we have other
>> >   interruptible waits in that path already.
>> > 
>> > - Userspace could potentially do something else besides use the HVS to
>> >   get the fallback done.  Video would have to use the HVS, to get the
>> >   same scaling filters applied as the previous frame where things *did*
>> >   fit, but I guess you could composite some 1:1 RGBA overlays in GL,
>> >   which would have more BW available to it than what you're borrowing
>> >   from the previous frame's HVS capacity.
>> > 
>> > - Userspace could potentially use the offline composite interface for
>> >   things besides just the running-out-of-bandwidth case.  Like, it was
>> >   doing a nicely-filtered downscale of an overlaid video, then the user
>> >   hit pause and walked away: you could have a timeout that noticed that
>> >   the complicated scene hadn't changed in a while, and you'd drop from
>> >   overlays to an HVS-composited single plane to reduce power.
>> > 
>> > The third one is the one I've actually found kind of compelling, and
>> > might be switching me from wanting no userspace visibility into the
>> > fallback.  But I don't have a good feel for how much complexity there is
>> > to our descriptions of planes, and how much poorly-tested interface we'd
>> > be adding to support this usecase.
>> 
>> Compositor should already do a rough bw guesstimate and if stuff doesn't
>> change any more bake the entire scene into a single framebuffer. The exact
>> same issue happens on more usual hw with video overlays, too.
>> 
>> Ofc if it turns out that scanning out your yuv planes is less bw, then the
>> overlay shouldn't be stopped. But imo there's nothing special here for
>> the rpi.
>>  
>> > (Because, honestly, I don't expect the fallbacks to be hit much -- my
>> > understanding of the bandwidth equation is that you're mostly counting
>> > the number of pixels that have to be read, and clipped-out pixels
>> > because somebody's overlaid on top of you don't count unless they're in
>> > the same burst read.  So unless people are going nuts with blending in
>> > overlays, or downscaled video, it's probably not a problem, and
>> > something that gets your pixels on the screen at all is sufficient)
>> 
>> Yeah I guess we need to check reality here. If the "we've run out of bw"
>> case just never happens then it's pointless to write special code for it.
>> And we can always add a limit later for the case where GL is usually
>> better and tell userspace that we can't do this many planes. Exact same
>> thing with running out of memory bw can happen anywhere else, too.
>
> I had a chat with Eric last night, and our different views about the
> on-line/real-time performance limits of the HVS seem to be due to alpha
> blending.
>
> Eric has not been using alpha blending much or at all, while my
> experiments with Weston and DispmanX pretty much always need alpha
> blending (e.g. because DispmanX cannot say that only a sub-region of a
> buffer needs blending). Eric says alpha blending kills the
> performance.

Note, I wasn't saying anything about performance.  I was just talking
about how compositing in X knows that (almost) everything is actually
opaque, so I don't have the worries about alpha blending that you
apparently do in Weston.
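
To make the opaque-versus-blended difference concrete, here is a rough
sketch in C of the kind of pixel-read estimate discussed earlier in the
thread. Everything in it is an assumption for illustration: struct plane
and estimate_pixel_reads() are hypothetical names, not part of any DRM or
HVS interface. It only subtracts the largest overlap with a single opaque
plane above, so it is a conservative upper bound, and it ignores
burst-read granularity entirely.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plane description; not a real DRM or HVS structure. */
struct plane {
	int x, y, w, h;	/* destination rectangle on screen, in pixels */
	bool opaque;	/* true if the plane needs no alpha blending */
};

/* Area of the intersection of two plane rectangles, in pixels. */
static uint64_t overlap_area(const struct plane *a, const struct plane *b)
{
	int x0 = a->x > b->x ? a->x : b->x;
	int y0 = a->y > b->y ? a->y : b->y;
	int ax1 = a->x + a->w, bx1 = b->x + b->w;
	int ay1 = a->y + a->h, by1 = b->y + b->h;
	int x1 = ax1 < bx1 ? ax1 : bx1;
	int y1 = ay1 < by1 ? ay1 : by1;

	if (x1 <= x0 || y1 <= y0)
		return 0;
	return (uint64_t)(x1 - x0) * (uint64_t)(y1 - y0);
}

/*
 * Conservative estimate of pixels read per frame. planes[0] is the
 * bottom-most plane. For each plane, only the single largest overlap
 * with an opaque plane above it is treated as clipped out; pixels under
 * an alpha-blended plane still have to be read.
 */
uint64_t estimate_pixel_reads(const struct plane *planes, size_t n)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t area = (uint64_t)planes[i].w * (uint64_t)planes[i].h;
		uint64_t hidden = 0;

		for (size_t j = i + 1; j < n; j++) {
			uint64_t o;

			if (!planes[j].opaque)
				continue;
			o = overlap_area(&planes[i], &planes[j]);
			if (o > hidden)
				hidden = o;
		}
		total += area - hidden;
	}
	return total;
}
```

With a 100x100 plane under a 50x50 plane, this counts 10000 pixels when
the top plane is opaque and 12500 when it is blended, which is the gap
between the X and Weston cases described above.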