[RFC 0/9] nuclear pageflip

Fri Sep 14 09:29:04 PDT 2012

On Fri, Sep 14, 2012 at 10:48 AM, Ville Syrjälä
<ville.syrjala at linux.intel.com> wrote:
> On Fri, Sep 14, 2012 at 09:45:18AM -0500, Rob Clark wrote:
>> On Fri, Sep 14, 2012 at 8:58 AM, Ville Syrjälä
>> <ville.syrjala at linux.intel.com> wrote:
>> > On Fri, Sep 14, 2012 at 08:25:53AM -0500, Rob Clark wrote:
>> >> On Fri, Sep 14, 2012 at 7:50 AM, Ville Syrjälä
>> >> <ville.syrjala at linux.intel.com> wrote:
>> >> > On Thu, Sep 13, 2012 at 11:35:59AM -0500, Rob Clark wrote:
>> >> >> On Thu, Sep 13, 2012 at 9:29 AM, Ville Syrjälä
>> >> >> <ville.syrjala at linux.intel.com> wrote:
>> >> >> > On Thu, Sep 13, 2012 at 08:39:54AM -0500, Rob Clark wrote:
>> >> >> >> On Thu, Sep 13, 2012 at 3:40 AM, Ville Syrjälä
>> >> >> >> <ville.syrjala at linux.intel.com> wrote:
>> >> [snip]
>> >> >> >> >
>> >> >> >> > I would say this is going to be the most common use case if you consider
>> >> >> >> > just the number of shipping devices. It's pretty much what every Android
>> >> >> >> > phone/tablet with a HDMI port has to do.
>> >> >> >>
>> >> >> >> bleh, surfaceflinger kinda sucks then..
>> >> >> >
>> >> >> > Why? This use case is not enforced by surfaceflinger, it's just the use
>> >> >> > case most devices would have.
>> >> >> >
>> >> >> > I don't think there's anything wrong with the way surfaceflinger is designed
>> >> >> > with the prepare and commit phases. How else would you do it?
>> >> >>
>> >> >> well, maybe I misunderstood how surfaceflinger works, but it sounded
>> >> >> like it has one prepare/commit phase across outputs, vs what weston
>> >> >> compositor does where each output is rendered and flipped
>> >> >> independently at the rate of that particular output.  If the two
>> >> >> outputs just happen to be vsync aligned, you would end up flipping at
>> >> >> the same time, but if they are not locked you don't have any artificial
>> >> >> constraint in the rendering/flipping.
>> >> >
>> >> > OK so it's purely a pull based model, whereas surfaceflinger is more
>> >> > push based.
>> >> >
>> >> > I suppose it might be possible to make surfaceflinger support a pull
>> >> > model by driving the compositor loop through a combined signal from
>> >> > multiple outputs. But IIRC it did have some timing related code in
>> >> > there somewhere, so it might not be happy about it. It might also
>> >>
>> >> As I understood, at least in older android versions,
>> >> rendering was based on a timer as there was no vblank event to
>> >> userspace on most SoC platforms (which sounds strange, but so far most
>> >> SoC's are using fbdev and/or crazy hacks rather than drm/kms)
>> >>
>> >> not sure if the timer is still there.. but I hope it goes away, it is
>> >> really a horrible way to keep track of vsync
>> >
>> > I've only looked at ICS in any detail. At least there we used the page
>> > flip event from one display to set the pace of the compositor loop.
>> > IIRC JB is supposed to have some vsync related changes, but I haven't
>> > looked at the code.
>> >
>> >> > affect the clients' rendering speed since the compositor would be
>> >> > pulling their buffers from queue at non-constant speed. I don't
>> >> > remember the details of the buffer management very well, so I can't be
>> >> > sure though. But I probably wouldn't bother trying this, since the
>> >> > straightforward approach is so simple, and the results are reasonably
>> >> > good.
>> >> >
>> >> > The pull model does seem more flexible. But it does require a bit of
>> >> > extra complexity in the compositor to avoid compositing the same scene
>> >> > multiple times needlessly when multiple cloned displays are involved.
>> >> > I suppose ideally you'd want to recompose for each display to minimize
>> >> > visible latency, but from power usage POV it may not be a good idea.
>> >>
>> >> fwiw, weston is already being pretty clever about keeping track of
>> >> damage and minimizing the area of the screen that must be re-rendered.
>> >>  I'm not sure if SF does anything like this.
>> >
>> > IIRC it can do that, but the EGL implementation needs to support
>> > EGL_BUFFER_PRESERVED.
>> >
>> > I suppose the best way to implement EGL_BUFFER_PRESERVED with
>> > page flips would be to schedule the flip and immediately perform
>> > a blit from the new front buffer to the new back buffer. Well,
>> > unless the hardware has some more clever mechanism for it.
>> >
>> > Does weston depend on preserved flips too, or can it even track
>> > damage independently for each buffer?
>>
>> well, weston knows how many buffers are at play.  So it takes the
>> union of the damage from the last time the buffer was used (well,
>> currently it assumes only double buffered) and the new damage.
>
> With more buffers it'll get a bit more complicated as it needs to keep
> accumulating the damage for all buffers. But it should still be fairly
> trivial when you're in full control of the buffers.

well, just track previous damage per buffer.. but yeah, slightly more
complicated
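
To make the per-buffer tracking concrete, here is a minimal sketch.  It is
not weston's actual code: all names (output_damage, damage_add, damage_take,
NUM_BUFFERS) are made up for illustration, and damage is simplified to a
single bounding rectangle per buffer where a real compositor would use a
proper region.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout; bounding boxes stand in for real regions. */
#define NUM_BUFFERS 3

struct rect { int x1, y1, x2, y2; };   /* empty when x2 <= x1 or y2 <= y1 */

struct output_damage {
    struct rect pending[NUM_BUFFERS];  /* area each buffer still has to repaint */
};

static bool rect_is_empty(struct rect r)
{
    return r.x2 <= r.x1 || r.y2 <= r.y1;
}

/* Bounding box of two rectangles; an empty rectangle contributes nothing. */
static struct rect rect_union(struct rect a, struct rect b)
{
    if (rect_is_empty(a))
        return b;
    if (rect_is_empty(b))
        return a;
    struct rect r = {
        a.x1 < b.x1 ? a.x1 : b.x1,
        a.y1 < b.y1 ? a.y1 : b.y1,
        a.x2 > b.x2 ? a.x2 : b.x2,
        a.y2 > b.y2 ? a.y2 : b.y2,
    };
    return r;
}

/* New scene damage makes every buffer stale in that area. */
static void damage_add(struct output_damage *d, struct rect r)
{
    for (int i = 0; i < NUM_BUFFERS; i++)
        d->pending[i] = rect_union(d->pending[i], r);
}

/* About to render into buffer idx: return what must be repainted and
 * reset that buffer's accumulated damage. */
static struct rect damage_take(struct output_damage *d, int idx)
{
    assert(idx >= 0 && idx < NUM_BUFFERS);
    struct rect r = d->pending[idx];
    d->pending[idx] = (struct rect){ 0, 0, 0, 0 };
    return r;
}
```

With this, a buffer that sat out a frame repaints the union of everything
damaged since it was last rendered, and a buffer rendered last frame only
repaints the new damage.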

>> This
>> way it avoids the need for the gl driver, which doesn't know what is
>> going on as well as the app does, to do a back-blit.  It can do
>> this because w/ drm/gbm egl winsys, eglSwapBuffers() doesn't actually
>> swap the buffers on the display and weston is in charge of which
>> buffer is displayed or rendered.  Weston explicitly calls page flip
>> ioctl.  The good news being that it can atomically flip overlay layers
>> at the same time once the new ioctl is in place.
>
> Yeah, with EGL in the mix, as can be the case with Android, the layering
> can start to work against you a little bit. Well, it's not too bad based
> on my experience though.
>
>> Maybe it is useful to look at http://github.com/robclark/kmscube .. it
>> doesn't actually use planes, but shows the interaction of egl and kms.
>>  Maybe I should enhance it w/ multiple rotating cubes on different
>> overlays. ;-)
>
> When doing the Medfield work, I had a test case which utilized the
> video overlays and the primary plane. What it did was draw ugly
> colored rectangles on the primary layer, and positioned the overlays
> to cover those up exactly. When the atomic page flip system was
> operating as intended you could only see a solid color on the screen.
> It randomly changed the position and size of the rectangles and
> overlays. I really need to dig that up and modify it to work with my
> current code.

it would be nice to resurrect this using drm/gbm stuff so we can use
it as test code on multiple platforms.  (I've tested kmscube on i915
and omap)

>> >> >> note that the test phase doesn't need vblank events, and also
>> >> >> shouldn't -EBUSY if there is still a pending flip[*],
>> >> >
>> >> > Right. Personally I'm not a fan of the EBUSY behaviour at all. Seems
>> >> > a bit pointless since user space can take care of it via the event
>> >> > mechanism. But I suppose you want it for omap so that you can avoid
>> >> > having to write software workarounds to overcome the GO bit
>> >> > limitations.
>> >>
>> >> I think the main issue is disconnecting an overlay from one crtc and
>> >> connecting to another.. I would expect that any hw which can connect
>> >> an ovl to more than one possible crtc would have the same limit (ie.
>> >> have to wait until scanout on previous crtc completes), so I think
>> >> EBUSY is a good way to indicate to userspace that the requested
>> >> configuration is not possible *now* but would be possible in the
>> >> future.
>> >
>> > Intel HW can do the transition automagically, but if you try to
>> > combine it with other page flips, the driver would have to perform some
>> > gymnastics to make things appear atomic. Of course if you'd try to swap
>> > overlay A from pipe 1 to pipe 2, and overlay B from pipe 2 to pipe 1 at
>> > the same time, there's just no way to do that without sacrificing
>> > atomicity on one of the pipes.
>> >
>> > So even with such HW, it's probably easier to forget about the feature,
>> > and require user space to perform the disable+enable sequence in two steps.
>>
>> true, but I don't want to block the disable until vblank w/
>> atomic-pageflip, and if userspace re-enables the plane on a different
>> crtc before the next vblank, it would be useful for the driver to have
>> a way to say 'try again later'.
>
> Yeah, trying to do it all asynchronously in the kernel would perhaps
> be too much work for little gain.
>
> I wonder if you've thought about omap's FIFO merge. It can cause similar
> issues, that is some operations may need two vblanks to complete. And it
> looks like I'll get to worry about this stuff too since there are some
> watermark related wait_for_vblank() workarounds in the IVB sprite code,
> sigh.

yeah, FIFO merge is a nice big headache.. and not really ideal for
latency unless you have some advance warning to disable FIFO merge
before userspace wants to switch on an extra overlay.

I think the best way to deal with it is to just start switching off FIFO
merge when userspace first does test w/ overlay, but return EBUSY.  It means
we'll use the gpu for rendering for one frame, but I think that is
better than blocking the compositor for a vblank or two.  Thou shalt
not block the compositor.
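
Roughly what I have in mind, as a sketch only.  The "driver" here is a mock
and every name (mock_hw, mock_atomic_test_overlay, choose_frame_path) is
hypothetical, not the proposed ioctl.  I'm also assuming FIFO merge is fully
off after a single vblank, which real hw may not guarantee.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Mock hardware state; hypothetical, for illustration only. */
struct mock_hw {
    bool fifo_merge_enabled;
    bool fifo_merge_disable_pending;
};

/* Test phase: never blocks.  If FIFO merge is still on, kick off the
 * disable and tell userspace to try again later. */
static int mock_atomic_test_overlay(struct mock_hw *hw)
{
    if (hw->fifo_merge_enabled) {
        hw->fifo_merge_disable_pending = true;
        return -EBUSY;
    }
    return 0;
}

/* The pending disable takes effect at vblank (one vblank assumed here). */
static void mock_vblank(struct mock_hw *hw)
{
    if (hw->fifo_merge_disable_pending) {
        hw->fifo_merge_enabled = false;
        hw->fifo_merge_disable_pending = false;
    }
}

enum frame_path { FRAME_OVERLAY, FRAME_GPU_COMPOSITE, FRAME_ERROR };

/* Compositor side: on EBUSY fall back to GPU composition for this frame
 * instead of waiting, and simply test again on the next frame. */
static enum frame_path choose_frame_path(struct mock_hw *hw)
{
    int ret = mock_atomic_test_overlay(hw);

    if (ret == 0)
        return FRAME_OVERLAY;
    if (ret == -EBUSY)
        return FRAME_GPU_COMPOSITE;
    return FRAME_ERROR;
}
```

So the first frame that wants the overlay gets composited on the gpu, and
the following frame can use the overlay without anyone having blocked.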

>> And if we do support multiple crtc's w/ pageflip, I'm not sure if
>> there is a good way to enforce two-steps.  Having a standardized way
>> to tell userspace to try later seems like a good thing.
>
> Sure, for that it seems reasonable.
>
>> >> >> >> Also, if you pageflip on multiple CRTC's, should there be multiple
>> >> >> >> vblank events, and multiple userdata's?
>> >> >> >
>> >> >> > That's a bit of an open question. I was considering several options:
>> >> >>
>> >> >> the thing I like about one ioctl per crtc is that it avoids this whole
>> >> >> question..
>> >> >>
>> >> >> And, I think as long as you have to update multiple different scanout
>> >> >> address registers, there is always going to be a race in multi-crtc
>> >> >> flipping.  Having a single ioctl does make the race smaller.  I'm not
>> >> >> sure how important that point is.
>> >> >
>> >> > Which race?
>> >>
>> >> ie. if you set REG_CRTC1_ADDR just immediately before vblank and
>> >> REG_CRTC2_ADDR just after
>> >
>> > Well, with unsynced crtcs I wouldn't call that any kind of meaningful race.
>> > The same problem after all exists even with a single crtc. You either make
>> > the deadline and write the register before vblank, or you don't make it
>> > and end up with a repeated frame.
>>
>> I meant w/ sync'd crtc's, there is still no 100% guarantee that the
>> two flip at the same time.
>
> Sure there is. That's what the vblank evade stuff gives you. I just
> happen to need it even when using just one crtc because the hardware
> doesn't have the necessary mechanism to flip several planes atomically.

hmm, I guess I don't quite follow then.  But I guess I don't know the
intel hw well enough.  It seemed like you weren't atomically updating
scanout registers.

But anyways, I think it is probably ok to not need the crtc up-front.
We can catch issues w/ pending vblank at the atomic_test() stage.
Still not sure what to do about userdata.  Although I suppose we could
make userdata a property attached to crtc and/or plane and that gives
userspace plenty of flexibility about how many events it wants back.
(Ie. no event if userdata==0.. or maybe separate send-event property.)
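
A sketch of the userdata==0 variant, just to illustrate the idea.  The
struct and function names (flip_object, flip_event, collect_flip_events)
are invented for this mail and are not existing drm interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A crtc or plane taking part in the flip; userdata would be set via a
 * (hypothetical) property.  Zero means "no event wanted". */
struct flip_object {
    uint32_t id;
    uint64_t userdata;
};

/* What userspace would get back for one object. */
struct flip_event {
    uint32_t object_id;
    uint64_t userdata;
};

/* Build the list of events to send once the flip completes: one per
 * object with non-zero userdata.  Returns the number of events written,
 * never more than out_cap. */
static size_t collect_flip_events(const struct flip_object *objs, size_t n_objs,
                                  struct flip_event *out, size_t out_cap)
{
    assert(objs != NULL && out != NULL);

    size_t n = 0;
    for (size_t i = 0; i < n_objs && n < out_cap; i++) {
        if (objs[i].userdata == 0)
            continue;   /* userspace did not ask for an event here */
        out[n].object_id = objs[i].id;
        out[n].userdata = objs[i].userdata;
        n++;
    }
    return n;
}
```

Userspace that wants a single event per flip sets userdata on one crtc
only; userspace that wants per-plane events sets it on each plane.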

BR,
-R

> --
> Ville Syrjälä
> Intel OTC