[RFC 0/9] nuclear pageflip

Fri Sep 14 06:58:41 PDT 2012

On Fri, Sep 14, 2012 at 08:25:53AM -0500, Rob Clark wrote:
> On Fri, Sep 14, 2012 at 7:50 AM, Ville Syrjälä
> <ville.syrjala at linux.intel.com> wrote:
> > On Thu, Sep 13, 2012 at 11:35:59AM -0500, Rob Clark wrote:
> >> On Thu, Sep 13, 2012 at 9:29 AM, Ville Syrjälä
> >> <ville.syrjala at linux.intel.com> wrote:
> >> > On Thu, Sep 13, 2012 at 08:39:54AM -0500, Rob Clark wrote:
> >> >> On Thu, Sep 13, 2012 at 3:40 AM, Ville Syrjälä
> >> >> <ville.syrjala at linux.intel.com> wrote:
> [snip]
> >> >> >
> >> >> > I would say this is going to be the most common use case if you consider
> >> >> > just the number of shipping devices. It's pretty much what every Android
> >> >> > phone/tablet with a HDMI port has to do.
> >> >>
> >> >> bleh, surfaceflinger kinda sucks then..
> >> >
> >> > Why? This use case is not enforced by surfaceflinger, it's just the use
> >> > case most devices would have.
> >> >
> >> > I don't think there's anything wrong with the way surfaceflinger is designed
> >> > with the prepare and commit phases. How else would you do it?
> >>
> >> well, maybe I misunderstood how surfaceflinger works, but it sounded
> >> like it has one prepare/commit phase across outputs, vs what weston
> >> compositor does where each output is rendered and flipped
> >> independently at the rate of that particular output.  If the two
> >> outputs just happen to be vsync aligned, you would end up flipping at
> >> the same time, but if they are not locked you don't have any artificial
> >> constraint in the rendering/flipping.
> >
> > OK so it's purely a pull based model, whereas surfaceflinger is more
> > push based.
> >
> > I suppose it might be possible to make surfaceflinger support a pull
> > model by driving the compositor loop through a combined signal from
> > multiple outputs. But IIRC it did have some timing related code in
> > there somewhere, so it might not be happy about it. It might also
> 
> As I understood, at least in older Android versions,
> rendering was based on a timer as there was no vblank event to
> userspace on most SoC platforms (which sounds strange, but so far most
> SoC's are using fbdev and/or crazy hacks rather than drm/kms)
> 
> not sure if the timer is still there.. but I hope it goes away, it is
> really a horrible way to keep track of vsync

I've only looked at ICS in any detail. At least there we used the page
flip event from one display to set the pace of the compositor loop.
IIRC JB is supposed to have some vsync related changes, but I haven't
looked at the code.

> > affect the clients' rendering speed since the compositor would be
> > pulling their buffers from queue at non-constant speed. I don't
> > remember the details of the buffer management very well, so I can't be
> > sure though. But I probably wouldn't bother trying this, since the
> > straightforward approach is so simple, and the results are reasonably
> > good.
> >
> > The pull model does seem more flexible. But it does require a bit of
> > extra complexity in the compositor to avoid compositing the same scene
> > multiple times needlessly when multiple cloned displays are involved.
> > I suppose ideally you'd want to recompose for each display to minimize
> > visible latency, but from power usage POV it may not be a good idea.
> 
> fwiw, weston is already being pretty clever about keeping track of
> damage and minimizing the area of the screen that must be re-rendered.
>  I'm not sure if SF does anything like this.

IIRC it can do that, but the EGL implementation needs to support
EGL_BUFFER_PRESERVED.

I suppose the best way to implement EGL_BUFFER_PRESERVED with
page flips would be to schedule the flip and immediately perform
a blit from the new front buffer to the new back buffer. Well,
unless the hardware has some more clever mechanism for it.
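
To make the idea concrete, here is a minimal sketch of the flip-then-blit
approach. It only models two buffers in memory; `struct swapchain`,
`flip_preserved()` and `back_buffer()` are hypothetical names invented for
this illustration, the index toggle stands in for scheduling the real flip,
and the memcpy() stands in for whatever blit the hardware would do.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FB_PIXELS 16  /* tiny hypothetical framebuffer, for illustration only */

struct swapchain {
	uint32_t buf[2][FB_PIXELS];
	int front;  /* index of the buffer currently being scanned out */
};

/* Buffer the client renders into next. */
static uint32_t *back_buffer(struct swapchain *sc)
{
	return sc->buf[sc->front ^ 1];
}

/*
 * Flip, then immediately copy the new front buffer into the new back
 * buffer, so the client finds its previous frame there (preserved
 * behaviour).
 */
static void flip_preserved(struct swapchain *sc)
{
	assert(sc != NULL);
	sc->front ^= 1;  /* stand-in for scheduling the page flip */
	memcpy(back_buffer(sc), sc->buf[sc->front], sizeof sc->buf[0]);
}
```

After each flip_preserved() the back buffer holds the frame that was just
queued for display, so the client only has to redraw what changed.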

Does weston depend on preserved flips too, or can it even track
damage independently for each buffer?
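
By tracking damage independently I mean roughly the following. This is
just a sketch of the idea, not a description of what weston or SF actually
do; `struct damage_tracker`, `add_damage()` and `take_damage()` are made-up
names, and a single bounding rectangle per buffer is a simplification.

```c
#include <assert.h>

#define NUM_BUFS 2  /* assumed double buffering */

struct rect { int x1, y1, x2, y2; };  /* empty when x2 <= x1 or y2 <= y1 */

static int rect_empty(struct rect r)
{
	return r.x2 <= r.x1 || r.y2 <= r.y1;
}

/* Bounding box of two rectangles; an empty rect contributes nothing. */
static struct rect rect_union(struct rect a, struct rect b)
{
	if (rect_empty(a))
		return b;
	if (rect_empty(b))
		return a;
	struct rect u = {
		a.x1 < b.x1 ? a.x1 : b.x1, a.y1 < b.y1 ? a.y1 : b.y1,
		a.x2 > b.x2 ? a.x2 : b.x2, a.y2 > b.y2 ? a.y2 : b.y2,
	};
	return u;
}

struct damage_tracker {
	struct rect pending[NUM_BUFS];  /* stale region of each buffer */
};

/* New scene damage makes that region stale in every buffer. */
static void add_damage(struct damage_tracker *t, struct rect r)
{
	assert(t != 0);
	for (int i = 0; i < NUM_BUFS; i++)
		t->pending[i] = rect_union(t->pending[i], r);
}

/* Region to repaint into buffer 'buf'; the buffer is then up to date. */
static struct rect take_damage(struct damage_tracker *t, int buf)
{
	assert(t != 0 && buf >= 0 && buf < NUM_BUFS);
	struct rect r = t->pending[buf];
	t->pending[buf] = (struct rect){ 0, 0, 0, 0 };
	return r;
}
```

With this scheme no preserved flip is needed: a buffer that missed a frame
simply accumulates the damage of every frame it skipped and repaints the
union the next time it is used.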

> >> >> >> From userspace API, I guess something like:
> >> >> >>
> >> >> >> struct drm_mode_crtc_atomic_page_flip {
> >> >> >>       uint32_t flags;
> >> >> >>       uint32_t count_crtcs;
> >> >> >>       uint64_t crtc_ids_ptr;  /* array of uint32_t */
> >> >> >>       uint64_t count_props_ptr; /* array of uint32_t, # of prop's per crtc */
> >> >> >>       uint64_t props_ptr;  /* ptr to array of drm_mode_obj_set_property */
> >> >> >>       uint64_t user_data;
> >> >> >> };
> >> >> >
> >> >> > Starting to look much like my drm_mode_atomic struct :)
> >> >> >
> >> >> > Let's compare:
> >> >> >
> >> >> > struct drm_mode_atomic {
> >> >> >         __u32 flags;
> >> >> >         __u32 count_objs;
> >> >> >         __u64 objs_ptr;
> >> >> >         __u64 count_props_ptr;
> >> >> >         __u64 props_ptr;
> >> >> >         __u64 prop_values_ptr;
> >> >> >         __u64 blob_values_ptr;
> >> >> > };
> >> >>
> >> >> well, you do miss userdata, I think
> >> >
> >> > Sure, because I didn't add the event stuff yet.
> >>
> >> note that the test phase doesn't need vblank events, and also
> >> shouldn't -EBUSY if there is still a pending flip[*],
> >
> > Right. Personally I'm not a fan of the EBUSY behaviour at all. Seems
> > a bit pointless since user space can take care of it via the event
> > mechanism. But I suppose you want it for omap so that you can avoid
> > having to write software workarounds to overcome the GO bit
> > limitations.
> 
> I think the main issue is disconnecting an overlay from one crtc and
> connecting to another.. I would expect that any hw which can connect
> an ovl to more than one possible crtc would have the same limit (ie.
> have to wait until scanout on previous crtc completes), so I think
> EBUSY is a good way to indicate to userspace that the requested
> configuration is not possible *now* but would be possible in the
> future.

Intel HW can do the transition automagically, but if you try to
combine it with other page flips, the driver would have to perform some
gymnastics to make things appear atomic. Of course if you tried to swap
overlay A from pipe 1 to pipe 2, and overlay B from pipe 2 to pipe 1 at
the same time, there's just no way to do that without sacrificing
atomicity on one of the pipes.

So even with such HW, it's probably easier to forget about the feature,
and require user space to perform the disable+enable sequence in two steps.
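
From the user space point of view the two-step sequence could look
something like the toy model below. None of this is a real driver
interface: `struct overlay` and the three functions are hypothetical, and
returning -EINVAL for a direct pipe-to-pipe move is just one policy I'm
assuming for the sake of the example. -EBUSY here means "not possible now,
try again after the old pipe's vblank".

```c
#include <assert.h>
#include <errno.h>

#define NO_PIPE (-1)

struct overlay {
	int pipe;          /* pipe the overlay is assigned to, or NO_PIPE */
	int scanout_busy;  /* nonzero while the old pipe still scans it out */
};

/* Step 1: disable. Takes effect at the old pipe's next vblank. */
static int overlay_disable(struct overlay *o)
{
	assert(o != 0);
	if (o->pipe != NO_PIPE) {
		o->pipe = NO_PIPE;
		o->scanout_busy = 1;
	}
	return 0;
}

/* Vblank on the old pipe: scanout of the overlay has finished. */
static void overlay_vblank(struct overlay *o)
{
	assert(o != 0);
	o->scanout_busy = 0;
}

/* Step 2: enable on a pipe. Direct moves are refused (assumed policy). */
static int overlay_enable(struct overlay *o, int pipe)
{
	assert(o != 0);
	if (o->scanout_busy)
		return -EBUSY;   /* disable still in flight, retry later */
	if (o->pipe != NO_PIPE && o->pipe != pipe)
		return -EINVAL;  /* must disable on the old pipe first */
	o->pipe = pipe;
	return 0;
}
```

User space would call overlay_disable(), wait for the flip/vblank event
from the old pipe, and only then call overlay_enable() for the new one.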

> >> >> Also, if you pageflip on multiple CRTC's, should there be multiple
> >> >> vblank events, and multiple userdata's?
> >> >
> >> > That's a bit of an open question. I was considering several options:
> >>
> >> the thing I like about one ioctl per crtc is that it avoids this whole
> >> question..
> >>
> >> And, I think as long as you have to update multiple different scanout
> >> address registers, there is always going to be a race in multi-crtc
> >> flipping.  Having a single ioctl does make the race smaller.  I'm not
> >> sure how important that point is.
> >
> > Which race?
> 
> ie. if you set REG_CRTC1_ADDR just immediately before vblank and
> REG_CRTC2_ADDR just after

Well, with unsynced crtcs I wouldn't call that any kind of meaningful race.
The same problem after all exists even with a single crtc. You either make
the deadline and write the register before vblank, or you don't make it
and end up with a repeated frame.
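
A trivial model of what I mean, assuming the hardware simply latches the
scanout address register at vblank. `struct crtc` and both functions are
invented for this illustration and don't correspond to any real register
layout.

```c
#include <assert.h>
#include <stdint.h>

struct crtc {
	uint32_t latched_addr;  /* address the hardware is scanning out */
	uint32_t pending_addr;  /* value last written to the address register */
	unsigned repeated;      /* frames that were shown more than once */
};

/* Software writes the new scanout address at some arbitrary time. */
static void crtc_write_addr(struct crtc *c, uint32_t addr)
{
	assert(c != 0);
	c->pending_addr = addr;
}

/* At vblank the hardware latches whatever is in the register. */
static void crtc_vblank(struct crtc *c)
{
	assert(c != 0);
	if (c->pending_addr == c->latched_addr)
		c->repeated++;  /* missed the deadline: old frame repeats */
	c->latched_addr = c->pending_addr;
}
```

Each crtc makes or misses its own deadline independently. A write that
lands just after vblank costs that crtc one repeated frame and takes
effect at its next vblank, whether or not another crtc is flipped from the
same ioctl.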

-- 
Ville Syrjälä
Intel OTC