[RFC 0/9] nuclear pageflip

Fri Sep 14 07:45:18 PDT 2012

On Fri, Sep 14, 2012 at 8:58 AM, Ville Syrjälä
<ville.syrjala at linux.intel.com> wrote:
> On Fri, Sep 14, 2012 at 08:25:53AM -0500, Rob Clark wrote:
>> On Fri, Sep 14, 2012 at 7:50 AM, Ville Syrjälä
>> <ville.syrjala at linux.intel.com> wrote:
>> > On Thu, Sep 13, 2012 at 11:35:59AM -0500, Rob Clark wrote:
>> >> On Thu, Sep 13, 2012 at 9:29 AM, Ville Syrjälä
>> >> <ville.syrjala at linux.intel.com> wrote:
>> >> > On Thu, Sep 13, 2012 at 08:39:54AM -0500, Rob Clark wrote:
>> >> >> On Thu, Sep 13, 2012 at 3:40 AM, Ville Syrjälä
>> >> >> <ville.syrjala at linux.intel.com> wrote:
>> [snip]
>> >> >> >
>> >> >> > I would say this is going to be the most common use case if you consider
>> >> >> > just the number of shipping devices. It's pretty much what every Android
>> >> >> > phone/tablet with a HDMI port has to do.
>> >> >>
>> >> >> bleh, surfaceflinger kinda sucks then..
>> >> >
>> >> > Why? This use case is not enforced by surfaceflinger, it's just the use
>> >> > case most devices would have.
>> >> >
>> >> > I don't think there's anything wrong with the way surfaceflinger is designed
>> >> > with the prepare and commit phases. How else would you do it?
>> >>
>> >> well, maybe I misunderstood how surfaceflinger works, but it sounded
>> >> like it has one prepare/commit phase across outputs, vs what weston
>> >> compositor does where each output is rendered and flipped
>> >> independently at the rate of that particular output.  If the two
>> >> outputs just happen to be vsync aligned, you would end up flipping at
>> >> the same time, but if they are not locked you don't have any artificial
>> >> constraint in the rendering/flipping.
>> >
>> > OK so it's purely a pull based model, whereas surfaceflinger is more
>> > push based.
>> >
>> > I suppose it might be possible to make surfaceflinger support a pull
>> > model by driving the compositor loop through a combined signal from
>> > multiple outputs. But IIRC it did have some timing related code in
>> > there somewhere, so it might not be happy about it. It might also
>>
>> As I understood, at least in older android versions,
>> rendering was based on a timer as there was no vblank event to
>> userspace on most SoC platforms (which sounds strange, but so far most
>> SoC's are using fbdev and/or crazy hacks rather than drm/kms)
>>
>> not sure if the timer is still there.. but I hope it goes away, it is
>> really a horrible way to keep track of vsync
>
> I've only looked at ICS in any detail. At least there we used the page
> flip event from one display to set the pace of the compositor loop.
> IIRC JB is supposed to have some vsync related changes, but I haven't
> looked at the code.
>
>> > affect the clients' rendering speed since the compositor would be
>> > pulling their buffers from queue at non-constant speed. I don't
>> > remember the details of the buffer management very well, so I can't be
>> > sure though. But I probably wouldn't bother trying this, since the
>> > straightforward approach is so simple, and the results are reasonably
>> > good.
>> >
>> > The pull model does seem more flexible. But it does require a bit of
>> > extra complexity in the compositor to avoid compositing the same scene
>> > multiple times needlessly when multiple cloned displays are involved.
>> > I suppose ideally you'd want to recompose for each display to minimize
>> > visible latency, but from power usage POV it may not be a good idea.
>>
>> fwiw, weston is already being pretty clever about keeping track of
>> damage and minimizing the area of the screen that must be re-rendered.
>>  I'm not sure if SF does anything like this.
>
> IIRC it can do that, but the EGL implementation needs to support
> EGL_BUFFER_PRESERVED.
>
> I suppose the best way to implement EGL_BUFFER_PRESERVED with
> page flips would be to schedule the flip and immediately perform
> a blit from the new front buffer to the new back buffer. Well,
> unless the hardware has some more clever mechanism for it.
>
> Does weston depend on preserved flips too, or can it even track
> damage independently for each buffer?

well, weston knows how many buffers are at play.  So it takes the
union of the damage from the last time the buffer was used (well,
currently it assumes only double buffering) and the new damage.  This
way the gl driver, which doesn't know what is going on as well as the
app does, doesn't need to do a back-blit.  Weston can do this because
w/ the drm/gbm egl winsys, eglSwapBuffers() doesn't actually swap the
buffers on the display; weston is in charge of which buffer is
displayed and which is rendered to, and it explicitly calls the page
flip ioctl.  The good news is that it can atomically flip overlay
layers at the same time once the new ioctl is in place.
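
A rough sketch of the damage bookkeeping described above, for the
double-buffered case only.  This is not weston's code: the names
(struct rect, struct output_state, output_repaint_area) are made up
for illustration, and a bounding box stands in for a real region
union to keep it short.

```c
#include <assert.h>
#include <stdbool.h>

/* Axis-aligned rectangle; x2/y2 are exclusive.  Hypothetical type. */
struct rect { int x1, y1, x2, y2; };

static bool rect_is_empty(struct rect r)
{
	return r.x1 >= r.x2 || r.y1 >= r.y2;
}

/* Bounding-box union: a simplification of a true region union. */
static struct rect rect_union(struct rect a, struct rect b)
{
	struct rect r;

	if (rect_is_empty(a))
		return b;
	if (rect_is_empty(b))
		return a;

	r.x1 = a.x1 < b.x1 ? a.x1 : b.x1;
	r.y1 = a.y1 < b.y1 ? a.y1 : b.y1;
	r.x2 = a.x2 > b.x2 ? a.x2 : b.x2;
	r.y2 = a.y2 > b.y2 ? a.y2 : b.y2;
	return r;
}

/* Per-output state: damage that went into the previous frame, i.e.
 * what the buffer we are about to render into is still missing. */
struct output_state {
	struct rect prev_damage;
};

/* Area to repaint in the back buffer for this frame: the previous
 * frame's damage (missing from this buffer) plus the new damage.
 * Assumes exactly two buffers that strictly alternate. */
static struct rect output_repaint_area(struct output_state *out,
				       struct rect new_damage)
{
	struct rect repaint = rect_union(out->prev_damage, new_damage);

	out->prev_damage = new_damage;
	return repaint;
}
```

With more than two buffers the same idea would need a short history
of per-frame damage rather than a single prev_damage.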

Maybe it is useful to look at http://github.com/robclark/kmscube .. it
doesn't actually use planes, but shows the interaction of egl and kms.
 Maybe I should enhance it w/ multiple rotating cubes on different
overlays. ;-)

>> >> >> >> From userspace API, I guess something like:
>> >> >> >>
>> >> >> >> struct drm_mode_crtc_atomic_page_flip {
>> >> >> >>       uint32_t flags;
>> >> >> >>       uint32_t count_crtcs;
>> >> >> >>       uint64_t crtc_ids_ptr;  /* array of uint32_t */
>> >> >> >>       uint64_t count_props_ptr; /* array of uint32_t, # of prop's per crtc */
>> >> >> >>       uint64_t props_ptr;  /* ptr to array of drm_mode_obj_set_property */
>> >> >> >>       uint64_t user_data;
>> >> >> >> };
>> >> >> >
>> >> >> > Starting to look much like my drm_mode_atomic struct :)
>> >> >> >
>> >> >> > Let's compare:
>> >> >> >
>> >> >> > struct drm_mode_atomic {
>> >> >> >         __u32 flags;
>> >> >> >         __u32 count_objs;
>> >> >> >         __u64 objs_ptr;
>> >> >> >         __u64 count_props_ptr;
>> >> >> >         __u64 props_ptr;
>> >> >> >         __u64 prop_values_ptr;
>> >> >> >         __u64 blob_values_ptr;
>> >> >> > };
>> >> >>
>> >> >> well, you do miss userdata, I think
>> >> >
>> >> > Sure, because I didn't add the event stuff yet.
>> >>
>> >> note that the test phase doesn't need vblank events, and also
>> >> shouldn't -EBUSY if there is still a pending flip[*],
>> >
>> > Right. Personally I'm not a fan of the EBUSY behaviour at all. Seems
>> > a bit pointless since user space can take care of it via the event
>> > mechanism. But I suppose you want it for omap so that you can avoid
>> > having to write software workarounds to overcome the GO bit
>> > limitations.
>>
>> I think the main issue is disconnecting an overlay from one crtc and
>> connecting to another.. I would expect that any hw which can connect
>> an ovl to more than one possible crtc would have the same limit (ie.
>> have to wait until scanout on previous crtc completes), so I think
>> EBUSY is a good way to indicate to userspace that the requested
>> configuration is not possible *now* but would be possible in the
>> future.
>
> Intel HW can do the transition automagically, but if you try to
> combine it with other page flips, the driver would have to perform some
> gymnastics to make things appear atomic. Of course if you'd try to swap
> overlay A from pipe 1 to pipe 2, and overlay B from pipe 2 to pipe 1 at
> the same time, there's just no way to do that without sacrificing
> atomicity on one of the pipes.
>
> So even with such HW, it's probably easier to forget about the feature,
> and require user space to perform the disable+enable sequence in two steps.

true, but I don't want to block the disable until vblank w/
atomic-pageflip, and if userspace re-enables the plane on a different
crtc before the next vblank, it would be useful for the driver to have
a way to say 'try again later'.

And if we do support multiple crtc's w/ pageflip, I'm not sure if
there is a good way to enforce two-steps.  Having a standardized way
to tell userspace to try later seems like a good thing.
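
To illustrate what 'try again later' could look like from the
userspace side, here is a minimal sketch.  Everything in it is an
assumption: submit_fn and wait_vblank_fn are hypothetical hooks
standing in for "issue the atomic flip" and "wait for the next
vblank/flip event", and fake_driver only exists so the retry logic
can be exercised without real hw.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical hooks, not part of any existing API. */
typedef int (*submit_fn)(void *ctx);        /* 0, -EBUSY, or other -errno */
typedef void (*wait_vblank_fn)(void *ctx);  /* blocks until next vblank */

/* Retry the flip while the driver reports -EBUSY, waiting one vblank
 * between attempts.  Any other result is returned immediately. */
static int flip_with_retry(submit_fn submit, wait_vblank_fn wait_vblank,
			   void *ctx, int max_tries)
{
	int ret = -EBUSY;
	int i;

	for (i = 0; i < max_tries; i++) {
		ret = submit(ctx);
		if (ret != -EBUSY)
			break;
		if (i + 1 < max_tries)
			wait_vblank(ctx);
	}
	return ret;
}

/* Stand-in driver: the plane stays busy on its old crtc for
 * busy_frames more vblanks. */
struct fake_driver {
	int busy_frames;
	int vblanks_waited;
};

static int fake_submit(void *ctx)
{
	struct fake_driver *d = ctx;

	return d->busy_frames > 0 ? -EBUSY : 0;
}

static void fake_wait_vblank(void *ctx)
{
	struct fake_driver *d = ctx;

	d->busy_frames--;
	d->vblanks_waited++;
}
```

A real compositor would more likely defer the retry to its event
loop than block like this, but the contract is the same: -EBUSY
means "not possible now", not "not possible".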

>
>> >> >> Also, if you pageflip on multiple CRTC's, should there be multiple
>> >> >> vblank events, and multiple userdata's?
>> >> >
>> >> > That's a bit of an open question. I was considering several options:
>> >>
>> >> the thing I like about one ioctl per crtc is that it avoids this whole
>> >> question..
>> >>
>> >> And, I think as long as you have to update multiple different scanout
>> >> address registers, there is always going to be a race in multi-crtc
>> >> flipping.  Having a single ioctl does make the race smaller.  I'm not
>> >> sure how important that point is.
>> >
>> > Which race?
>>
>> ie. if you set REG_CRTC1_ADDR just immediately before vblank and
>> REG_CRTC2_ADDR just after
>
> Well, with unsynced crtcs I wouldn't call that any kind of meaningful race.
> The same problem after all exists even with a single crtc. You either make
> the deadline and write the register before vblank, or you don't make it
> and end up with a repeated frame.

I meant w/ sync'd crtc's, there is still no 100% guarantee that the
two flip at the same time.  With unsync'd crtc's there is no point for
the single ioctl.

BR,
-R

> --
> Ville Syrjälä
> Intel OTC