[PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)

Fri Mar 13 21:52:59 UTC 2020

On Fri, Mar 13, 2020 at 5:02 PM Michel Dänzer <michel at daenzer.net> wrote:

> On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote:
> > On 2020-03-12 10:32 a.m., Alex Deucher wrote:
> >> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner
> >> <mario.kleiner.de at gmail.com> wrote:
> >>>
> >>> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user
> >>> events at vsartup for DCN")' introduces a new way of pageflip
> >>> completion handling for DCN, and some trouble.
> >>>
> >>> The current implementation introduces a race condition, which
> >>> can cause pageflip completion events to be sent out one vblank
> >>> too early, thereby confusing userspace and causing flicker:
> >>>
> >>> prepare_flip_isr():
> >>>
> >>> 1. Pageflip programming takes the ddev->event_lock.
> >>> 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED
> >>> 3. Releases ddev->event_lock.
> >>>
> >>> --> Deadline for surface address regs double-buffering passes on
> >>>      target pipe.
> >>>
> >>> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip
> >>>     into hw, but too late for current vblank.
> >>>
> >>> => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete
> >>>     in current vblank due to missing the double-buffering deadline
> >>>     by a tiny bit.
> >>>
> >>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires,
> >>>     dm_dcn_crtc_high_irq() gets called.
> >>>
> >>> 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the
> >>>     pageflip has been completed/will complete in this vblank and
> >>>     sends out pageflip completion event to userspace and resets
> >>>     pflip_status = AMDGPU_FLIP_NONE.
> >>>
> >>> => Flip completion event sent out one vblank too early.
> >>>
> >>> This behaviour has been observed during my testing with measurement
> >>> hardware a couple of time.
> >>>
> >>> The commit message says that the extra flip event code was added to
> >>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events
> >>> in case the pflip irq doesn't fire, because the "DCH HUBP" component
> >>> is clock gated and doesn't fire pflip irqs in that state. Also that
> >>> this clock gating may happen if no planes are active. According to
> >>> Nicholas, the clock gating can also happen if psr is active, and the
> >>> gating is controlled independently by the hardware, so difficult to
> >>> detect if and when the completion code in above commit is needed.
> >>>
> >>> This patch tries the following solution: It only executes the extra
> >>> pflip
> >>> completion code in dm_dcn_crtc_high_irq() iff the hardware reports
> >>> that there aren't any surface updated pending in the double-buffered
> >>> surface scanout address registers. Otherwise it leaves pflip completion
> >>> to the pflip irq handler, for a more race-free experience.
> >>>
> >>> This would only guard against the order of events mentioned above.
> >>> If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help
> >>> at all, because 1-3 + 5 might happen even without the hw being
> >>> programmed
> >>> at all, ie. no surface update pending because none yet programmed
> >>> into hw.
> >>>
> >>> Therefore this patch also changes locking in amdgpu_dm_commit_planes(),
> >>> so that prepare_flip_isr() and dc_commit_updates_for_stream() are done
> >>> under event_lock protection within the same critical section.
> >>>
> >>> v2: Take Nicholas comments into account, try a different solution.
> >>>
> >>> Lightly tested on Polaris (locking) and Raven (the whole DCN stuff).
> >>> Seems to work without causing obvious new trouble.
> >>
> >> Nick, any comments on this?  Can we get this committed or do you think
> >> it needs additional rework?
> >>
> >> Thanks,
> >>
> >> Alex
> >
> > Hi Alex, Mario,
> >
> > This might be a little strange, but if we want to get this in as a fix
> > for regressions caused by the original vblank and user events at
> > vstartup patch then I'm actually going to give my reviewed by on the
> > *v1* of this patch (but not this v2):
> >
> > Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas at amd.com>
> >
> > You can feel free to apply that one.
> >
> > Reason 1: After having thought about it some more I don't think we
> > enable anything today that has hubp powered down at the same time we
> > expect to be waiting for a flip - eg. DMCU powering down HUBP during PSR
> > entry. Static screen interrupt should happen after that flip finishes I
> > think.
> >
> > The CRTC can still be powered on with zero planes, and I don't think any
> > userspace explicitly asks for vblank events in this case but it doesn't
> > hurt to have the check.
> >
>

Ok, so the original commit that causes the races currently solves a
non-existing - but potential future - problem?

I guess then my v1 patch makes sense for the moment and fixes the immediate
problem for Linux 5.6-rc.

Maybe there's a way to ask the hw if hubp is clock-gated and so if that
code needs to run or not in the future?

As mentioned before, it would be helpful at least for me to get a better
overview about which hw events happen when in vblank, which irq's fire in
response etc., how this relates to things like power management actions,
psr, vrr / drr, etc. A lot has changed in the hw during the last 10 years,
and my guideline are still the public pdf files with the DCE register specs
for DCE-1'ish hw from sometime around the year 2007.

> Reason 2: This new patch will need much more thorough testing from side
> > to fully understand the consequences of locking the entire DC commit
> > sequence. For just a page flip that sounds fine, but for anything more
> > than (eg. full updates, modesets, etc) I don't think we want to be
> > disabling interrupts for potentially many milliseconds.
>
> Ah! I was wondering where the attached splat comes from, but I think
> this explains it: With this patch amdgpu_dm_commit_planes keeps the
> pcrtc->dev->event_lock spinlock locked while calling
> dc_commit_updates_for_stream, which ends up calling
> smu_set_display_count, which tries to lock a mutex.
>
>
Yep, sorry about that. My most modern machine has Raven Ridge / DCN1, and
that function only gets called on Navi /DCN2 afaics. All my testing is
limited to Polaris / DCE11 and Raven DCN1, my most modern hw atm.

thanks,
-mario

-- 
> Earthling Michel Dänzer               |               https://redhat.com
> Libre software enthusiast             |             Mesa and X developer
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20200313/211a7b5b/attachment.htm>