[PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)

Tue May 5 17:59:52 UTC 2020

Can you file a full bug report on the gitlab tracker?

FreeSync is still working on my Navi setups with this patch applied, and 
this patch is essentially just a revert of another patch already (where 
FreeSync worked before).

I can understand the v2 of this series causing issues, but the v1 
shouldn't be - so I'd like to understand more about the setup where this 
is causing issues - ASIC, OS, compositor, displays, dmesg log, X log, etc.

Regards,
Nicholas Kazlauskas

On 2020-05-05 1:03 p.m., Alex Deucher wrote:
> Mario or Nick any thoughts?
> 
> Alex
> 
> On Mon, May 4, 2020 at 1:35 PM Matt Coffin <mcoffin13 at gmail.com> wrote:
>>
>> Hey guys,
>>
>> This is still an issue for me, and I'm still having to run a patch to
>> revert this as of 5.7-rc4. To avoid breaking a lot of people's Navi
>> setups in 5.7, is there any news on this? Has anyone else at the very
>> least been able to reproduce the problem?
>>
>> It happens for me in every single program that mesa allows to utilize
>> variable refresh rates, and reverting it "fixes" the issue.
>>
>> Cheers, and sorry for the extra email, just making sure this is still on
>> someone's radar,
>> Matt
>>
>> On 4/14/20 5:32 PM, Matt Coffin wrote:
>>> Hey everyone,
>>>
>>> This patch broke variable refresh rate in games (all that I've tried so
>>> far... Project CARS 2, DiRT Rally 2.0, Assetto Corsa Competizione) as
>>> well as a simple freesync tester application.
>>>
>>> FreeSync tester I've been using: https://github.com/Nixola/VRRTest
>>>
>>> I'm not at all familiar with the page flipping code, so it would take me
>>> a long time to find the *right* way to fix it, but does someone else see
>>> why it would do that?
>>>
>>> The symptom is that the refresh rate of the display constantly bounces
>>> between the two ends of the FreeSync range (for me 40 -> 144), and the
>>> game stutters like a madman.
>>>
>>> Any help on where to start, ideas on how to fix it (other than just
>>> revert this commit, which I've done in the interim), or alternative
>>> patches would be appreciated.
>>>
>>> Thanks in advance for the work/help,
>>> Matt
>>>
>>> On 3/13/20 8:42 AM, Michel Dänzer wrote:
>>>> On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote:
>>>>> On 2020-03-12 10:32 a.m., Alex Deucher wrote:
>>>>>> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner
>>>>>> <mario.kleiner.de at gmail.com> wrote:
>>>>>>>
>>>>>>> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user
>>>>>>> events at vsartup for DCN")' introduces a new way of pageflip
>>>>>>> completion handling for DCN, and some trouble.
>>>>>>>
>>>>>>> The current implementation introduces a race condition, which
>>>>>>> can cause pageflip completion events to be sent out one vblank
>>>>>>> too early, thereby confusing userspace and causing flicker:
>>>>>>>
>>>>>>> prepare_flip_isr():
>>>>>>>
>>>>>>> 1. Pageflip programming takes the ddev->event_lock.
>>>>>>> 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED
>>>>>>> 3. Releases ddev->event_lock.
>>>>>>>
>>>>>>> --> Deadline for surface address regs double-buffering passes on
>>>>>>>       target pipe.
>>>>>>>
>>>>>>> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip
>>>>>>>      into hw, but too late for current vblank.
>>>>>>>
>>>>>>> => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete
>>>>>>>      in current vblank due to missing the double-buffering deadline
>>>>>>>      by a tiny bit.
>>>>>>>
>>>>>>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires,
>>>>>>>      dm_dcn_crtc_high_irq() gets called.
>>>>>>>
>>>>>>> 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the
>>>>>>>      pageflip has been completed/will complete in this vblank and
>>>>>>>      sends out pageflip completion event to userspace and resets
>>>>>>>      pflip_status = AMDGPU_FLIP_NONE.
>>>>>>>
>>>>>>> => Flip completion event sent out one vblank too early.
>>>>>>>
>>>>>>> This behaviour has been observed during my testing with measurement
>>>>>>> hardware a couple of time.
>>>>>>>
>>>>>>> The commit message says that the extra flip event code was added to
>>>>>>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events
>>>>>>> in case the pflip irq doesn't fire, because the "DCH HUBP" component
>>>>>>> is clock gated and doesn't fire pflip irqs in that state. Also that
>>>>>>> this clock gating may happen if no planes are active. According to
>>>>>>> Nicholas, the clock gating can also happen if psr is active, and the
>>>>>>> gating is controlled independently by the hardware, so difficult to
>>>>>>> detect if and when the completion code in above commit is needed.
>>>>>>>
>>>>>>> This patch tries the following solution: It only executes the extra
>>>>>>> pflip
>>>>>>> completion code in dm_dcn_crtc_high_irq() iff the hardware reports
>>>>>>> that there aren't any surface updated pending in the double-buffered
>>>>>>> surface scanout address registers. Otherwise it leaves pflip completion
>>>>>>> to the pflip irq handler, for a more race-free experience.
>>>>>>>
>>>>>>> This would only guard against the order of events mentioned above.
>>>>>>> If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help
>>>>>>> at all, because 1-3 + 5 might happen even without the hw being
>>>>>>> programmed
>>>>>>> at all, ie. no surface update pending because none yet programmed
>>>>>>> into hw.
>>>>>>>
>>>>>>> Therefore this patch also changes locking in amdgpu_dm_commit_planes(),
>>>>>>> so that prepare_flip_isr() and dc_commit_updates_for_stream() are done
>>>>>>> under event_lock protection within the same critical section.
>>>>>>>
>>>>>>> v2: Take Nicholas comments into account, try a different solution.
>>>>>>>
>>>>>>> Lightly tested on Polaris (locking) and Raven (the whole DCN stuff).
>>>>>>> Seems to work without causing obvious new trouble.
>>>>>>
>>>>>> Nick, any comments on this?  Can we get this committed or do you think
>>>>>> it needs additional rework?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Alex
>>>>>
>>>>> Hi Alex, Mario,
>>>>>
>>>>> This might be a little strange, but if we want to get this in as a fix
>>>>> for regressions caused by the original vblank and user events at
>>>>> vstartup patch then I'm actually going to give my reviewed by on the
>>>>> *v1* of this patch (but not this v2):
>>>>>
>>>>> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas at amd.com>
>>>>>
>>>>> You can feel free to apply that one.
>>>>>
>>>>> Reason 1: After having thought about it some more I don't think we
>>>>> enable anything today that has hubp powered down at the same time we
>>>>> expect to be waiting for a flip - eg. DMCU powering down HUBP during PSR
>>>>> entry. Static screen interrupt should happen after that flip finishes I
>>>>> think.
>>>>>
>>>>> The CRTC can still be powered on with zero planes, and I don't think any
>>>>> userspace explicitly asks for vblank events in this case but it doesn't
>>>>> hurt to have the check.
>>>>>
>>>>> Reason 2: This new patch will need much more thorough testing from side
>>>>> to fully understand the consequences of locking the entire DC commit
>>>>> sequence. For just a page flip that sounds fine, but for anything more
>>>>> than (eg. full updates, modesets, etc) I don't think we want to be
>>>>> disabling interrupts for potentially many milliseconds.
>>>>
>>>> Ah! I was wondering where the attached splat comes from, but I think
>>>> this explains it: With this patch amdgpu_dm_commit_planes keeps the
>>>> pcrtc->dev->event_lock spinlock locked while calling
>>>> dc_commit_updates_for_stream, which ends up calling
>>>> smu_set_display_count, which tries to lock a mutex.
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>