[PATCH] SWDEV-476969 - dm/amdgpu: Fail dm_atomic_check if cursor overlay is required at MAX_SURFACES

Melissa Wen mwen at igalia.com
Wed Nov 27 15:45:19 UTC 2024




On 18/11/2024 09:52, Melissa Wen wrote:
>
>
>
> On 14/11/2024 16:04, Mario Limonciello wrote:
>> Although it's really useful information for AMD people, the Jira 
>> shouldn't be in the "title" of the commit message.
>>
>> "If" we want to get into the habit of including this information for 
>> display code we should come up with a prescriptive field that goes 
>> into the commit message during promotion and it should be part of all 
>> patches in the promotion that have it.
>>
>> Something like this:
>>
>> AMD-Jira: SWDEV-476969
>>
>> Probably need to align that with other stakeholders though before 
>> starting that way.
>>
>> On 11/14/2024 08:37, Zaeem Mohamed wrote:
>>> [why]
>>> Prevent index-out-of-bounds due to requiring cursor overlay when
>>> plane_count is MAX_SURFACES.
>>>
>>> [how]
>>> Bounds check on plane_count when requiring overlay cursor.
>>>
>>
>> Any link to failing bugs or anything like that you can include?
> Hi Mario,
>
> About this question, these are the related issues:
> - Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3693
> - Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3594
>
> You can find more details in previous iterations related to this bug:
> - 
> https://lore.kernel.org/amd-gfx/575d66c7-e77d-42ea-acbf-412d6e508a0b@igalia.com/
> - 
> https://lore.kernel.org/amd-gfx/20240925154324.348774-1-mwen@igalia.com/
>>
>>> Co-developed-by: Melissa Wen <mwen at igalia.com>
>>> Signed-off-by: Zaeem Mohamed <zaeem.mohamed at amd.com>
>>
>> You're missing Melisaa's SoB for a co-developed patch.
>> IIRC this should fail checkpatch.
> I already mentioned before, I don't think I actually co-dev this code, 
> so Zaeem can remove it in the next iteration.
>
> ---
>
> BTW, about the implementation:
>
> As I don't have the proper environment, I asked reporters to check 
> this patch up and it doesn't help prevent interface freeze.
> It seems to prevent the out of bounds bug but is causing a page fault 
> now:
>
> kernel: BUG: unable to handle page fault for address: 000000000174e354
>
> From their feedback, I found curious that MAX_SURFACES -> 4 prevents 
> the freeze but not completely solve the problem.
> MAX_SURFACES -> 6 solves it, what let me consider that the 
> MAX_SURFACES vs MAX_SURFACE_NUM vs MAX_PLANE mismatch might be related.
> I'm going to analyzing the logs but you can find more details here: 
> https://gitlab.freedesktop.org/drm/amd/-/issues/3693#note_2658994

Hi Zaeem,

I managed to reproduce the interface freeze on kernel v6.11.4 (Fedora 
41) and
amd-staging-drm-next and I have some findings to share that might be 
relevant.

- I'm running Fedora 41 or Pop!_OS 22.04 with Cosmic on DCN301. The 
issue was
   also reported on DCN312 and DCN314.

- Steps to reproduce the interface freeze:
   1. Start a Cosmic session.
   2. Open firefox and drag to the right half of the desktop (multitask
      resizing).
   3. Open file manager and drag to the top-left quarter of the desktop. 
<- Here the interface freezes.
- Running v6.11.4 kernel (Fedora), we can see a bunch of
   array-index-out-of-bounds error on the dmesg, but the last error is 
actually
   the page-fault error reported by one user. Which means, the 
page-fault isn't
   related to Zaeem's patch.
   - dmesg: 
https://gitlab.freedesktop.org/-/project/4522/uploads/619a13e78cc5e9a645fb208ddff592b1/index-out-of-bounds-plus-page-fault-fedora41-cosmic-6-11-kernel.txt
- Running amd-staging-drm-next kernel, there is only the page-fault 
error, and
   no array-index-out-of-bounds. So the page-fault is the main issue 
since the
   beginning.
   - dmesg: 
https://gitlab.freedesktop.org/-/project/4522/uploads/f17b85e66f0a8c7e98013ad3b6f174f8/page-fault-fedora41-cosmic-asdn-kernel.txt

The stack trace of the page fault error [1] points to
commit_minimal_transition_state machinery, so I suspect the mismatch between
MAX_SURFACE and MAX_SURFACE_NUM may be related.
I'm investigating the problem and adding more info to the AMD issue:
- https://gitlab.freedesktop.org/drm/amd/-/issues/3693#note_2675384

[1]

[Nov26 21:33] BUG: unable to handle page fault for address: 0000000051d0f08b
[  +0.000015] #PF: supervisor read access in kernel mode
[  +0.000006] #PF: error_code(0x0000) - not-present page
[  +0.000005] PGD 0 P4D 0
[  +0.000007] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[  +0.000006] CPU: 4 PID: 71 Comm: kworker/u32:6 Not tainted 6.10.0+ #300
[  +0.000006] Hardware name: Valve Jupiter/Jupiter, BIOS F7A0131 01/30/2024
[  +0.000007] Workqueue: events_unbound commit_work [drm_kms_helper]
[  +0.000040] RIP: 0010:copy_stream_update_to_stream.isra.0+0x30d/0x750 
[amdgpu]
[  +0.000847] Code: 8b 10 49 89 94 24 f8 00 00 00 48 8b 50 08 49 89 94 
24 00 01 00 00 8b 40 10 41 89 84 24 08 01 00 00 49 8b 45 78 48 85 c0 74 
0b <0f> b6 00 41 88 84 24 90 64 00 00 49 8b 45 60 48 85 c0 74 3b 48 8b
[  +0.000010] RSP: 0018:ffffc203802f79a0 EFLAGS: 00010206
[  +0.000009] RAX: 0000000051d0f08b RBX: 0000000000000004 RCX: 
ffff9f964f0a8070
[  +0.000004] RDX: ffff9f9710f90e40 RSI: ffff9f96600c8000 RDI: 
ffff9f964f000000
[  +0.000004] RBP: ffffc203802f79f8 R08: 0000000000000000 R09: 
0000000000000000
[  +0.000005] R10: 0000000000000000 R11: 0000000000000000 R12: 
ffff9f96600c8000
[  +0.000004] R13: ffff9f9710f90e40 R14: ffff9f964f000000 R15: 
ffff9f96600c8000
[  +0.000004] FS:  0000000000000000(0000) GS:ffff9f9970000000(0000) 
knlGS:0000000000000000
[  +0.000005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000005] CR2: 0000000051d0f08b CR3: 00000002e6a20000 CR4: 
0000000000350ef0
[  +0.000005] Call Trace:
[  +0.000011]  <TASK>
[  +0.000010]  ? __die_body.cold+0x19/0x27
[  +0.000012]  ? page_fault_oops+0x15a/0x2d0
[  +0.000014]  ? exc_page_fault+0x7e/0x180
[  +0.000009]  ? asm_exc_page_fault+0x26/0x30
[  +0.000013]  ? copy_stream_update_to_stream.isra.0+0x30d/0x750 [amdgpu]
[  +0.000739]  ? dc_commit_state_no_check+0xd6c/0xe70 [amdgpu]
[  +0.000470]  update_planes_and_stream_state+0x49b/0x4f0 [amdgpu]
[  +0.000450]  ? srso_return_thunk+0x5/0x5f
[  +0.000009]  ? commit_minimal_transition_state+0x239/0x3d0 [amdgpu]
[  +0.000446]  update_planes_and_stream_v2+0x24a/0x590 [amdgpu]
[  +0.000464]  ? srso_return_thunk+0x5/0x5f
[  +0.000009]  ? sort+0x31/0x50
[  +0.000007]  ? amdgpu_dm_atomic_commit_tail+0x159f/0x3a30 [amdgpu]
[  +0.000508]  ? srso_return_thunk+0x5/0x5f
[  +0.000009]  ? amdgpu_crtc_get_scanout_position+0x28/0x40 [amdgpu]
[  +0.000377]  ? srso_return_thunk+0x5/0x5f
[  +0.000009]  ? 
drm_crtc_vblank_helper_get_vblank_timestamp_internal+0x160/0x390 [drm]
[  +0.000058]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? dma_fence_default_wait+0x8c/0x260
[  +0.000010]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? wait_for_completion_timeout+0x13b/0x170
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? dma_fence_wait_timeout+0x108/0x140
[  +0.000010]  ? commit_tail+0x94/0x130 [drm_kms_helper]
[  +0.000024]  ? process_one_work+0x177/0x330
[  +0.000008]  ? worker_thread+0x266/0x3a0
[  +0.000006]  ? __pfx_worker_thread+0x10/0x10
[  +0.000004]  ? kthread+0xd2/0x100
[  +0.000006]  ? __pfx_kthread+0x10/0x10
[  +0.000006]  ? ret_from_fork+0x34/0x50
[  +0.000004]  ? __pfx_kthread+0x10/0x10
[  +0.000005]  ? ret_from_fork_asm+0x1a/0x30
[  +0.000011]  </TASK>

>
> BR,
>
> Melissa Wen
>>
>>> ---
>>>   amdgpu_dm/amdgpu_dm.c | 10 +++++++++-
>>>   1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/amdgpu_dm/amdgpu_dm.c b/amdgpu_dm/amdgpu_dm.c
>>> index 97e0a1bbba..964497c613 100644
>>> --- a/amdgpu_dm/amdgpu_dm.c
>>> +++ b/amdgpu_dm/amdgpu_dm.c
>>> @@ -11821,8 +11821,16 @@ static int amdgpu_dm_atomic_check(struct 
>>> drm_device *dev,
>>>             /* Overlay cusor not subject to native cursor 
>>> restrictions */
>>>           dm_new_crtc_state = to_dm_crtc_state(new_crtc_state);
>>> -        if (dm_new_crtc_state->cursor_mode == DM_CURSOR_OVERLAY_MODE)
>>> +        if (dm_new_crtc_state->cursor_mode == 
>>> DM_CURSOR_OVERLAY_MODE) {
>>> +            if (dc->current_state->stream_status->plane_count > 
>>> MAX_SURFACES) {
>>> +                drm_dbg_driver(crtc->dev,
>>> +                       "Can't enable cursor plane with %d 
>>> planes\n", MAX_SURFACES);
>>> +                ret = -EINVAL;
>>> +                goto fail;
>>> +            }
>>> +
>>>               continue;
>>> +        }
>>>             /* Check if rotation or scaling is enabled on DCN401 */
>>>           if ((drm_plane_mask(crtc->cursor) & 
>>> new_crtc_state->plane_mask) &&
>>
>



More information about the amd-gfx mailing list