[PATCH] SWDEV-476969 - dm/amdgpu: Fail dm_atomic_check if cursor overlay is required at MAX_SURFACES
Melissa Wen
mwen at igalia.com
Wed Nov 27 15:45:19 UTC 2024
On 18/11/2024 09:52, Melissa Wen wrote:
>
>
>
> On 14/11/2024 16:04, Mario Limonciello wrote:
>> Although it's really useful information for AMD people, the Jira
>> shouldn't be in the "title" of the commit message.
>>
>> "If" we want to get into the habit of including this information for
>> display code we should come up with a prescriptive field that goes
>> into the commit message during promotion and it should be part of all
>> patches in the promotion that have it.
>>
>> Something like this:
>>
>> AMD-Jira: SWDEV-476969
>>
>> Probably need to align that with other stakeholders though before
>> starting that way.
>>
>> On 11/14/2024 08:37, Zaeem Mohamed wrote:
>>> [why]
>>> Prevent index-out-of-bounds due to requiring cursor overlay when
>>> plane_count is MAX_SURFACES.
>>>
>>> [how]
>>> Bounds check on plane_count when requiring overlay cursor.
>>>
>>
>> Any link to failing bugs or anything like that you can include?
> Hi Mario,
>
> About this question, these are the related issues:
> - Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3693
> - Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3594
>
> You can find more details in previous iterations related to this bug:
> -
> https://lore.kernel.org/amd-gfx/575d66c7-e77d-42ea-acbf-412d6e508a0b@igalia.com/
> -
> https://lore.kernel.org/amd-gfx/20240925154324.348774-1-mwen@igalia.com/
>>
>>> Co-developed-by: Melissa Wen <mwen at igalia.com>
>>> Signed-off-by: Zaeem Mohamed <zaeem.mohamed at amd.com>
>>
>> You're missing Melisaa's SoB for a co-developed patch.
>> IIRC this should fail checkpatch.
> I already mentioned before, I don't think I actually co-dev this code,
> so Zaeem can remove it in the next iteration.
>
> ---
>
> BTW, about the implementation:
>
> As I don't have the proper environment, I asked reporters to check
> this patch up and it doesn't help prevent interface freeze.
> It seems to prevent the out of bounds bug but is causing a page fault
> now:
>
> kernel: BUG: unable to handle page fault for address: 000000000174e354
>
> From their feedback, I found curious that MAX_SURFACES -> 4 prevents
> the freeze but not completely solve the problem.
> MAX_SURFACES -> 6 solves it, what let me consider that the
> MAX_SURFACES vs MAX_SURFACE_NUM vs MAX_PLANE mismatch might be related.
> I'm going to analyzing the logs but you can find more details here:
> https://gitlab.freedesktop.org/drm/amd/-/issues/3693#note_2658994
Hi Zaeem,
I managed to reproduce the interface freeze on kernel v6.11.4 (Fedora
41) and
amd-staging-drm-next and I have some findings to share that might be
relevant.
- I'm running Fedora 41 or Pop!_OS 22.04 with Cosmic on DCN301. The
issue was
also reported on DCN312 and DCN314.
- Steps to reproduce the interface freeze:
1. Start a Cosmic session.
2. Open firefox and drag to the right half of the desktop (multitask
resizing).
3. Open file manager and drag to the top-left quarter of the desktop.
<- Here the interface freezes.
- Running v6.11.4 kernel (Fedora), we can see a bunch of
array-index-out-of-bounds error on the dmesg, but the last error is
actually
the page-fault error reported by one user. Which means, the
page-fault isn't
related to Zaeem's patch.
- dmesg:
https://gitlab.freedesktop.org/-/project/4522/uploads/619a13e78cc5e9a645fb208ddff592b1/index-out-of-bounds-plus-page-fault-fedora41-cosmic-6-11-kernel.txt
- Running amd-staging-drm-next kernel, there is only the page-fault
error, and
no array-index-out-of-bounds. So the page-fault is the main issue
since the
beginning.
- dmesg:
https://gitlab.freedesktop.org/-/project/4522/uploads/f17b85e66f0a8c7e98013ad3b6f174f8/page-fault-fedora41-cosmic-asdn-kernel.txt
The stack trace of the page fault error [1] points to
commit_minimal_transition_state machinery, so I suspect the mismatch between
MAX_SURFACE and MAX_SURFACE_NUM may be related.
I'm investigating the problem and adding more info to the AMD issue:
- https://gitlab.freedesktop.org/drm/amd/-/issues/3693#note_2675384
[1]
[Nov26 21:33] BUG: unable to handle page fault for address: 0000000051d0f08b
[ +0.000015] #PF: supervisor read access in kernel mode
[ +0.000006] #PF: error_code(0x0000) - not-present page
[ +0.000005] PGD 0 P4D 0
[ +0.000007] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[ +0.000006] CPU: 4 PID: 71 Comm: kworker/u32:6 Not tainted 6.10.0+ #300
[ +0.000006] Hardware name: Valve Jupiter/Jupiter, BIOS F7A0131 01/30/2024
[ +0.000007] Workqueue: events_unbound commit_work [drm_kms_helper]
[ +0.000040] RIP: 0010:copy_stream_update_to_stream.isra.0+0x30d/0x750
[amdgpu]
[ +0.000847] Code: 8b 10 49 89 94 24 f8 00 00 00 48 8b 50 08 49 89 94
24 00 01 00 00 8b 40 10 41 89 84 24 08 01 00 00 49 8b 45 78 48 85 c0 74
0b <0f> b6 00 41 88 84 24 90 64 00 00 49 8b 45 60 48 85 c0 74 3b 48 8b
[ +0.000010] RSP: 0018:ffffc203802f79a0 EFLAGS: 00010206
[ +0.000009] RAX: 0000000051d0f08b RBX: 0000000000000004 RCX:
ffff9f964f0a8070
[ +0.000004] RDX: ffff9f9710f90e40 RSI: ffff9f96600c8000 RDI:
ffff9f964f000000
[ +0.000004] RBP: ffffc203802f79f8 R08: 0000000000000000 R09:
0000000000000000
[ +0.000005] R10: 0000000000000000 R11: 0000000000000000 R12:
ffff9f96600c8000
[ +0.000004] R13: ffff9f9710f90e40 R14: ffff9f964f000000 R15:
ffff9f96600c8000
[ +0.000004] FS: 0000000000000000(0000) GS:ffff9f9970000000(0000)
knlGS:0000000000000000
[ +0.000005] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000005] CR2: 0000000051d0f08b CR3: 00000002e6a20000 CR4:
0000000000350ef0
[ +0.000005] Call Trace:
[ +0.000011] <TASK>
[ +0.000010] ? __die_body.cold+0x19/0x27
[ +0.000012] ? page_fault_oops+0x15a/0x2d0
[ +0.000014] ? exc_page_fault+0x7e/0x180
[ +0.000009] ? asm_exc_page_fault+0x26/0x30
[ +0.000013] ? copy_stream_update_to_stream.isra.0+0x30d/0x750 [amdgpu]
[ +0.000739] ? dc_commit_state_no_check+0xd6c/0xe70 [amdgpu]
[ +0.000470] update_planes_and_stream_state+0x49b/0x4f0 [amdgpu]
[ +0.000450] ? srso_return_thunk+0x5/0x5f
[ +0.000009] ? commit_minimal_transition_state+0x239/0x3d0 [amdgpu]
[ +0.000446] update_planes_and_stream_v2+0x24a/0x590 [amdgpu]
[ +0.000464] ? srso_return_thunk+0x5/0x5f
[ +0.000009] ? sort+0x31/0x50
[ +0.000007] ? amdgpu_dm_atomic_commit_tail+0x159f/0x3a30 [amdgpu]
[ +0.000508] ? srso_return_thunk+0x5/0x5f
[ +0.000009] ? amdgpu_crtc_get_scanout_position+0x28/0x40 [amdgpu]
[ +0.000377] ? srso_return_thunk+0x5/0x5f
[ +0.000009] ?
drm_crtc_vblank_helper_get_vblank_timestamp_internal+0x160/0x390 [drm]
[ +0.000058] ? srso_return_thunk+0x5/0x5f
[ +0.000005] ? dma_fence_default_wait+0x8c/0x260
[ +0.000010] ? srso_return_thunk+0x5/0x5f
[ +0.000005] ? wait_for_completion_timeout+0x13b/0x170
[ +0.000006] ? srso_return_thunk+0x5/0x5f
[ +0.000005] ? dma_fence_wait_timeout+0x108/0x140
[ +0.000010] ? commit_tail+0x94/0x130 [drm_kms_helper]
[ +0.000024] ? process_one_work+0x177/0x330
[ +0.000008] ? worker_thread+0x266/0x3a0
[ +0.000006] ? __pfx_worker_thread+0x10/0x10
[ +0.000004] ? kthread+0xd2/0x100
[ +0.000006] ? __pfx_kthread+0x10/0x10
[ +0.000006] ? ret_from_fork+0x34/0x50
[ +0.000004] ? __pfx_kthread+0x10/0x10
[ +0.000005] ? ret_from_fork_asm+0x1a/0x30
[ +0.000011] </TASK>
>
> BR,
>
> Melissa Wen
>>
>>> ---
>>> amdgpu_dm/amdgpu_dm.c | 10 +++++++++-
>>> 1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/amdgpu_dm/amdgpu_dm.c b/amdgpu_dm/amdgpu_dm.c
>>> index 97e0a1bbba..964497c613 100644
>>> --- a/amdgpu_dm/amdgpu_dm.c
>>> +++ b/amdgpu_dm/amdgpu_dm.c
>>> @@ -11821,8 +11821,16 @@ static int amdgpu_dm_atomic_check(struct
>>> drm_device *dev,
>>> /* Overlay cusor not subject to native cursor
>>> restrictions */
>>> dm_new_crtc_state = to_dm_crtc_state(new_crtc_state);
>>> - if (dm_new_crtc_state->cursor_mode == DM_CURSOR_OVERLAY_MODE)
>>> + if (dm_new_crtc_state->cursor_mode ==
>>> DM_CURSOR_OVERLAY_MODE) {
>>> + if (dc->current_state->stream_status->plane_count >
>>> MAX_SURFACES) {
>>> + drm_dbg_driver(crtc->dev,
>>> + "Can't enable cursor plane with %d
>>> planes\n", MAX_SURFACES);
>>> + ret = -EINVAL;
>>> + goto fail;
>>> + }
>>> +
>>> continue;
>>> + }
>>> /* Check if rotation or scaling is enabled on DCN401 */
>>> if ((drm_plane_mask(crtc->cursor) &
>>> new_crtc_state->plane_mask) &&
>>
>
More information about the amd-gfx
mailing list