[PATCH 05/18] drm/amd/display: Use mdelay to avoid crashes

Harry Wentland harry.wentland at amd.com
Fri Dec 16 19:47:58 UTC 2022



On 12/15/22 16:02, Alex Hung wrote:
> 
> 
> On 2022-12-15 08:17, Harry Wentland wrote:
>>
>>
>> On 12/15/22 05:29, Michel Dänzer wrote:
>>> On 12/15/22 09:09, Christian König wrote:
>>>> Am 15.12.22 um 00:33 schrieb Alex Hung:
>>>>> On 2022-12-14 16:06, Alex Deucher wrote:
>>>>>> On Wed, Dec 14, 2022 at 5:56 PM Alex Hung <alex.hung at amd.com> wrote:
>>>>>>> On 2022-12-14 15:35, Alex Deucher wrote:
>>>>>>>> On Wed, Dec 14, 2022 at 5:25 PM Alex Hung <alex.hung at amd.com> wrote:
>>>>>>>>> On 2022-12-14 14:54, Alex Deucher wrote:
>>>>>>>>>> On Wed, Dec 14, 2022 at 4:50 PM Alex Hung <alex.hung at amd.com> wrote:
>>>>>>>>>>> On 2022-12-14 13:48, Alex Deucher wrote:
>>>>>>>>>>>> On Wed, Dec 14, 2022 at 3:22 PM Aurabindo Pillai
>>>>>>>>>>>> <aurabindo.pillai at amd.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> From: Alex Hung <alex.hung at amd.com>
>>>>>>>>>>>>>

<snip>

>>
>> It can come through handle_hpd_rx_irq but we're using a workqueue
>> to queue interrupt handling so this shouldn't come from an atomic
>> context. I currently don't see where else it might be used in an
>> atomic context. Alex Hung, can you do a dump_stack() in this function
>> to see where the problematic call is coming from?
> 
> 
> IGT's kms_bw executes as below (when passing)
> 
> IGT-Version: 1.26-gf4067678 (x86_64) (Linux: 5.19.0-99-custom x86_64)
> Starting subtest: linear-tiling-1-displays-1920x1080p
> Subtest linear-tiling-1-displays-1920x1080p: SUCCESS (0.225s)
> Starting subtest: linear-tiling-1-displays-2560x1440p
> Subtest linear-tiling-1-displays-2560x1440p: SUCCESS (0.111s)
> Starting subtest: linear-tiling-1-displays-3840x2160p
> Subtest linear-tiling-1-displays-3840x2160p: SUCCESS (0.118s)
> Starting subtest: linear-tiling-2-displays-1920x1080p
> Subtest linear-tiling-2-displays-1920x1080p: SUCCESS (0.409s)
> Starting subtest: linear-tiling-2-displays-2560x1440p
> Subtest linear-tiling-2-displays-2560x1440p: SUCCESS (0.417s)
> Starting subtest: linear-tiling-2-displays-3840x2160p
> Subtest linear-tiling-2-displays-3840x2160p: SUCCESS (0.444s)
> Starting subtest: linear-tiling-3-displays-1920x1080p
> Subtest linear-tiling-3-displays-1920x1080p: SUCCESS (0.547s)
> Starting subtest: linear-tiling-3-displays-2560x1440p
> Subtest linear-tiling-3-displays-2560x1440p: SUCCESS (0.555s)
> Starting subtest: linear-tiling-3-displays-3840x2160p
> Subtest linear-tiling-3-displays-3840x2160p: SUCCESS (0.586s)
> Starting subtest: linear-tiling-4-displays-1920x1080p
> Subtest linear-tiling-4-displays-1920x1080p: SUCCESS (0.734s)
> Starting subtest: linear-tiling-4-displays-2560x1440p
> Subtest linear-tiling-4-displays-2560x1440p: SUCCESS (0.742s)
> Starting subtest: linear-tiling-4-displays-3840x2160p
> Subtest linear-tiling-4-displays-3840x2160p: SUCCESS (0.778s)
> Starting subtest: linear-tiling-5-displays-1920x1080p
> Subtest linear-tiling-5-displays-1920x1080p: SUCCESS (0.734s)
> Starting subtest: linear-tiling-5-displays-2560x1440p
> Subtest linear-tiling-5-displays-2560x1440p: SUCCESS (0.743s)
> Starting subtest: linear-tiling-5-displays-3840x2160p
> Subtest linear-tiling-5-displays-3840x2160p: SUCCESS (0.781s)
> Starting subtest: linear-tiling-6-displays-1920x1080p
> Test requirement not met in function run_test_linear_tiling, file ../tests/kms_bw.c:156:
> Test requirement: !(pipe > num_pipes)
> ASIC does not have 5 pipes

Does this IGT patch fix the !(pipe > num_pipes) issue?

https://gitlab.freedesktop.org/hwentland/igt-gpu-tools/-/commit/3bb9ee157642ec2433f01b51e09866da2b0f3dd8
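
For context, my reading of that change is simply that the framebuffers get
removed in reverse creation order (as mentioned below with buffers[i]), so the
outputs are torn down last-to-first instead of first-to-last. Roughly, the idea
on the test side would look like the sketch below (helper and variable names
are illustrative, not the literal kms_bw diff):

#include "igt.h"

/*
 * Sketch only: tear the fbs down in reverse creation order so the
 * last-enabled output is disabled first.  remove_fbs_reversed() and the
 * buffers[] array are placeholders for whatever kms_bw actually uses.
 */
static void remove_fbs_reversed(int drm_fd, struct igt_fb *buffers, int count)
{
	int i;

	for (i = count - 1; i >= 0; i--)
		igt_remove_fb(drm_fd, &buffers[i]);
}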

> Subtest linear-tiling-6-displays-1920x1080p: SKIP (0.000s)
> Starting subtest: linear-tiling-6-displays-2560x1440p
> Test requirement not met in function run_test_linear_tiling, file ../tests/kms_bw.c:156:
> Test requirement: !(pipe > num_pipes)
> ASIC does not have 5 pipes
> Subtest linear-tiling-6-displays-2560x1440p: SKIP (0.000s)
> Starting subtest: linear-tiling-6-displays-3840x2160p
> Test requirement not met in function run_test_linear_tiling, file ../tests/kms_bw.c:156:
> Test requirement: !(pipe > num_pipes)
> ASIC does not have 5 pipes
> Subtest linear-tiling-6-displays-3840x2160p: SKIP (0.000s)
> 
> The crash usually occurs when executing "linear-tiling-3-displays-1920x1080p" most of the time, but it can also occur at "linear-tiling-3-displays-2560x1440p".
> 
> ============
> This is the dump_stack output right before the failing msleep.
> 
> [IGT] kms_bw: starting subtest linear-tiling-3-displays-1920x1080p
> CPU: 1 PID: 76 Comm: kworker/1:1 Not tainted 5.19.0-99-custom #126
> Workqueue: events drm_mode_rmfb_work_fn [drm]
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0x49/0x63
>  dump_stack+0x10/0x16
>  dce110_blank_stream.cold+0x5/0x14 [amdgpu]
>  core_link_disable_stream+0xe0/0x6b0 [amdgpu]
>  ? optc1_set_vtotal_min_max+0x6b/0x80 [amdgpu]
>  dcn31_reset_hw_ctx_wrap+0x229/0x410 [amdgpu]
>  dce110_apply_ctx_to_hw+0x6e/0x6c0 [amdgpu]
>  ? dcn20_plane_atomic_disable+0xb2/0x160 [amdgpu]
>  ? dcn20_disable_plane+0x2c/0x60 [amdgpu]
>  ? dcn20_post_unlock_program_front_end+0x77/0x2c0 [amdgpu]
>  dc_commit_state_no_check+0x39a/0xcd0 [amdgpu]
>  ? dc_validate_global_state+0x2ba/0x330 [amdgpu]
>  dc_commit_state+0x10f/0x150 [amdgpu]
>  amdgpu_dm_atomic_commit_tail+0x631/0x2d30 [amdgpu]

This doesn't look like an atomic (i.e., hard IRQ) context to me.
So it's more likely there is some race going on, as Alex
Deucher suggested.
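
If we want to double-check the context, a quick instrumentation hack would be
to put a might_sleep() next to the delay; with CONFIG_DEBUG_ATOMIC_SLEEP it
splats the moment anyone calls us where sleeping isn't allowed, and
dump_stack() shows the caller either way. A minimal sketch of the idea, not
the actual dce110_blank_stream body:

#include <linux/delay.h>
#include <linux/kernel.h>
#include <linux/printk.h>

/* Illustrative debug helper only -- wraps the sleep under discussion. */
static void blank_stream_delay_debug(unsigned int ms)
{
	might_sleep();	/* warns if this ever runs in atomic context */
	dump_stack();	/* temporary: identify the call path */
	msleep(ms);	/* the sleep the mdelay patch wants to replace */
}

If that never warns, it would back up the theory that this is a race rather
than an atomic-context sleep.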

Harry

>  ? dcn30_internal_validate_bw+0x349/0xa00 [amdgpu]
>  ? slab_post_alloc_hook+0x53/0x270
>  ? dcn31_validate_bandwidth+0x12c/0x2b0 [amdgpu]
>  ? dcn31_validate_bandwidth+0x12c/0x2b0 [amdgpu]
>  ? dc_validate_global_state+0x27a/0x330 [amdgpu]
>  ? slab_post_alloc_hook+0x53/0x270
>  ? __kmalloc+0x18c/0x300
>  ? drm_dp_mst_atomic_setup_commit+0x8a/0x1a0 [drm_display_helper]
>  ? preempt_count_add+0x7c/0xc0
>  ? _raw_spin_unlock_irq+0x1f/0x40
>  ? wait_for_completion_timeout+0x114/0x140
>  ? preempt_count_add+0x7c/0xc0
>  ? _raw_spin_unlock_irq+0x1f/0x40
>  commit_tail+0x99/0x140 [drm_kms_helper]
>  drm_atomic_helper_commit+0x12b/0x150 [drm_kms_helper]
>  drm_atomic_commit+0x79/0x100 [drm]
>  ? drm_plane_get_damage_clips.cold+0x1d/0x1d [drm]
>  drm_framebuffer_remove+0x499/0x510 [drm]
>  drm_mode_rmfb_work_fn+0x4b/0x90 [drm]
>  process_one_work+0x21d/0x3f0
>  worker_thread+0x1fa/0x3c0
>  ? process_one_work+0x3f0/0x3f0
>  kthread+0xff/0x130
>  ? kthread_complete_and_exit+0x20/0x20
>  ret_from_fork+0x22/0x30
>  </TASK>
> 
> 
> ============
> If msleep is replaced by mdelay, the dump_stack output is almost the same:
> 
> $ diff mdelay.log msleep.log
> <  dce110_blank_stream.cold+0x5/0x1f [amdgpu]
> ---
>>  dce110_blank_stream.cold+0x5/0x14 [amdgpu]
> 
> 
> ============
> If the IGT fix is applied (i.e. no crashes occur when removing buffers[i] in reverse order via "igt_remove_fb"), the dump_stack outputs are the same:
> 
> $ diff msleep_igt.log msleep.log
> 2c2
> < CPU: 6 PID: 78 Comm: kworker/6:1 Not tainted 5.19.0-99-custom #126
> ---
>> CPU: 1 PID: 76 Comm: kworker/1:1 Not tainted 5.19.0-99-custom #126
> 
> ============
> Note that the msleep doesn't always trigger crashes. One of the passing dump_stack outputs is below:
> 
> [IGT] kms_bw: starting subtest linear-tiling-2-displays-1920x1080p
> CPU: 6 PID: 78 Comm: kworker/6:1 Not tainted 5.19.0-99-custom #126
> Workqueue: events drm_mode_rmfb_work_fn [drm]
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0x49/0x63
>  dump_stack+0x10/0x16
>  dce110_blank_stream.cold+0x5/0x14 [amdgpu]
>  core_link_disable_stream+0xe0/0x6b0 [amdgpu]
>  ? optc1_set_vtotal_min_max+0x6b/0x80 [amdgpu]
>  dcn31_reset_hw_ctx_wrap+0x229/0x410 [amdgpu]
>  dce110_apply_ctx_to_hw+0x6e/0x6c0 [amdgpu]
>  ? dcn20_plane_atomic_disable+0xb2/0x160 [amdgpu]
>  ? dcn20_disable_plane+0x2c/0x60 [amdgpu]
>  ? dcn20_post_unlock_program_front_end+0x77/0x2c0 [amdgpu]
>  dc_commit_state_no_check+0x39a/0xcd0 [amdgpu]
>  ? dc_validate_global_state+0x2ba/0x330 [amdgpu]
>  dc_commit_state+0x10f/0x150 [amdgpu]
>  amdgpu_dm_atomic_commit_tail+0x631/0x2d30 [amdgpu]
>  ? debug_smp_processor_id+0x17/0x20
>  ? dc_fpu_end+0x4e/0xd0 [amdgpu]
>  ? dcn316_populate_dml_pipes_from_context+0x69/0x2a0 [amdgpu]
>  ? dcn31_calculate_wm_and_dlg_fp+0x50/0x530 [amdgpu]
>  ? kernel_fpu_end+0x26/0x50
>  ? dc_fpu_end+0x4e/0xd0 [amdgpu]
>  ? dcn31_validate_bandwidth+0x12c/0x2b0 [amdgpu]
>  ? dcn31_validate_bandwidth+0x12c/0x2b0 [amdgpu]
>  ? dc_validate_global_state+0x27a/0x330 [amdgpu]
>  ? slab_post_alloc_hook+0x53/0x270
>  ? __kmalloc+0x18c/0x300
>  ? drm_dp_mst_atomic_setup_commit+0x8a/0x1a0 [drm_display_helper]
>  ? preempt_count_add+0x7c/0xc0
>  ? _raw_spin_unlock_irq+0x1f/0x40
>  ? wait_for_completion_timeout+0x114/0x140
>  ? preempt_count_add+0x7c/0xc0
>  ? _raw_spin_unlock_irq+0x1f/0x40
>  commit_tail+0x99/0x140 [drm_kms_helper]
>  drm_atomic_helper_commit+0x12b/0x150 [drm_kms_helper]
>  drm_atomic_commit+0x79/0x100 [drm]
>  ? drm_plane_get_damage_clips.cold+0x1d/0x1d [drm]
>  drm_framebuffer_remove+0x499/0x510 [drm]
>  drm_mode_rmfb_work_fn+0x4b/0x90 [drm]
>  process_one_work+0x21d/0x3f0
>  worker_thread+0x1fa/0x3c0
>  ? process_one_work+0x3f0/0x3f0
>  kthread+0xff/0x130
>  ? kthread_complete_and_exit+0x20/0x20
>  ret_from_fork+0x22/0x30
>  </TASK>
> 
> 
> 
>>
>> Fixing IGT will only mask the issue. Userspace should never be able
>> to put the system in a state where it stops responding entirely. This
>> will need some sort of fix in the kernel.
>>
>> Harry
>>
>>
