[PATCH 05/18] drm/amd/display: Use mdelay to avoid crashes

Alex Hung alex.hung at amd.com
Thu Dec 15 21:02:07 UTC 2022



On 2022-12-15 08:17, Harry Wentland wrote:
> 
> 
> On 12/15/22 05:29, Michel Dänzer wrote:
>> On 12/15/22 09:09, Christian König wrote:
>>> Am 15.12.22 um 00:33 schrieb Alex Hung:
>>>> On 2022-12-14 16:06, Alex Deucher wrote:
>>>>> On Wed, Dec 14, 2022 at 5:56 PM Alex Hung <alex.hung at amd.com> wrote:
>>>>>> On 2022-12-14 15:35, Alex Deucher wrote:
>>>>>>> On Wed, Dec 14, 2022 at 5:25 PM Alex Hung <alex.hung at amd.com> wrote:
>>>>>>>> On 2022-12-14 14:54, Alex Deucher wrote:
>>>>>>>>> On Wed, Dec 14, 2022 at 4:50 PM Alex Hung <alex.hung at amd.com> wrote:
>>>>>>>>>> On 2022-12-14 13:48, Alex Deucher wrote:
>>>>>>>>>>> On Wed, Dec 14, 2022 at 3:22 PM Aurabindo Pillai
>>>>>>>>>>> <aurabindo.pillai at amd.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> From: Alex Hung <alex.hung at amd.com>
>>>>>>>>>>>>
>>>>>>>>>>>> [Why]
>>>>>>>>>>>> When running IGT kms_bw test with DP monitor, some systems crash from
>>>>>>>>>>>> msleep no matter how long or short the time is.
>>>>>>>>>>>>
>>>>>>>>>>>> [How]
>>>>>>>>>>>> To replace msleep with mdelay.
>>>>>>>>>>>
>>>>>>>>>>> Can you provide a bit more info about the crash?  A lot of platforms
>>>>>>>>>>> don't support delay larger than 2-4ms so this change will generate
>>>>>>>>>>> errors on ARM and possibly other platforms.
>>>>>>>>>>>
>>>>>>>>>>> Alex
>>>>>>>>>>
>>>>>>>>>> The msleep was introduced in eec3303de3378 for non-compliant display
>>>>>>>>>> port monitors but IGT's kms_bw test can cause a recent h/w to hang at
>>>>>>>>>> msleep(60) when calling "igt_remove_fb" in IGT
>>>>>>>>>> (https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/kms_bw.c#L197).
>>>>>>>>>> It is possible to work around this by reversing the order of the
>>>>>>>>>> igt_remove_fb(&buffer[i]) calls, as in the following example:
>>>>>>>>>>
>>>>>>>>>>        igt_create_color_fb with the order buffer[0], buffer[1], buffer[2]
>>>>>>>>>>
>>>>>>>>>> Hangs:
>>>>>>>>>>        igt_remove_fb with the order buffer[0], buffer[1], buffer[2]
>>>>>>>>>>
>>>>>>>>>> No hangs:
>>>>>>>>>>        igt_remove_fb with the reversed order buffer[2], buffer[1], buffer[0]
>>>>>>>>>>
>>>>>>>>>> However, IGT simply exposes the problem and it makes more sense to stop
>>>>>>>>>> the hang from occurring.
>>>>>>>>>>
>>>>>>>>>> I also tried removing the msleep completely, and that also works, but I
>>>>>>>>>> didn't want to break the fix for the originally problematic hardware
>>>>>>>>>> configuration.
>>>>>>>>>
>>>>>>>>> Why does sleep vs delay make a difference?  Is there some race that we
>>>>>>>>> are not locking against?
>>>>>>>>>
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>
>>>>>>>> That was my original thought, but I did not find any race previously. I
>>>>>>>> will investigate it again.
>>>>>>>>
>>>>>>>> If mdelay(>4) isn't usable on other platforms, is it an option to use
>>>>>>>> mdelay on x86_64 only and keep msleep on other platforms or just remove
>>>>>>>> the msleep for other platforms, something like
>>>>>>>>
>>>>>>>> -                       msleep(60);
>>>>>>>> +#ifdef CONFIG_X86_64
>>>>>>>> +                       mdelay(60);
>>>>>>>> +#endif
>>>>>>>
>>>>>>> That's pretty ugly.  I'd rather try and resolve the root cause.  How
>>>>>>> important is the IGT test?  What does it do?  Is the test itself
>>>>>>> correct?
>>>>>>>
>>>>>>
>>>>>> Agreed, and I didn't want to add conditions around the mdelay for the
>>>>>> same reason. I will assume this is not an option now.
>>>>>>
>>>>>> As in the previous comment, IGT can be modified to avoid the crash by
>>>>>> reversing the order in which framebuffers are removed, though I suspect
>>>>>> I will receive questions about why this is not fixed in the kernel.
>>>>>>
>>>>>> I wanted to fix this in the kernel because nothing stops other user-space
>>>>>> applications from using the same path to crash the kernel, so fixing IGT
>>>>>> is the second option.
>>>>>>
>>>>>> Apparently causing problems on other platforms isn't an option at all, so
>>>>>> I will try to figure out a non-mdelay solution, and then maybe an IGT
>>>>>> solution instead.
>>>>>
>>>>> What hangs?  The test or the kernel or the hardware?
>>>>
>>>> The system becomes completely unresponsive - no keyboard, mouse nor remote accesses.
>>>
>>> I agree with Alex that changing this is extremely questionable and not justified at all.
>>>
>>> My educated guess is that by using mdelay() instead of msleep() we keep the CPU core busy, preventing something from happening at the same time as something else.
>>>
>>> This clearly points to missing locking or similar to protect concurrent execution of things.
>> Might another possibility be that this code gets called from an atomic context which can't sleep?
>>
>>
> 
> It can come through handle_hpd_rx_irq but we're using a workqueue
> to queue interrupt handling so this shouldn't come from an atomic
> context. I currently don't see where else it might be used in an
> atomic context. Alex Hung, can you do a dump_stack() in this function
> to see where the problematic call is coming from?


IGT's kms_bw executes as below (when passing):

IGT-Version: 1.26-gf4067678 (x86_64) (Linux: 5.19.0-99-custom x86_64)
Starting subtest: linear-tiling-1-displays-1920x1080p
Subtest linear-tiling-1-displays-1920x1080p: SUCCESS (0.225s)
Starting subtest: linear-tiling-1-displays-2560x1440p
Subtest linear-tiling-1-displays-2560x1440p: SUCCESS (0.111s)
Starting subtest: linear-tiling-1-displays-3840x2160p
Subtest linear-tiling-1-displays-3840x2160p: SUCCESS (0.118s)
Starting subtest: linear-tiling-2-displays-1920x1080p
Subtest linear-tiling-2-displays-1920x1080p: SUCCESS (0.409s)
Starting subtest: linear-tiling-2-displays-2560x1440p
Subtest linear-tiling-2-displays-2560x1440p: SUCCESS (0.417s)
Starting subtest: linear-tiling-2-displays-3840x2160p
Subtest linear-tiling-2-displays-3840x2160p: SUCCESS (0.444s)
Starting subtest: linear-tiling-3-displays-1920x1080p
Subtest linear-tiling-3-displays-1920x1080p: SUCCESS (0.547s)
Starting subtest: linear-tiling-3-displays-2560x1440p
Subtest linear-tiling-3-displays-2560x1440p: SUCCESS (0.555s)
Starting subtest: linear-tiling-3-displays-3840x2160p
Subtest linear-tiling-3-displays-3840x2160p: SUCCESS (0.586s)
Starting subtest: linear-tiling-4-displays-1920x1080p
Subtest linear-tiling-4-displays-1920x1080p: SUCCESS (0.734s)
Starting subtest: linear-tiling-4-displays-2560x1440p
Subtest linear-tiling-4-displays-2560x1440p: SUCCESS (0.742s)
Starting subtest: linear-tiling-4-displays-3840x2160p
Subtest linear-tiling-4-displays-3840x2160p: SUCCESS (0.778s)
Starting subtest: linear-tiling-5-displays-1920x1080p
Subtest linear-tiling-5-displays-1920x1080p: SUCCESS (0.734s)
Starting subtest: linear-tiling-5-displays-2560x1440p
Subtest linear-tiling-5-displays-2560x1440p: SUCCESS (0.743s)
Starting subtest: linear-tiling-5-displays-3840x2160p
Subtest linear-tiling-5-displays-3840x2160p: SUCCESS (0.781s)
Starting subtest: linear-tiling-6-displays-1920x1080p
Test requirement not met in function run_test_linear_tiling, file ../tests/kms_bw.c:156:
Test requirement: !(pipe > num_pipes)
ASIC does not have 5 pipes
Subtest linear-tiling-6-displays-1920x1080p: SKIP (0.000s)
Starting subtest: linear-tiling-6-displays-2560x1440p
Test requirement not met in function run_test_linear_tiling, file ../tests/kms_bw.c:156:
Test requirement: !(pipe > num_pipes)
ASIC does not have 5 pipes
Subtest linear-tiling-6-displays-2560x1440p: SKIP (0.000s)
Starting subtest: linear-tiling-6-displays-3840x2160p
Test requirement not met in function run_test_linear_tiling, file ../tests/kms_bw.c:156:
Test requirement: !(pipe > num_pipes)
ASIC does not have 5 pipes
Subtest linear-tiling-6-displays-3840x2160p: SKIP (0.000s)

The crash usually occurs when executing
"linear-tiling-3-displays-1920x1080p", but it can also occur at
"linear-tiling-3-displays-2560x1440p".

============
This is dump_stack right before the failing msleep.

[IGT] kms_bw: starting subtest linear-tiling-3-displays-1920x1080p
CPU: 1 PID: 76 Comm: kworker/1:1 Not tainted 5.19.0-99-custom #126
Workqueue: events drm_mode_rmfb_work_fn [drm]
Call Trace:
  <TASK>
  dump_stack_lvl+0x49/0x63
  dump_stack+0x10/0x16
  dce110_blank_stream.cold+0x5/0x14 [amdgpu]
  core_link_disable_stream+0xe0/0x6b0 [amdgpu]
  ? optc1_set_vtotal_min_max+0x6b/0x80 [amdgpu]
  dcn31_reset_hw_ctx_wrap+0x229/0x410 [amdgpu]
  dce110_apply_ctx_to_hw+0x6e/0x6c0 [amdgpu]
  ? dcn20_plane_atomic_disable+0xb2/0x160 [amdgpu]
  ? dcn20_disable_plane+0x2c/0x60 [amdgpu]
  ? dcn20_post_unlock_program_front_end+0x77/0x2c0 [amdgpu]
  dc_commit_state_no_check+0x39a/0xcd0 [amdgpu]
  ? dc_validate_global_state+0x2ba/0x330 [amdgpu]
  dc_commit_state+0x10f/0x150 [amdgpu]
  amdgpu_dm_atomic_commit_tail+0x631/0x2d30 [amdgpu]
  ? dcn30_internal_validate_bw+0x349/0xa00 [amdgpu]
  ? slab_post_alloc_hook+0x53/0x270
  ? dcn31_validate_bandwidth+0x12c/0x2b0 [amdgpu]
  ? dcn31_validate_bandwidth+0x12c/0x2b0 [amdgpu]
  ? dc_validate_global_state+0x27a/0x330 [amdgpu]
  ? slab_post_alloc_hook+0x53/0x270
  ? __kmalloc+0x18c/0x300
  ? drm_dp_mst_atomic_setup_commit+0x8a/0x1a0 [drm_display_helper]
  ? preempt_count_add+0x7c/0xc0
  ? _raw_spin_unlock_irq+0x1f/0x40
  ? wait_for_completion_timeout+0x114/0x140
  ? preempt_count_add+0x7c/0xc0
  ? _raw_spin_unlock_irq+0x1f/0x40
  commit_tail+0x99/0x140 [drm_kms_helper]
  drm_atomic_helper_commit+0x12b/0x150 [drm_kms_helper]
  drm_atomic_commit+0x79/0x100 [drm]
  ? drm_plane_get_damage_clips.cold+0x1d/0x1d [drm]
  drm_framebuffer_remove+0x499/0x510 [drm]
  drm_mode_rmfb_work_fn+0x4b/0x90 [drm]
  process_one_work+0x21d/0x3f0
  worker_thread+0x1fa/0x3c0
  ? process_one_work+0x3f0/0x3f0
  kthread+0xff/0x130
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x22/0x30
  </TASK>


============
If msleep is replaced by mdelay, the dump_stack is almost the same:

$ diff mdelay.log msleep.log
<  dce110_blank_stream.cold+0x5/0x1f [amdgpu]
---
 >  dce110_blank_stream.cold+0x5/0x14 [amdgpu]


============
If the IGT fix is applied (i.e. no crashes when removing buffer[i] in
reverse order with "igt_remove_fb"), the dump_stack outputs are the same:

$ diff msleep_igt.log msleep.log
2c2
< CPU: 6 PID: 78 Comm: kworker/6:1 Not tainted 5.19.0-99-custom #126
---
 > CPU: 1 PID: 76 Comm: kworker/1:1 Not tainted 5.19.0-99-custom #126

============
Note the msleep doesn't always trigger a crash. One of the passing
dump_stack outputs is below:

[IGT] kms_bw: starting subtest linear-tiling-2-displays-1920x1080p
CPU: 6 PID: 78 Comm: kworker/6:1 Not tainted 5.19.0-99-custom #126
Workqueue: events drm_mode_rmfb_work_fn [drm]
Call Trace:
  <TASK>
  dump_stack_lvl+0x49/0x63
  dump_stack+0x10/0x16
  dce110_blank_stream.cold+0x5/0x14 [amdgpu]
  core_link_disable_stream+0xe0/0x6b0 [amdgpu]
  ? optc1_set_vtotal_min_max+0x6b/0x80 [amdgpu]
  dcn31_reset_hw_ctx_wrap+0x229/0x410 [amdgpu]
  dce110_apply_ctx_to_hw+0x6e/0x6c0 [amdgpu]
  ? dcn20_plane_atomic_disable+0xb2/0x160 [amdgpu]
  ? dcn20_disable_plane+0x2c/0x60 [amdgpu]
  ? dcn20_post_unlock_program_front_end+0x77/0x2c0 [amdgpu]
  dc_commit_state_no_check+0x39a/0xcd0 [amdgpu]
  ? dc_validate_global_state+0x2ba/0x330 [amdgpu]
  dc_commit_state+0x10f/0x150 [amdgpu]
  amdgpu_dm_atomic_commit_tail+0x631/0x2d30 [amdgpu]
  ? debug_smp_processor_id+0x17/0x20
  ? dc_fpu_end+0x4e/0xd0 [amdgpu]
  ? dcn316_populate_dml_pipes_from_context+0x69/0x2a0 [amdgpu]
  ? dcn31_calculate_wm_and_dlg_fp+0x50/0x530 [amdgpu]
  ? kernel_fpu_end+0x26/0x50
  ? dc_fpu_end+0x4e/0xd0 [amdgpu]
  ? dcn31_validate_bandwidth+0x12c/0x2b0 [amdgpu]
  ? dcn31_validate_bandwidth+0x12c/0x2b0 [amdgpu]
  ? dc_validate_global_state+0x27a/0x330 [amdgpu]
  ? slab_post_alloc_hook+0x53/0x270
  ? __kmalloc+0x18c/0x300
  ? drm_dp_mst_atomic_setup_commit+0x8a/0x1a0 [drm_display_helper]
  ? preempt_count_add+0x7c/0xc0
  ? _raw_spin_unlock_irq+0x1f/0x40
  ? wait_for_completion_timeout+0x114/0x140
  ? preempt_count_add+0x7c/0xc0
  ? _raw_spin_unlock_irq+0x1f/0x40
  commit_tail+0x99/0x140 [drm_kms_helper]
  drm_atomic_helper_commit+0x12b/0x150 [drm_kms_helper]
  drm_atomic_commit+0x79/0x100 [drm]
  ? drm_plane_get_damage_clips.cold+0x1d/0x1d [drm]
  drm_framebuffer_remove+0x499/0x510 [drm]
  drm_mode_rmfb_work_fn+0x4b/0x90 [drm]
  process_one_work+0x21d/0x3f0
  worker_thread+0x1fa/0x3c0
  ? process_one_work+0x3f0/0x3f0
  kthread+0xff/0x130
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x22/0x30
  </TASK>



> 
> Fixing IGT will only mask the issue. Userspace should never be able
> to put the system in a state where it stops responding entirely. This
> will need some sort of fix in the kernel.
> 
> Harry
> 
> 

