[PATCH 1/5] drm/amd/display: Remove migrate_en/dis from dc_fpu_begin().

Hamza Mahfooz hamza.mahfooz at amd.com
Wed Oct 4 12:10:35 UTC 2023


On 10/3/23 15:53, Harry Wentland wrote:
> On 2023-09-21 10:15, Sebastian Andrzej Siewior wrote:
>> This is a revert of the commit mentioned below while it is not wrong, as
>> in the kernel will explode, having migrate_disable() here it is
>> complete waste of resources.
>>
>> Additionally commit message is plain wrong the review tag does not make
> 
> Not sure I follow what's unhelpful about the review tag with
> 0c316556d1249 ("drm/amd/display: Disable migration to ensure consistency of per-CPU variable")
> 
> I do wish the original patch showed the splat it's attempting
> to fix. It apparently made a difference for something, whether
> inadvertently or not. I wish I knew what that "something" was.

I did some digging, and it seems like the intention of that patch was to
fix the following splat:

WARNING: CPU: 5 PID: 1062 at 
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/dc_fpu.c:71 
dc_assert_fp_enabled+0x1a/0x30 [amdgpu]
[...]
CPU: 5 PID: 1062 Comm: Xorg Tainted: G           OE 
5.15.0-56-generic #62-Ubuntu
Hardware name: ASUS System Product Name/ROG STRIX Z590-F GAMING WIFI, 
BIOS 1202 10/27/2021
RIP: 0010:dc_assert_fp_enabled+0x1a/0x30 [amdgpu]
Code: ff 45 31 f6 0f 0b e9 ca fe ff ff e8 90 1c 1f f7 48 c7 c0 00 30 03 
00 65 48 03 05 b1 aa 86 3f 8b 00 85 c0 7e 05 c3 cc cc cc cc <0f> 0b c3 
cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00
RSP: 0000:ffffb89b82a8f118 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8c271cd00000 RCX: 0000000000000000
RDX: ffff8c2708025000 RSI: ffff8c270e8c0000 RDI: ffff8c271cd00000
RBP: ffffb89b82a8f1d0 R08: 0000000000000000 R09: 7fffff6affffffff
R10: ffffb89b82a8f240 R11: 0000000000000000 R12: 0000000000000002
R13: ffff8c271cd00000 R14: ffff8c270e8c0000 R15: ffff8c2708025000
FS:  00007f0570019a80(0000) GS:ffff8c2e3fb40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005594858a0058 CR3: 000000010e204001 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
  <TASK>
  ? dcn20_populate_dml_pipes_from_context+0x47/0x1730 [amdgpu]
  ? __kmalloc_node+0x2cb/0x3a0
  dcn32_populate_dml_pipes_from_context+0x2b/0x450 [amdgpu]
  dcn32_internal_validate_bw+0x15f9/0x2670 [amdgpu]
  dcn32_find_dummy_latency_index_for_fw_based_mclk_switch+0xd0/0x310 
[amdgpu]
  dcn32_calculate_wm_and_dlg_fpu+0xe6/0x1e50 [amdgpu]
  dcn32_calculate_wm_and_dlg+0x46/0x70 [amdgpu]
  dcn32_validate_bandwidth+0x1b7/0x3e0 [amdgpu]
  dc_validate_global_state+0x32c/0x560 [amdgpu]
  dc_validate_with_context+0x6e6/0xd80 [amdgpu]
  dc_commit_streams+0x21b/0x500 [amdgpu]
  dc_commit_state+0xf3/0x150 [amdgpu]
  amdgpu_dm_atomic_commit_tail+0x60d/0x3120 [amdgpu]
  ? dcn32_internal_validate_bw+0xcf8/0x2670 [amdgpu]
  ? fill_plane_buffer_attributes+0x1e5/0x560 [amdgpu]
  ? dcn32_validate_bandwidth+0x1e0/0x3e0 [amdgpu]
  ? kfree+0x1f7/0x250
  ? dcn32_validate_bandwidth+0x1e0/0x3e0 [amdgpu]
  ? dc_validate_global_state+0x32c/0x560 [amdgpu]
  ? __cond_resched+0x1a/0x50
  ? __wait_for_common+0x3e/0x150
  ? fill_plane_buffer_attributes+0x1e5/0x560 [amdgpu]
  ? usleep_range_state+0x90/0x90
  ? wait_for_completion_timeout+0x1d/0x30
  commit_tail+0xc2/0x170 [drm_kms_helper]
  ? drm_atomic_helper_swap_state+0x20f/0x370 [drm_kms_helper]
  drm_atomic_helper_commit+0x12b/0x150 [drm_kms_helper]
  amdgpu_dm_atomic_commit+0x11/0x20 [amdgpu]
  drm_atomic_commit+0x47/0x60 [drm]
  drm_mode_obj_set_property_ioctl+0x16b/0x420 [drm]
  ? mutex_lock+0x13/0x50
  ? drm_mode_createblob_ioctl+0xf6/0x130 [drm]
  ? drm_mode_obj_find_prop_id+0x90/0x90 [drm]
  drm_ioctl_kernel+0xb0/0x100 [drm]
  drm_ioctl+0x268/0x4b0 [drm]
  ? drm_mode_obj_find_prop_id+0x90/0x90 [drm]
  ? ktime_get_mono_fast_ns+0x4a/0xa0
  amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
  __x64_sys_ioctl+0x92/0xd0
  do_syscall_64+0x59/0xc0
  ? do_user_addr_fault+0x1e7/0x670
  ? do_syscall_64+0x69/0xc0
  ? exit_to_user_mode_prepare+0x37/0xb0
  ? irqentry_exit_to_user_mode+0x9/0x20
  ? irqentry_exit+0x1d/0x30
  ? exc_page_fault+0x89/0x170
  entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7f05704a2aff
Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 
44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 
3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
RSP: 002b:00007ffc8c45a3f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffc8c45a480 RCX: 00007f05704a2aff
RDX: 00007ffc8c45a480 RSI: 00000000c01864ba RDI: 000000000000000e
RBP: 00000000c01864ba R08: 0000000000000077 R09: 00000000cccccccc
R10: 00007f05705a22f0 R11: 0000000000000246 R12: 0000000000000004
R13: 000000000000000e R14: 000000000000000f R15: 00007ffc8c45a8a8
  </TASK>
---[ end trace 4deab30bb69df00f ]---

> 
> Harry
> 
>> it any better. The migrate_disable() interface has a fat comment
>> describing it and it includes the word "undesired" in the headline which
>> should tickle people to read it before using it.
>> Initially I assumed it is worded too harsh but now I beg to differ.
>>
>> The reviewer of the original commit, even not understanding what
>> migrate_disable() does should ask the following:
>>
>> - migrate_disable() is added only to the CONFIG_X86 block and it claims
>>    to protect fpu_recursion_depth. Why are the other the architectures
>>    excluded?
>>
>> - migrate_disable() is added after fpu_recursion_depth was modified.
>>    Shouldn't it be added before the modification or referencing takes
>>    place?
>>
>> Moving on.
>> Disabling preemption DOES prevent CPU migration. A task, that can not be
>> pushed away from the CPU by the scheduler (due to disabled preemption)
>> can not be pushed or migrated to another CPU.
>>
>> Disabling migration DOES NOT ensure consistency of per-CPU variables. It
>> only ensures that the task acts always on the same per-CPU variable. The
>> task remains preemptible meaning multiple tasks can access the same
>> per-CPU variable. This in turn leads to inconsistency for the statement
>>
>>                    *pcpu -= 1;
>>
>> with two tasks on one CPU and a preemption point during the RMW
>> operation:
>>
>>       Task A                           Task B
>>       read pcpu to reg  # 0
>>       inc reg           # 0 -> 1
>>                                        read pcpu to reg  # 0
>>                                        inc reg           # 0 -> 1
>>                                        write reg to pcpu # 1
>>       write reg to pcpu # 1
>>
>> At the end pcpu reads 1 but should read 2 instead. Boom.
>>
>> get_cpu_ptr() already contains a preempt_disable() statement. That means
>> that the per-CPU variable can only be referenced by a single task which
>> is currently running. The only inconsistency that can occur if the
>> variable is additionally accessed from an interrupt.
>>
>> Remove migrate_disable/enable() from dc_fpu_begin/end().
>>
>> Cc: Tianci Yin <tianci.yin at amd.com>
>> Cc: Aurabindo Pillai <aurabindo.pillai at amd.com>
>> Fixes: 0c316556d1249 ("drm/amd/display: Disable migration to ensure consistency of per-CPU variable")
>> Signed-off-by: Sebastian Andrzej Siewior <bigeasy at linutronix.de>
>> ---
>>   drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c | 2 --
>>   1 file changed, 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c b/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c
>> index 172aa10a8800f..86f4c0e046548 100644
>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c
>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c
>> @@ -91,7 +91,6 @@ void dc_fpu_begin(const char *function_name, const int line)
>>   
>>   	if (*pcpu == 1) {
>>   #if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH)
>> -		migrate_disable();
>>   		kernel_fpu_begin();
>>   #elif defined(CONFIG_PPC64)
>>   		if (cpu_has_feature(CPU_FTR_VSX_COMP)) {
>> @@ -132,7 +131,6 @@ void dc_fpu_end(const char *function_name, const int line)
>>   	if (*pcpu <= 0) {
>>   #if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH)
>>   		kernel_fpu_end();
>> -		migrate_enable();
>>   #elif defined(CONFIG_PPC64)
>>   		if (cpu_has_feature(CPU_FTR_VSX_COMP)) {
>>   			disable_kernel_vsx();
> 
-- 
Hamza



More information about the dri-devel mailing list