[PATCH 3/3] drm/amdgpu: move iommu_resume before ip init/resume
James Zhu
jamesz at amd.com
Tue Sep 7 20:30:04 UTC 2021
On 2021-09-07 1:53 p.m., Felix Kuehling wrote:
> Am 2021-09-07 um 1:51 p.m. schrieb Felix Kuehling:
>> Am 2021-09-07 um 1:22 p.m. schrieb James Zhu:
>>> On 2021-09-07 12:48 p.m., Felix Kuehling wrote:
>>>> Am 2021-09-07 um 12:07 p.m. schrieb James Zhu:
>>>>> Separate iommu_resume from kfd_resume, and move it before
>>>>> other amdgpu ip init/resume.
>>>>>
>>>>> Fixed Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211277
>>>> I think the change is OK. But I don't understand how the IOMMUv2
>>>> initialization sequence could affect a crash in DM. The display should
>>>> not depend on IOMMUv2 at all. What am I missing?
>>> [JZ] It is a weird issue. Disabling the VCN IP block, disabling the
>>> gpu_off feature, or setting pci=noats can all fix the DM crash. Also,
>>> the issue occurs quite randomly: sometimes after a few suspend/resume
>>> cycles, sometimes after a few hundred S/R cycles. The maximum I saw
>>> was 2422 S/R cycles.
>>>
>>> But every time DM crashes, I can see one or two IOMMU errors beforehand:
>>>
>>> AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=****
>>> flags=0x0070]
>> This error is not from IOMMUv2 doing GVA to GPA translations. It's from
>> IOMMUv1 doing GPA to SPA translation. This error points to an invalid
>> physical (GPA) address being used by the GPU to access random system
>> memory it shouldn't be accessing (because there is no valid DMA mapping).
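>>
>> (To illustrate what "valid DMA mapping" means here: a well-behaved
>> driver only hands the device addresses it got back from the DMA API,
>> which is what populates the IOMMUv1 page tables in translated mode.
>> A minimal generic sketch, not amdgpu's actual code; the helper name
>> is hypothetical:
>>
>>     #include <linux/dma-mapping.h>
>>     #include <linux/pci.h>
>>
>>     /* Hypothetical helper: map 'buf' so the device may DMA to it. */
>>     static int example_map_for_dma(struct pci_dev *pdev, void *buf,
>>                                    size_t size, dma_addr_t *out)
>>     {
>>             /* The returned bus/IO address is the only address the
>>              * device is allowed to use; in translated mode this call
>>              * sets up the IOMMUv1 page tables for it. */
>>             *out = dma_map_single(&pdev->dev, buf, size,
>>                                   DMA_BIDIRECTIONAL);
>>             if (dma_mapping_error(&pdev->dev, *out))
>>                     return -ENOMEM;
>>             return 0;
>>     }
>>
>> An IO_PAGE_FAULT means the device used an address outside any such
>> mapping.)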
>>
>> On AMD systems, IOMMUv1 tends to be in pass-through mode when IOMMUv2 is
>> enabled. It's possible that the earlier initialization of IOMMUv2 hides
>> the problem by putting the IOMMU into passthrough mode. I don't think
>> this patch series is a valid solution.
[JZ] Good to know, thanks! So amd_iommu_init_device is for v2 only,
and it should be safe to call amd_iommu_init_device after amdgpu
IP init/resume without any interference.
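
For reference, the v2 init path this series moves around is roughly the
following (a simplified sketch from my reading of kfd_iommu.c; exact
details may differ between trees):

    /* drivers/gpu/drm/amd/amdkfd/kfd_iommu.c (simplified sketch) */
    int kfd_iommu_resume(struct kfd_dev *kfd)
    {
            unsigned int pasid_limit;
            int err;

            if (!kfd->device_info->needs_iommu_device)
                    return 0;

            pasid_limit = kfd_get_pasid_limit();

            /* Bind the device to the IOMMUv2 driver; this does not
             * touch the v1 (DMA remapping) domain. */
            err = amd_iommu_init_device(kfd->pdev, pasid_limit);
            if (err)
                    return -ENXIO;

            /* ... then re-register the PPR/invalidation callbacks and
             * re-bind process PASIDs to the device ... */
            return 0;
    }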
>> You can probably fix the problem with this kernel boot parameter: iommu=pt
[JZ] Still not working after applying iommu=pt:
BOOT_IMAGE=/boot/vmlinuz-5.8.0-41-generic
root=UUID=030a18fe-22f0-49be-818f-192093d543b5 quiet splash
modprobe.blacklist=amdgpu iommu=pt 3
[ 0.612117] iommu: Default domain type: Passthrough (set via kernel
command line)
[ 354.067871] amdgpu 0000:04:00.0: AMD-Vi: Event logged
[IO_PAGE_FAULT domain=0x0000 address=0x32de00040 flags=0x0070]
[ 354.067884] amdgpu 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0000 address=0x32de40000 flags=0x0070]
>> And you can probably reproduce it even with this patch series if instead
>> you add: iommu=nopt amd_iommu=force_isolation
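>> (Assuming a Debian/Ubuntu-style setup, one way to make these
>> persistent is via /etc/default/grub, e.g.:
>>
>>     GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=nopt amd_iommu=force_isolation"
>>
>> followed by update-grub and a reboot.)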
[JZ] Could not set both iommu=nopt and amd_iommu=force_isolation
together. (Does that mean something?)
BOOT_IMAGE=/boot/vmlinuz-5.13.0-custom+
root=UUID=030a18fe-22f0-49be-818f-192093d543b5 quiet splash
modprobe.blacklist=amdgpu iommu=nopt amd_iommu=force_isolation 3
[ 0.294242] iommu: Default domain type: Translated (set via kernel
command line)
[ 0.350675] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4
counters/bank).
[ 106.403927] amdgpu 0000:04:00.0: amdgpu: amdgpu_device_ip_resume
failed (-6).
[ 106.403931] PM: dpm_run_callback(): pci_pm_resume+0x0/0x90 returns -6
[ 106.403941] amdgpu 0000:04:00.0: PM: failed to resume async: error -6
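(Note: -6 is -ENXIO, which would be consistent with
amd_iommu_init_device failing during the early IOMMU resume step that
this series adds.)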
iommu=nopt: passed at least 200 S/R cycles
BOOT_IMAGE=/boot/vmlinuz-5.13.0-custom+
root=UUID=030a18fe-22f0-49be-818f-192093d543b5 quiet splash
modprobe.blacklist=amdgpu iommu=nopt 3
[ 0.294242] iommu: Default domain type: Translated (set via kernel
command line)
amd_iommu=force_isolation: failed at the 1st resume
BOOT_IMAGE=/boot/vmlinuz-5.13.0-custom+
root=UUID=030a18fe-22f0-49be-818f-192093d543b5 quiet splash
modprobe.blacklist=amdgpu amd_iommu=force_isolation 3
[ 0.294242] iommu: Default domain type: Translated
[ 49.513262] PM: suspend entry (deep)
[ 49.514404] Filesystems sync: 0.001 seconds
[ 49.514668] Freezing user space processes ...
[ 69.523111] Freezing of tasks failed after 20.008 seconds (2 tasks
refusing to freeze, wq_busy=0):
[ 69.523163] task:gnome-shell state:D stack: 0 pid: 2196 ppid:
2108 flags:0x00000004
[ 69.523172] Call Trace:
[ 69.523182] __schedule+0x2ee/0x900
[ 69.523193] ? __mod_memcg_lruvec_state+0x22/0xe0
[ 69.523204] schedule+0x4f/0xc0
[ 69.523214] drm_sched_entity_flush+0x17c/0x230 [gpu_sched]
[ 69.523225] ? wait_woken+0x80/0x80
[ 69.523233] amdgpu_ctx_mgr_entity_flush+0x97/0xf0 [amdgpu]
[ 69.523517] amdgpu_flush+0x2b/0x50 [amdgpu]
[ 69.523773] filp_close+0x37/0x70
[ 69.523780] do_close_on_exec+0xda/0x110
[ 69.523787] begin_new_exec+0x59d/0xa40
[ 69.523793] load_elf_binary+0x144/0x1720
[ 69.523801] ? __kernel_read+0x1a0/0x2d0
[ 69.523807] ? __kernel_read+0x1a0/0x2d0
[ 69.523812] ? aa_get_task_label+0x49/0xd0
[ 69.523820] bprm_execve+0x288/0x680
[ 69.523826] do_execveat_common.isra.0+0x189/0x1c0
[ 69.523831] __x64_sys_execve+0x37/0x50
[ 69.523836] do_syscall_64+0x40/0xb0
[ 69.523843] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 69.523851] RIP: 0033:0x7ff1244132fb
[ 69.523856] RSP: 002b:00007fff91a9f2b8 EFLAGS: 00000206 ORIG_RAX:
000000000000003b
[ 69.523862] RAX: ffffffffffffffda RBX: 00007ff11ee2e180 RCX:
00007ff1244132fb
[ 69.523866] RDX: 0000561199f5bc00 RSI: 000056119a1b0890 RDI:
0000561199f2021a
[ 69.523868] RBP: 000000000000001a R08: 00007fff91aa2a58 R09:
000000179a034e00
[ 69.523871] R10: 000056119a1b0890 R11: 0000000000000206 R12:
00007fff91aa2a60
[ 69.523874] R13: 000056119a1b0890 R14: 0000561199f2021a R15:
0000000000000001
[ 69.523882] task:gst-plugin-scan state:D stack: 0 pid: 2213 ppid:
2199 flags:0x00004004
[ 69.523888] Call Trace:
[ 69.523891] __schedule+0x2ee/0x900
[ 69.523897] schedule+0x4f/0xc0
[ 69.523902] drm_sched_entity_flush+0x17c/0x230 [gpu_sched]
[ 69.523912] ? wait_woken+0x80/0x80
[ 69.523918] drm_sched_entity_destroy+0x18/0x30 [gpu_sched]
[ 69.523928] amdgpu_vm_fini+0x256/0x3d0 [amdgpu]
[ 69.524210] amdgpu_driver_postclose_kms+0x179/0x240 [amdgpu]
[ 69.524444] drm_file_free.part.0+0x1e5/0x250 [drm]
[ 69.524481] ? dma_fence_release+0x140/0x140
[ 69.524489] drm_close_helper.isra.0+0x65/0x70 [drm]
[ 69.524524] drm_release+0x6e/0xf0 [drm]
[ 69.524559] __fput+0x9f/0x250
[ 69.524564] ____fput+0xe/0x10
[ 69.524569] task_work_run+0x70/0xb0
[ 69.524575] exit_to_user_mode_prepare+0x1c8/0x1d0
[ 69.524581] syscall_exit_to_user_mode+0x27/0x50
[ 69.524586] ? __x64_sys_close+0x12/0x40
[ 69.524589] do_syscall_64+0x4d/0xb0
[ 69.524594] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 69.524599] RIP: 0033:0x7f2c12adb9ab
[ 69.524602] RSP: 002b:00007fff981aaaa0 EFLAGS: 00000293 ORIG_RAX:
0000000000000003
[ 69.524606] RAX: 0000000000000000 RBX: 0000556b6f83f060 RCX:
00007f2c12adb9ab
[ 69.524608] RDX: 0000000000000014 RSI: 0000556b6f841400 RDI:
0000000000000006
[ 69.524611] RBP: 0000556b6f83f100 R08: 0000000000000000 R09:
000000000000000e
[ 69.524613] R10: 00007fff981db090 R11: 0000000000000293 R12:
0000556b6f841400
[ 69.524616] R13: 00007f2c12763e30 R14: 0000556b6f817330 R15:
00007f2c127420b4
>> Regards,
>> Felix
>>
>>
>>> Since we can't stop the HW/FW/SW right away after an IO page fault is
>>> detected, I can't tell which part tried to access system memory
>>> through the IOMMU.
>>>
>>> But after moving the IOMMU device init before the other amdgpu IP
>>> init/resume, the DM crash / IOMMU page fault issues are gone.
>>>
>>> These patches can't directly explain why the issue is fixed, but the
>>> new sequence makes more sense to me.
>>>
>>> Can I have your RB on these patches?
>>>
>>> Thanks!
>>> James
>>>
>>>> Regards,
>>>> Felix
>>>>
>>>>
>>>>> Signed-off-by: James Zhu <James.Zhu at amd.com>
>>>>> ---
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++++++++
>>>>> 1 file changed, 12 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index 653bd8f..e3f0308 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -2393,6 +2393,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
>>>>>  	if (r)
>>>>>  		goto init_failed;
>>>>>
>>>>> +	r = amdgpu_amdkfd_resume_iommu(adev);
>>>>> +	if (r)
>>>>> +		goto init_failed;
>>>>> +
>>>>>  	r = amdgpu_device_ip_hw_init_phase1(adev);
>>>>>  	if (r)
>>>>>  		goto init_failed;
>>>>> @@ -3147,6 +3151,10 @@ static int amdgpu_device_ip_resume(struct amdgpu_device *adev)
>>>>>  {
>>>>>  	int r;
>>>>>
>>>>> +	r = amdgpu_amdkfd_resume_iommu(adev);
>>>>> +	if (r)
>>>>> +		return r;
>>>>> +
>>>>>  	r = amdgpu_device_ip_resume_phase1(adev);
>>>>>  	if (r)
>>>>>  		return r;
>>>>> @@ -4602,6 +4610,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>>>>  			dev_warn(tmp_adev->dev, "asic atom init failed!");
>>>>>  		} else {
>>>>>  			dev_info(tmp_adev->dev, "GPU reset succeeded, trying to resume\n");
>>>>> +			r = amdgpu_amdkfd_resume_iommu(tmp_adev);
>>>>> +			if (r)
>>>>> +				goto out;
>>>>> +
>>>>>  			r = amdgpu_device_ip_resume_phase1(tmp_adev);
>>>>>  			if (r)
>>>>>  				goto out;
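
[JZ] For context: amdgpu_amdkfd_resume_iommu is the helper this patch
calls, presumably introduced earlier in the series. It is roughly the
following thin wrapper (a sketch; the exact form may differ):

    int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev)
    {
            int r = 0;

            /* Only touch the IOMMUv2 path when KFD probed this device */
            if (adev->kfd.dev)
                    r = kgd2kfd_resume_iommu(adev->kfd.dev);

            return r;
    }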