[PATCH 3/3] drm/amdgpu: move iommu_resume before ip init/resume
James Zhu
jamesz at amd.com
Tue Sep 7 20:30:04 UTC 2021
On 2021-09-07 1:53 p.m., Felix Kuehling wrote:
> Am 2021-09-07 um 1:51 p.m. schrieb Felix Kuehling:
>> Am 2021-09-07 um 1:22 p.m. schrieb James Zhu:
>>> On 2021-09-07 12:48 p.m., Felix Kuehling wrote:
>>>> Am 2021-09-07 um 12:07 p.m. schrieb James Zhu:
>>>>> Separate iommu_resume from kfd_resume, and move it before
>>>>> other amdgpu ip init/resume.
>>>>>
>>>>> Fixed Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211277
>>>> I think the change is OK. But I don't understand how the IOMMUv2
>>>> initialization sequence could affect a crash in DM. The display should
>>>> not depend on IOMMUv2 at all. What am I missing?
>>> [JZ] It is a weird issue. Disabling the VCN IP block, disabling the
>>> gpu_off feature, or setting pci=noats can all fix the DM crash. Also,
>>> the issue occurs quite randomly: sometimes after a few suspend/resume
>>> cycles, sometimes after a few hundred S/R cycles. The maximum I saw
>>> was 2422 S/R cycles.
>>>
>>> But every time DM crashes, I can see one or two IOMMU errors beforehand:
>>>
>>> AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=****
>>> flags=0x0070]
>> This error is not from IOMMUv2 doing GVA to GPA translations. It's from
>> IOMMUv1 doing GPA to SPA translation. This error points to an invalid
>> physical (GPA) address being used by the GPU to access random system
>> memory it shouldn't be accessing (because there is no valid DMA mapping).
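>>
>> (To illustrate what "valid DMA mapping" means here: a well-behaved
>> driver only hands the device addresses it got back from the DMA API,
>> which is what populates the IOMMUv1 page tables in translated mode.
>> A minimal generic sketch, not amdgpu's actual code; the helper name
>> is hypothetical:
>>
>>     #include <linux/dma-mapping.h>
>>     #include <linux/pci.h>
>>
>>     /* Hypothetical helper: map 'buf' so the device may DMA to it. */
>>     static int example_map_for_dma(struct pci_dev *pdev, void *buf,
>>                                    size_t size, dma_addr_t *out)
>>     {
>>             /* The returned bus/IO address is the only address the
>>              * device is allowed to use; in translated mode this call
>>              * sets up the IOMMUv1 page tables for it. */
>>             *out = dma_map_single(&pdev->dev, buf, size,
>>                                   DMA_BIDIRECTIONAL);
>>             if (dma_mapping_error(&pdev->dev, *out))
>>                     return -ENOMEM;
>>             return 0;
>>     }
>>
>> An IO_PAGE_FAULT means the device used an address outside any such
>> mapping.)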
>>
>> On AMD systems, IOMMUv1 tends to be in pass-through mode when IOMMUv2 is
>> enabled. It's possible that the earlier initialization of IOMMUv2 hides
>> the problem by putting the IOMMU into passthrough mode. I don't think
>> this patch series is a valid solution.
[JZ] Good to know, thanks! So amd_iommu_init_device is for v2 only,
and it should be safe to call amd_iommu_init_device after amdgpu
IP init/resume without any interference.
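
For reference, the v2 init path this series moves around is roughly the
following (a simplified sketch from my reading of kfd_iommu.c; exact
details may differ between trees):

    /* drivers/gpu/drm/amd/amdkfd/kfd_iommu.c (simplified sketch) */
    int kfd_iommu_resume(struct kfd_dev *kfd)
    {
            unsigned int pasid_limit;
            int err;

            if (!kfd->device_info->needs_iommu_device)
                    return 0;

            pasid_limit = kfd_get_pasid_limit();

            /* Bind the device to the IOMMUv2 driver; this does not
             * touch the v1 (DMA remapping) domain. */
            err = amd_iommu_init_device(kfd->pdev, pasid_limit);
            if (err)
                    return -ENXIO;

            /* ... then re-register the PPR/invalidation callbacks and
             * re-bind process PASIDs to the device ... */
            return 0;
    }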
>> You can probably fix the problem with this kernel boot parameter: iommu=pt
[JZ] Still not working after applying iommu=pt:
BOOT_IMAGE=/boot/vmlinuz-5.8.0-41-generic
root=UUID=030a18fe-22f0-49be-818f-192093d543b5 quiet splash
modprobe.blacklist=amdgpu iommu=pt 3
[ 0.612117] iommu: Default domain type: Passthrough (set via kernel
command line)
[ 354.067871] amdgpu 0000:04:00.0: AMD-Vi: Event logged
[IO_PAGE_FAULT domain=0x0000 address=0x32de00040 flags=0x0070]
[ 354.067884] amdgpu 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0000 address=0x32de40000 flags=0x0070]
>> And you can probably reproduce it even with this patch series if instead
>> you add: iommu=nopt amd_iommu=force_isolation
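>> (Assuming a Debian/Ubuntu-style setup, one way to make these
>> persistent is via /etc/default/grub, e.g.:
>>
>>     GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=nopt amd_iommu=force_isolation"
>>
>> followed by update-grub and a reboot.)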
[JZ] Could not set both iommu=nopt and amd_iommu=force_isolation
together. (Does that mean something?)
BOOT_IMAGE=/boot/vmlinuz-5.13.0-custom+
root=UUID=030a18fe-22f0-49be-818f-192093d543b5 quiet splash
modprobe.blacklist=amdgpu iommu=nopt amd_iommu=force_isolation 3
[ 0.294242] iommu: Default domain type: Translated (set via kernel
command line)
[ 0.350675] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4
counters/bank).
[ 106.403927] amdgpu 0000:04:00.0: amdgpu: amdgpu_device_ip_resume
failed (-6).
[ 106.403931] PM: dpm_run_callback(): pci_pm_resume+0x0/0x90 returns -6
[ 106.403941] amdgpu 0000:04:00.0: PM: failed to resume async: error -6
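(Note: -6 is -ENXIO, which would be consistent with
amd_iommu_init_device failing during the early IOMMU resume step that
this series adds.)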
iommu=nopt: passed at least 200 S/R cycles
BOOT_IMAGE=/boot/vmlinuz-5.13.0-custom+
root=UUID=030a18fe-22f0-49be-818f-192093d543b5 quiet splash
modprobe.blacklist=amdgpu iommu=nopt 3
[ 0.294242] iommu: Default domain type: Translated (set via kernel
command line)
amd_iommu=force_isolation: failed at the 1st resume
BOOT_IMAGE=/boot/vmlinuz-5.13.0-custom+
root=UUID=030a18fe-22f0-49be-818f-192093d543b5 quiet splash
modprobe.blacklist=amdgpu amd_iommu=force_isolation 3
[ 0.294242] iommu: Default domain type: Translated
[ 49.513262] PM: suspend entry (deep)
[ 49.514404] Filesystems sync: 0.001 seconds
[ 49.514668] Freezing user space processes ...
[ 69.523111] Freezing of tasks failed after 20.008 seconds (2 tasks
refusing to freeze, wq_busy=0):
[ 69.523163] task:gnome-shell state:D stack: 0 pid: 2196 ppid:
2108 flags:0x00000004
[ 69.523172] Call Trace:
[ 69.523182] __schedule+0x2ee/0x900
[ 69.523193] ? __mod_memcg_lruvec_state+0x22/0xe0
[ 69.523204] schedule+0x4f/0xc0
[ 69.523214] drm_sched_entity_flush+0x17c/0x230 [gpu_sched]
[ 69.523225] ? wait_woken+0x80/0x80
[ 69.523233] amdgpu_ctx_mgr_entity_flush+0x97/0xf0 [amdgpu]
[ 69.523517] amdgpu_flush+0x2b/0x50 [amdgpu]
[ 69.523773] filp_close+0x37/0x70
[ 69.523780] do_close_on_exec+0xda/0x110
[ 69.523787] begin_new_exec+0x59d/0xa40
[ 69.523793] load_elf_binary+0x144/0x1720
[ 69.523801] ? __kernel_read+0x1a0/0x2d0
[ 69.523807] ? __kernel_read+0x1a0/0x2d0
[ 69.523812] ? aa_get_task_label+0x49/0xd0
[ 69.523820] bprm_execve+0x288/0x680
[ 69.523826] do_execveat_common.isra.0+0x189/0x1c0
[ 69.523831] __x64_sys_execve+0x37/0x50
[ 69.523836] do_syscall_64+0x40/0xb0
[ 69.523843] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 69.523851] RIP: 0033:0x7ff1244132fb
[ 69.523856] RSP: 002b:00007fff91a9f2b8 EFLAGS: 00000206 ORIG_RAX:
000000000000003b
[ 69.523862] RAX: ffffffffffffffda RBX: 00007ff11ee2e180 RCX:
00007ff1244132fb
[ 69.523866] RDX: 0000561199f5bc00 RSI: 000056119a1b0890 RDI:
0000561199f2021a
[ 69.523868] RBP: 000000000000001a R08: 00007fff91aa2a58 R09:
000000179a034e00
[ 69.523871] R10: 000056119a1b0890 R11: 0000000000000206 R12:
00007fff91aa2a60
[ 69.523874] R13: 000056119a1b0890 R14: 0000561199f2021a R15:
0000000000000001
[ 69.523882] task:gst-plugin-scan state:D stack: 0 pid: 2213 ppid:
2199 flags:0x00004004
[ 69.523888] Call Trace:
[ 69.523891] __schedule+0x2ee/0x900
[ 69.523897] schedule+0x4f/0xc0
[ 69.523902] drm_sched_entity_flush+0x17c/0x230 [gpu_sched]
[ 69.523912] ? wait_woken+0x80/0x80
[ 69.523918] drm_sched_entity_destroy+0x18/0x30 [gpu_sched]
[ 69.523928] amdgpu_vm_fini+0x256/0x3d0 [amdgpu]
[ 69.524210] amdgpu_driver_postclose_kms+0x179/0x240 [amdgpu]
[ 69.524444] drm_file_free.part.0+0x1e5/0x250 [drm]
[ 69.524481] ? dma_fence_release+0x140/0x140
[ 69.524489] drm_close_helper.isra.0+0x65/0x70 [drm]
[ 69.524524] drm_release+0x6e/0xf0 [drm]
[ 69.524559] __fput+0x9f/0x250
[ 69.524564] ____fput+0xe/0x10
[ 69.524569] task_work_run+0x70/0xb0
[ 69.524575] exit_to_user_mode_prepare+0x1c8/0x1d0
[ 69.524581] syscall_exit_to_user_mode+0x27/0x50
[ 69.524586] ? __x64_sys_close+0x12/0x40
[ 69.524589] do_syscall_64+0x4d/0xb0
[ 69.524594] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 69.524599] RIP: 0033:0x7f2c12adb9ab
[ 69.524602] RSP: 002b:00007fff981aaaa0 EFLAGS: 00000293 ORIG_RAX:
0000000000000003
[ 69.524606] RAX: 0000000000000000 RBX: 0000556b6f83f060 RCX:
00007f2c12adb9ab
[ 69.524608] RDX: 0000000000000014 RSI: 0000556b6f841400 RDI:
0000000000000006
[ 69.524611] RBP: 0000556b6f83f100 R08: 0000000000000000 R09:
000000000000000e
[ 69.524613] R10: 00007fff981db090 R11: 0000000000000293 R12:
0000556b6f841400
[ 69.524616] R13: 00007f2c12763e30 R14: 0000556b6f817330 R15:
00007f2c127420b4
>> Regards,
>> Felix
>>
>>
>>> Since we can't stop the HW/FW/SW right away after an IO page fault is
>>> detected, I can't tell which part tried to access system memory
>>> through the IOMMU.
>>>
>>> But after moving the IOMMU device init before the other amdgpu IP
>>> init/resume, the DM crash / IOMMU page fault issues are gone.
>>>
>>> These patches can't directly explain why the issue is fixed, but the
>>> new sequence makes more sense to me.
>>>
>>> Can I have your RB on these patches?
>>>
>>> Thanks!
>>> James
>>>
>>>> Regards,
>>>> Felix
>>>>
>>>>
>>>>> Signed-off-by: James Zhu <James.Zhu at amd.com>
>>>>> ---
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++++++++
>>>>> 1 file changed, 12 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index 653bd8f..e3f0308 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -2393,6 +2393,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
>>>>>  	if (r)
>>>>>  		goto init_failed;
>>>>>
>>>>> +	r = amdgpu_amdkfd_resume_iommu(adev);
>>>>> +	if (r)
>>>>> +		goto init_failed;
>>>>> +
>>>>>  	r = amdgpu_device_ip_hw_init_phase1(adev);
>>>>>  	if (r)
>>>>>  		goto init_failed;
>>>>> @@ -3147,6 +3151,10 @@ static int amdgpu_device_ip_resume(struct amdgpu_device *adev)
>>>>>  {
>>>>>  	int r;
>>>>>
>>>>> +	r = amdgpu_amdkfd_resume_iommu(adev);
>>>>> +	if (r)
>>>>> +		return r;
>>>>> +
>>>>>  	r = amdgpu_device_ip_resume_phase1(adev);
>>>>>  	if (r)
>>>>>  		return r;
>>>>> @@ -4602,6 +4610,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>>>>  			dev_warn(tmp_adev->dev, "asic atom init failed!");
>>>>>  		} else {
>>>>>  			dev_info(tmp_adev->dev, "GPU reset succeeded, trying to resume\n");
>>>>> +			r = amdgpu_amdkfd_resume_iommu(tmp_adev);
>>>>> +			if (r)
>>>>> +				goto out;
>>>>> +
>>>>>  			r = amdgpu_device_ip_resume_phase1(tmp_adev);
>>>>>  			if (r)
>>>>>  				goto out;
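
[JZ] For context: amdgpu_amdkfd_resume_iommu is the helper this patch
calls, presumably introduced earlier in the series. It is roughly the
following thin wrapper (a sketch; the exact form may differ):

    int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev)
    {
            int r = 0;

            /* Only touch the IOMMUv2 path when KFD probed this device */
            if (adev->kfd.dev)
                    r = kgd2kfd_resume_iommu(adev->kfd.dev);

            return r;
    }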