<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <div class="moz-cite-prefix">On 2021-09-07 1:53 p.m., Felix Kuehling
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:ac2bef7a-d70c-8ede-bcca-d8e27a8fdcb6@amd.com">
      <pre class="moz-quote-pre" wrap="">Am 2021-09-07 um 1:51 p.m. schrieb Felix Kuehling:
</pre>
      <blockquote type="cite">
        <pre class="moz-quote-pre" wrap="">Am 2021-09-07 um 1:22 p.m. schrieb James Zhu:
</pre>
        <blockquote type="cite">
          <pre class="moz-quote-pre" wrap="">On 2021-09-07 12:48 p.m., Felix Kuehling wrote:
</pre>
          <blockquote type="cite">
            <pre class="moz-quote-pre" wrap="">Am 2021-09-07 um 12:07 p.m. schrieb James Zhu:
</pre>
            <blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Separate iommu_resume from kfd_resume, and move it before the
other amdgpu IP init/resume steps.

Fixed Bugzilla: <a class="moz-txt-link-freetext" href="https://bugzilla.kernel.org/show_bug.cgi?id=211277">https://bugzilla.kernel.org/show_bug.cgi?id=211277</a>
</pre>
            </blockquote>
            <pre class="moz-quote-pre" wrap="">I think the change is OK. But I don't understand how the IOMMUv2
initialization sequence could affect a crash in DM. The display should
not depend on IOMMUv2 at all. What am I missing?
</pre>
          </blockquote>
<pre class="moz-quote-pre" wrap="">[JZ] It is a weird issue. Disabling the VCN IP block, disabling the
gpu_off feature, or setting pci=noats can each fix the DM crash.

The issue also occurs quite randomly: sometimes after a few
suspend/resume cycles, sometimes after a few hundred S/R cycles.
The maximum I saw was 2422 S/R cycles.

But every time DM crashes, I can see one or two IOMMU errors ahead of it:

*AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=****
flags=0x0070]*
</pre>
        </blockquote>
<pre class="moz-quote-pre" wrap="">This error is not from IOMMUv2 doing GVA-to-GPA translations. It's from
IOMMUv1 doing GPA-to-SPA translation. This error points to an invalid
physical (GPA) address being used by the GPU to access random system
memory it shouldn't be accessing (because there is no valid DMA mapping).

On AMD systems, IOMMUv1 tends to be in pass-through mode when IOMMUv2 is
enabled. It's possible that the earlier initialization of IOMMUv2 hides
the problem by putting the IOMMU into pass-through mode. I don't think
this patch series is a valid solution.</pre>
      </blockquote>
    </blockquote>
    <p>[JZ] Good to know, thanks! So amd_iommu_init_device is for v2
      only.</p>
    <p>And it should be safe to call amd_iommu_init_device after the
      amdgpu IP init/resume without any interference.<br>
    </p>
    <blockquote type="cite" cite="mid:ac2bef7a-d70c-8ede-bcca-d8e27a8fdcb6@amd.com">
      <blockquote type="cite">
        <pre class="moz-quote-pre" wrap="">You can probably fix the problem with this kernel boot parameter: iommu=pt</pre>
      </blockquote>
    </blockquote>
    <p>[JZ] Still not working after applying <b>iommu=pt</b>.</p>
    <p>BOOT_IMAGE=/boot/vmlinuz-5.8.0-41-generic
      root=UUID=030a18fe-22f0-49be-818f-192093d543b5 quiet splash
      modprobe.blacklist=amdgpu <b>iommu=pt</b> 3
<br>
      [    0.612117] iommu: Default domain type: <b>Passthrough</b>
      (set via kernel command line)
<br>
      [  354.067871] amdgpu 0000:04:00.0: AMD-Vi: Event logged [<b>IO_PAGE_FAULT</b>
      domain=0x0000 address=0x32de00040 flags=0x0070]
<br>
      [  354.067884] amdgpu 0000:04:00.0: AMD-Vi: Event logged
      [IO_PAGE_FAULT domain=0x0000 address=0x32de40000 flags=0x0070]</p>
    <blockquote type="cite" cite="mid:ac2bef7a-d70c-8ede-bcca-d8e27a8fdcb6@amd.com">
      <blockquote type="cite">
        <pre class="moz-quote-pre" wrap="">
And you can probably reproduce it even with this patch series if instead
you add: iommu=nopt amd_iommu=force_isolation</pre>
      </blockquote>
    </blockquote>
    <p>[JZ] I could not set both <b>iommu=nopt</b> and
      <b>amd_iommu=force_isolation</b> together. (Does that mean
      something?)<br>
    </p>
    <p>BOOT_IMAGE=/boot/vmlinuz-5.13.0-custom+
      root=UUID=030a18fe-22f0-49be-818f-192093d543b5 quiet splash
      modprobe.blacklist=amdgpu<b> iommu=nopt amd_iommu=force_isolation</b>
      3
<br>
      [    0.294242] iommu: Default domain type: Translated (set via
      kernel command line)
<br>
      [    0.350675] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4
      counters/bank).
<br>
      [  106.403927] amdgpu 0000:04:00.0: amdgpu:
      amdgpu_device_ip_resume failed (-6).
<br>
      [  106.403931] PM: dpm_run_callback(): pci_pm_resume+0x0/0x90
      returns -6
<br>
      [  106.403941] amdgpu 0000:04:00.0: PM: failed to resume async:
      error -6</p>
    <p><b>iommu=nopt</b>: Passed at least 200 S/R cycles<br>
    </p>
    <p>BOOT_IMAGE=/boot/vmlinuz-5.13.0-custom+
      root=UUID=030a18fe-22f0-49be-818f-192093d543b5 quiet splash
      modprobe.blacklist=amdgpu   <b>iommu=nopt</b> 3
<br>
      [    0.294242] iommu: Default domain type: Translated (set via
      kernel command line)
<br>
    </p>
    <p><b>amd_iommu=force_isolation</b>: failed at 1st resume<br>
    </p>
    <p>BOOT_IMAGE=/boot/vmlinuz-5.13.0-custom+
      root=UUID=030a18fe-22f0-49be-818f-192093d543b5 quiet splash
      modprobe.blacklist=amdgpu <b>amd_iommu=force_isolation</b>   3
<br>
      [    0.294242] iommu: Default domain type: Translated <br>
    </p>
    <p>[   49.513262] PM: suspend entry (deep)<br>
      [   49.514404] Filesystems sync: 0.001 seconds<br>
      [   49.514668] Freezing user space processes ... <br>
      [   69.523111] Freezing of tasks failed after 20.008 seconds (2
      tasks refusing to freeze, wq_busy=0):<br>
      [   69.523163] task:gnome-shell     state:D stack:    0 pid: 2196
      ppid:  2108 flags:0x00000004<br>
      [   69.523172] Call Trace:<br>
      [   69.523182]  __schedule+0x2ee/0x900<br>
      [   69.523193]  ? __mod_memcg_lruvec_state+0x22/0xe0<br>
      [   69.523204]  schedule+0x4f/0xc0<br>
      [   69.523214]  drm_sched_entity_flush+0x17c/0x230 [gpu_sched]<br>
      [   69.523225]  ? wait_woken+0x80/0x80<br>
      [   69.523233]  amdgpu_ctx_mgr_entity_flush+0x97/0xf0 [amdgpu]<br>
      [   69.523517]  amdgpu_flush+0x2b/0x50 [amdgpu]<br>
      [   69.523773]  filp_close+0x37/0x70<br>
      [   69.523780]  do_close_on_exec+0xda/0x110<br>
      [   69.523787]  begin_new_exec+0x59d/0xa40<br>
      [   69.523793]  load_elf_binary+0x144/0x1720<br>
      [   69.523801]  ? __kernel_read+0x1a0/0x2d0<br>
      [   69.523807]  ? __kernel_read+0x1a0/0x2d0<br>
      [   69.523812]  ? aa_get_task_label+0x49/0xd0<br>
      [   69.523820]  bprm_execve+0x288/0x680<br>
      [   69.523826]  do_execveat_common.isra.0+0x189/0x1c0<br>
      [   69.523831]  __x64_sys_execve+0x37/0x50<br>
      [   69.523836]  do_syscall_64+0x40/0xb0<br>
      [   69.523843]  entry_SYSCALL_64_after_hwframe+0x44/0xae<br>
      [   69.523851] RIP: 0033:0x7ff1244132fb<br>
      [   69.523856] RSP: 002b:00007fff91a9f2b8 EFLAGS: 00000206
      ORIG_RAX: 000000000000003b<br>
      [   69.523862] RAX: ffffffffffffffda RBX: 00007ff11ee2e180 RCX:
      00007ff1244132fb<br>
      [   69.523866] RDX: 0000561199f5bc00 RSI: 000056119a1b0890 RDI:
      0000561199f2021a<br>
      [   69.523868] RBP: 000000000000001a R08: 00007fff91aa2a58 R09:
      000000179a034e00<br>
      [   69.523871] R10: 000056119a1b0890 R11: 0000000000000206 R12:
      00007fff91aa2a60<br>
      [   69.523874] R13: 000056119a1b0890 R14: 0000561199f2021a R15:
      0000000000000001<br>
      [   69.523882] task:gst-plugin-scan state:D stack:    0 pid: 2213
      ppid:  2199 flags:0x00004004<br>
      [   69.523888] Call Trace:<br>
      [   69.523891]  __schedule+0x2ee/0x900<br>
      [   69.523897]  schedule+0x4f/0xc0<br>
      [   69.523902]  drm_sched_entity_flush+0x17c/0x230 [gpu_sched]<br>
      [   69.523912]  ? wait_woken+0x80/0x80<br>
      [   69.523918]  drm_sched_entity_destroy+0x18/0x30 [gpu_sched]<br>
      [   69.523928]  amdgpu_vm_fini+0x256/0x3d0 [amdgpu]<br>
      [   69.524210]  amdgpu_driver_postclose_kms+0x179/0x240 [amdgpu]<br>
      [   69.524444]  drm_file_free.part.0+0x1e5/0x250 [drm]<br>
      [   69.524481]  ? dma_fence_release+0x140/0x140<br>
      [   69.524489]  drm_close_helper.isra.0+0x65/0x70 [drm]<br>
      [   69.524524]  drm_release+0x6e/0xf0 [drm]<br>
      [   69.524559]  __fput+0x9f/0x250<br>
      [   69.524564]  ____fput+0xe/0x10<br>
      [   69.524569]  task_work_run+0x70/0xb0<br>
      [   69.524575]  exit_to_user_mode_prepare+0x1c8/0x1d0<br>
      [   69.524581]  syscall_exit_to_user_mode+0x27/0x50<br>
      [   69.524586]  ? __x64_sys_close+0x12/0x40<br>
      [   69.524589]  do_syscall_64+0x4d/0xb0<br>
      [   69.524594]  entry_SYSCALL_64_after_hwframe+0x44/0xae<br>
      [   69.524599] RIP: 0033:0x7f2c12adb9ab<br>
      [   69.524602] RSP: 002b:00007fff981aaaa0 EFLAGS: 00000293
      ORIG_RAX: 0000000000000003<br>
      [   69.524606] RAX: 0000000000000000 RBX: 0000556b6f83f060 RCX:
      00007f2c12adb9ab<br>
      [   69.524608] RDX: 0000000000000014 RSI: 0000556b6f841400 RDI:
      0000000000000006<br>
      [   69.524611] RBP: 0000556b6f83f100 R08: 0000000000000000 R09:
      000000000000000e<br>
      [   69.524613] R10: 00007fff981db090 R11: 0000000000000293 R12:
      0000556b6f841400<br>
      [   69.524616] R13: 00007f2c12763e30 R14: 0000556b6f817330 R15:
      00007f2c127420b4<br>
    </p>
    <blockquote type="cite" cite="mid:ac2bef7a-d70c-8ede-bcca-d8e27a8fdcb6@amd.com">
      <blockquote type="cite">
        <pre class="moz-quote-pre" wrap="">
Regards,
  Felix


</pre>
        <blockquote type="cite">
          <pre class="moz-quote-pre" wrap="">Since we can't stop the HW/FW/SW right away after an IO page fault is
detected, I can't tell which part is trying to access system memory
through the IOMMU.

But after moving the IOMMU device init before the other amdgpu IP
init/resume, the DM crash / IOMMU page fault issues are gone.

Those patches can't directly explain why the issue is fixed, but this
new sequence makes more sense to me.

Can I have your RB on those patches?

Thanks!
James
</pre>
          <blockquote type="cite">
            <pre class="moz-quote-pre" wrap="">Regards,
  Felix


</pre>
            <blockquote type="cite">
              <pre class="moz-quote-pre" wrap="">Signed-off-by: James Zhu <a class="moz-txt-link-rfc2396E" href="mailto:James.Zhu@amd.com"><James.Zhu@amd.com></a>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 653bd8f..e3f0308 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2393,6 +2393,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
        if (r)
                goto init_failed;
 
+       r = amdgpu_amdkfd_resume_iommu(adev);
+       if (r)
+               goto init_failed;
+
        r = amdgpu_device_ip_hw_init_phase1(adev);
        if (r)
                goto init_failed;
@@ -3147,6 +3151,10 @@ static int amdgpu_device_ip_resume(struct amdgpu_device *adev)
 {
        int r;
 
+       r = amdgpu_amdkfd_resume_iommu(adev);
+       if (r)
+               return r;
+
        r = amdgpu_device_ip_resume_phase1(adev);
        if (r)
                return r;
@@ -4602,6 +4610,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
                                dev_warn(tmp_adev->dev, "asic atom init failed!");
                        } else {
                                dev_info(tmp_adev->dev, "GPU reset succeeded, trying to resume\n");
+                               r = amdgpu_amdkfd_resume_iommu(tmp_adev);
+                               if (r)
+                                       goto out;
+
                                r = amdgpu_device_ip_resume_phase1(tmp_adev);
                                if (r)
                                        goto out;
</pre>
            </blockquote>
          </blockquote>
        </blockquote>
      </blockquote>
    </blockquote>
  </body>
</html>