<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <div class="moz-cite-prefix">On 2021-04-29 2:10 a.m., Felix Kuehling
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:15106a30-9f85-f0ca-5e4c-b53c60c83474@amd.com">
      <pre class="moz-quote-pre" wrap="">Am 2021-04-28 um 9:53 p.m. schrieb Philip Yang:

</pre>
      <blockquote type="cite">
        <pre class="moz-quote-pre" wrap="">If migration vma setup, but failed before start sdma memory copy, e.g.
process is killed, don't wait for sdma fence done.
</pre>
      </blockquote>
      <pre class="moz-quote-pre" wrap="">
I think you could describe this more generally as "Handle errors
returned by svm_migrate_copy_to_vram/ram".


</pre>
      <blockquote type="cite">
        <pre class="moz-quote-pre" wrap="">
Signed-off-by: Philip Yang <a class="moz-txt-link-rfc2396E" href="mailto:Philip.Yang@amd.com"><Philip.Yang@amd.com></a>
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 6b810863f6ba..19b08247ba8a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -460,10 +460,12 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
        }
 
        if (migrate.cpages) {
-               svm_migrate_copy_to_vram(adev, prange, &migrate, &mfence,
-                                        scratch);
-               migrate_vma_pages(&migrate);
-               svm_migrate_copy_done(adev, mfence);
+               r = svm_migrate_copy_to_vram(adev, prange, &migrate, &mfence,
+                                            scratch);
+               if (!r) {
+                       migrate_vma_pages(&migrate);
+                       svm_migrate_copy_done(adev, mfence);
</pre>
      </blockquote>
      <pre class="moz-quote-pre" wrap="">
I think there are failure cases where svm_migrate_copy_to_vram
successfully copies some pages but fails somewhere in the middle. I
think in those cases you still want to call migrate_vma_pages and
svm_migrate_copy_done. If the copy never started for some reason, there
should be no mfence and svm_migrate_copy_done should be a no-op.

I probably don't understand the failure scenario you encountered. Can
you explain that in more detail?</pre>
    </blockquote>
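    <p>If I follow, the flow you are suggesting would look roughly like
      this (sketch only, on the assumption that mfence stays NULL when no
      SDMA copy was submitted, so svm_migrate_copy_done is a no-op in
      that case):</p>
    <pre>	if (migrate.cpages) {
		r = svm_migrate_copy_to_vram(adev, prange, &migrate, &mfence,
					     scratch);
		/*
		 * Even on a partial failure, commit whatever pages were
		 * copied and wait for any fence that was emitted; if the
		 * copy never started, mfence is NULL and
		 * svm_migrate_copy_done() does nothing.
		 */
		migrate_vma_pages(&migrate);
		svm_migrate_copy_done(adev, mfence);
		migrate_vma_finalize(&migrate);
	}</pre>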
    <p>I hit the backtrace below but cannot reproduce it again; it
      happened when using ctrl-c to kill the process while a GPU retry
      fault was being handled. I will send a new patch to fix the
      WARNING. The "amdgpu: qcm fence wait loop timeout expired" message
      and the hang in the log are a separate issue, not caused by
      svm_migrate_copy_done waiting for the fence.<br>
    </p>
    <p>[   58.822450] VRAM BO missing during validation<br>
      [   58.822488] WARNING: CPU: 3 PID: 2544 at
/home/yangp/git/compute_staging/kernel/drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:1376
      svm_range_validate_and_map+0xeea/0xf30 [amdgpu]<br>
      [   58.822820] Modules linked in: xt_multiport iptable_filter
      ip6table_filter ip6_tables fuse i2c_piix4 k10temp ip_tables
      x_tables amdgpu iommu_v2 gpu_sched ast drm_vram_helper
      drm_ttm_helper ttm<br>
      [   58.822902] CPU: 3 PID: 2544 Comm: kworker/3:2 Not tainted
      5.11.0-kfd-yangp #1420<br>
      [   58.822912] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00,
      BIOS F12 08/05/2019<br>
      [   58.822918] Workqueue: events amdgpu_irq_handle_ih1 [amdgpu]<br>
      [   58.823197] RIP: 0010:svm_range_validate_and_map+0xeea/0xf30
      [amdgpu]<br>
      [   58.823504] Code: 8c b7 41 ec 41 be ea ff ff ff e9 20 fc ff ff
      be 01 00 00 00 e8 57 27 3f ec e9 20 fe ff ff 48 c7 c7 40 7f 61 c0
      e8 d6 54 d7 eb <0f> 0b 41 be ea ff ff ff e9 81 f3 ff ff 89
      c2 48 c7 c6 c8 81 61 c0<br>
      [   58.823513] RSP: 0018:ffffb2f740677850 EFLAGS: 00010286<br>
      [   58.823524] RAX: 0000000000000000 RBX: ffff89a2902aa800 RCX:
      0000000000000027<br>
      [   58.823531] RDX: 0000000000000000 RSI: ffff89a96cc980b0 RDI:
      ffff89a96cc980b8<br>
      [   58.823536] RBP: ffff89a286f9f500 R08: 0000000000000001 R09:
      0000000000000001<br>
      [   58.823542] R10: ffffb2f740677ab8 R11: ffffb2f740677660 R12:
      0000000555558e00<br>
      [   58.823548] R13: ffff89a2902aaca0 R14: ffff89a289209000 R15:
      ffff89a289209000<br>
      [   58.823554] FS:  0000000000000000(0000)
      GS:ffff89a96cc80000(0000) knlGS:0000000000000000<br>
      [   58.823561] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033<br>
      [   58.823567] CR2: 00007ffff7d91000 CR3: 000000013930e000 CR4:
      00000000003506e0<br>
      [   58.823573] Call Trace:<br>
      [   58.823587]  ? __lock_acquire+0x351/0x1a70<br>
      [   58.823599]  ? __lock_acquire+0x351/0x1a70<br>
      [   58.823614]  ? __lock_acquire+0x351/0x1a70<br>
      [   58.823634]  ? __lock_acquire+0x351/0x1a70<br>
      [   58.823641]  ? __lock_acquire+0x351/0x1a70<br>
      [   58.823663]  ? lock_acquire+0x242/0x390<br>
      [   58.823670]  ? free_one_page+0x3c/0x4b0<br>
      [   58.823687]  ? get_object+0x50/0x50<br>
      [   58.823708]  ? mark_held_locks+0x49/0x70<br>
      [   58.823715]  ? mark_held_locks+0x49/0x70<br>
      [   58.823725]  ? lockdep_hardirqs_on_prepare+0xd4/0x170<br>
      [   58.823733]  ? __free_pages_ok+0x360/0x480<br>
      [   58.823753]  ? svm_migrate_ram_to_vram+0x30f/0xa40 [amdgpu]<br>
      [   58.824072]  ? mark_held_locks+0x49/0x70<br>
      [   58.824096]  svm_range_restore_pages+0x608/0x950 [amdgpu]<br>
      [   58.824410]  amdgpu_vm_handle_fault+0xa9/0x3c0 [amdgpu]<br>
      [   58.824673]  gmc_v9_0_process_interrupt+0xa8/0x410 [amdgpu]<br>
      [   58.824945]  ? amdgpu_device_skip_hw_access+0x6b/0x70 [amdgpu]<br>
      [   58.825191]  ? amdgpu_irq_dispatch+0xc2/0x250 [amdgpu]<br>
      [   58.825462]  amdgpu_irq_dispatch+0xc2/0x250 [amdgpu]<br>
      [   58.825743]  amdgpu_ih_process+0x7b/0xe0 [amdgpu]<br>
      [   58.826106]  process_one_work+0x2a2/0x620<br>
      [   58.826146]  ? process_one_work+0x620/0x620<br>
      [   58.826165]  worker_thread+0x39/0x3f0<br>
      [   58.826188]  ? process_one_work+0x620/0x620<br>
      [   58.826205]  kthread+0x131/0x150<br>
      [   58.826223]  ? kthread_park+0x90/0x90<br>
      [   58.826245]  ret_from_fork+0x1f/0x30<br>
      [   58.826292] irq event stamp: 2358517<br>
      [   58.826301] hardirqs last  enabled at (2358523):
      [<ffffffffac100657>] console_unlock+0x487/0x580<br>
      [   58.826313] hardirqs last disabled at (2358528):
      [<ffffffffac1005b3>] console_unlock+0x3e3/0x580<br>
      [   58.826326] softirqs last  enabled at (2358470):
      [<ffffffffad000306>] __do_softirq+0x306/0x429<br>
      [   58.826341] softirqs last disabled at (2358449):
      [<fffffffface00f8f>] asm_call_irq_on_stack+0xf/0x20<br>
      [   58.826355] ---[ end trace ddec9ce1cb4ea7fc ]---<br>
      [   67.807478] amdgpu: qcm fence wait loop timeout expired<br>
      [  242.302930] INFO: task khugepaged:514 blocked for more than 120
      seconds.<br>
      [  242.303237]       Tainted: G        W         5.11.0-kfd-yangp
      #1420<br>
      [  242.303248] "echo 0 >
      /proc/sys/kernel/hung_task_timeout_secs" disables this message.<br>
      [  242.303256] task:khugepaged      state:D stack:    0 pid:  514
      ppid:     2 flags:0x00004000<br>
      [  242.303270] Call Trace:<br>
      [  242.303281]  __schedule+0x31a/0x9f0<br>
      [  242.303300]  ? wait_for_completion+0x87/0x120<br>
      [  242.303310]  schedule+0x51/0xc0<br>
      [  242.303318]  schedule_timeout+0x193/0x360<br>
      [  242.303331]  ? mark_held_locks+0x49/0x70<br>
      [  242.303339]  ? mark_held_locks+0x49/0x70<br>
      [  242.303347]  ? wait_for_completion+0x87/0x120<br>
      [  242.303354]  ? lockdep_hardirqs_on_prepare+0xd4/0x170<br>
      [  242.303364]  ? wait_for_completion+0x87/0x120<br>
      [  242.303372]  wait_for_completion+0xba/0x120<br>
      [  242.303385]  __flush_work+0x273/0x480<br>
      [  242.303398]  ? flush_workqueue_prep_pwqs+0x140/0x140<br>
      [  242.303423]  ? lru_add_drain+0x110/0x110<br>
      [  242.303434]  lru_add_drain_all+0x172/0x1e0<br>
      [  242.303447]  khugepaged+0x68/0x2d10<br>
      [  242.303481]  ? wait_woken+0xa0/0xa0<br>
      [  242.303496]  ? collapse_pte_mapped_thp+0x3f0/0x3f0<br>
      [  242.303503]  kthread+0x131/0x150<br>
      [  242.303512]  ? kthread_park+0x90/0x90<br>
      [  242.303523]  ret_from_fork+0x1f/0x30<br>
      [  242.303665] <br>
                     Showing all locks held in the system:<br>
      [  242.303679] 1 lock held by khungtaskd/508:<br>
      [  242.303684]  #0: ffffffffad94f220 (rcu_read_lock){....}-{1:2},
      at: debug_show_all_locks+0xe/0x1b0<br>
      [  242.303713] 1 lock held by khugepaged/514:<br>
      [  242.303718]  #0: ffffffffad977c08 (lock#5){+.+.}-{3:3}, at:
      lru_add_drain_all+0x37/0x1e0<br>
      [  242.303756] 6 locks held by kworker/3:2/2544:<br>
      [  242.303764] 1 lock held by in:imklog/2733:<br>
      [  242.303769]  #0: ffff89a2928e58f0
      (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x45/0x50<br>
      [  242.303838] 1 lock held by dmesg/4262:<br>
      [  242.303843]  #0: ffff89a3079980d0
      (&user->lock){+.+.}-{3:3}, at: devkmsg_read+0x4a/0x2d0<br>
      <br>
      [  242.303875] =============================================<br>
      <br>
      [  311.585542] loop0: detected capacity change from 8 to 0<br>
      [  363.135280] INFO: task khugepaged:514 blocked for more than 241
      seconds.<br>
      [  363.135304]       Tainted: G        W         5.11.0-kfd-yangp
      #1420<br>
      [  363.135313] "echo 0 >
      /proc/sys/kernel/hung_task_timeout_secs" disables this message.<br>
      [  363.135321] task:khugepaged      state:D stack:    0 pid:  514
      ppid:     2 flags:0x00004000<br>
      [  363.135336] Call Trace:<br>
      [  363.135347]  __schedule+0x31a/0x9f0<br>
      [  363.135365]  ? wait_for_completion+0x87/0x120<br>
      [  363.135375]  schedule+0x51/0xc0<br>
      [  363.135383]  schedule_timeout+0x193/0x360<br>
      [  363.135395]  ? mark_held_locks+0x49/0x70<br>
      [  363.135403]  ? mark_held_locks+0x49/0x70<br>
      [  363.135412]  ? wait_for_completion+0x87/0x120<br>
      [  363.135419]  ? lockdep_hardirqs_on_prepare+0xd4/0x170<br>
      [  363.135428]  ? wait_for_completion+0x87/0x120<br>
      [  363.135436]  wait_for_completion+0xba/0x120<br>
      [  363.135448]  __flush_work+0x273/0x480<br>
      [  363.135462]  ? flush_workqueue_prep_pwqs+0x140/0x140<br>
      [  363.135486]  ? lru_add_drain+0x110/0x110<br>
      [  363.135498]  lru_add_drain_all+0x172/0x1e0<br>
      [  363.135511]  khugepaged+0x68/0x2d10<br>
      [  363.135544]  ? wait_woken+0xa0/0xa0<br>
      [  363.135558]  ? collapse_pte_mapped_thp+0x3f0/0x3f0<br>
      [  363.135566]  kthread+0x131/0x150<br>
      [  363.135575]  ? kthread_park+0x90/0x90<br>
      [  363.135586]  ret_from_fork+0x1f/0x30<br>
      [  363.135718] <br>
                     Showing all locks held in the system:<br>
      [  363.135731] 1 lock held by khungtaskd/508:<br>
      [  363.135737]  #0: ffffffffad94f220 (rcu_read_lock){....}-{1:2},
      at: debug_show_all_locks+0xe/0x1b0<br>
      [  363.135765] 1 lock held by khugepaged/514:<br>
      [  363.135771]  #0: ffffffffad977c08 (lock#5){+.+.}-{3:3}, at:
      lru_add_drain_all+0x37/0x1e0<br>
      [  363.135810] 5 locks held by kworker/3:2/2544:<br>
      [  363.135818] 1 lock held by in:imklog/2733:<br>
      [  363.135823]  #0: ffff89a2928e58f0
      (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x45/0x50<br>
      [  363.135887] 1 lock held by dmesg/4262:<br>
      [  363.135892]  #0: ffff89a3079980d0
      (&user->lock){+.+.}-{3:3}, at: devkmsg_read+0x4a/0x2d0<br>
    </p>
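    <p>For reference, the fence wait itself should be harmless when no
      copy was submitted; as far as I can tell, svm_migrate_copy_done
      only waits if a fence was actually emitted (sketch from
      kfd_migrate.c, may not match this tree exactly):</p>
    <pre>static int
svm_migrate_copy_done(struct amdgpu_device *adev, struct dma_fence *mfence)
{
	int r = 0;

	if (mfence) {
		r = dma_fence_wait(mfence, false);
		dma_fence_put(mfence);
		pr_debug("sdma copy memory fence done\n");
	}

	return r;
}</pre>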
    <blockquote type="cite" cite="mid:15106a30-9f85-f0ca-5e4c-b53c60c83474@amd.com">
      <pre class="moz-quote-pre" wrap="">

Thanks,
  Felix


</pre>
      <blockquote type="cite">
        <pre class="moz-quote-pre" wrap="">+          }
                migrate_vma_finalize(&migrate);
        }
 
@@ -663,10 +665,12 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
        pr_debug("cpages %ld\n", migrate.cpages);
 
        if (migrate.cpages) {
-               svm_migrate_copy_to_ram(adev, prange, &migrate, &mfence,
-                                       scratch);
-               migrate_vma_pages(&migrate);
-               svm_migrate_copy_done(adev, mfence);
+               r = svm_migrate_copy_to_ram(adev, prange, &migrate, &mfence,
+                                           scratch);
+               if (!r) {
+                       migrate_vma_pages(&migrate);
+                       svm_migrate_copy_done(adev, mfence);
+               }
                migrate_vma_finalize(&migrate);
        } else {
                pr_debug("failed collect migrate device pages [0x%lx 0x%lx]\n",
</pre>
      </blockquote>
    </blockquote>
  </body>
</html>