<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p><br>
</p>
<div class="moz-cite-prefix">On 2021-04-29 2:10 a.m., Felix Kuehling
wrote:<br>
</div>
<blockquote type="cite" cite="mid:15106a30-9f85-f0ca-5e4c-b53c60c83474@amd.com">
<pre class="moz-quote-pre" wrap="">Am 2021-04-28 um 9:53 p.m. schrieb Philip Yang:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">If the migration vma is set up but the migration fails before the sdma
memory copy starts, e.g. because the process is killed, don't wait for
the sdma fence.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
I think you could describe this more generally as "Handle errors
returned by svm_migrate_copy_to_vram/ram".
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">
Signed-off-by: Philip Yang <a class="moz-txt-link-rfc2396E" href="mailto:Philip.Yang@amd.com"><Philip.Yang@amd.com></a>
---
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 6b810863f6ba..19b08247ba8a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -460,10 +460,12 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
}
if (migrate.cpages) {
- svm_migrate_copy_to_vram(adev, prange, &migrate, &mfence,
- scratch);
- migrate_vma_pages(&migrate);
- svm_migrate_copy_done(adev, mfence);
+ r = svm_migrate_copy_to_vram(adev, prange, &migrate, &mfence,
+ scratch);
+ if (!r) {
+ migrate_vma_pages(&migrate);
+ svm_migrate_copy_done(adev, mfence);
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
I think there are failure cases where svm_migrate_copy_to_vram
successfully copies some pages but fails somewhere in the middle. I
think in those cases you still want to call migrate_vma_pages and
svm_migrate_copy_done. If the copy never started for some reason, there
should be no mfence and svm_migrate_copy_done should be a no-op.
I probably don't understand the failure scenario you encountered. Can
you explain that in more detail?</pre>
</blockquote>
<p>I hit the backtrace below, but cannot reproduce it again; it
happened when using ctrl-c to kill the process while handling a GPU
retry fault. I will send a new patch to fix the WARNING. The
"amdgpu: qcm fence wait loop timeout expired" and the hang-issue
log are something else, not caused by svm_migrate_copy_done
waiting on the fence.<br>
</p>
<p>[ 58.822450] VRAM BO missing during validation<br>
[ 58.822488] WARNING: CPU: 3 PID: 2544 at
/home/yangp/git/compute_staging/kernel/drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:1376
svm_range_validate_and_map+0xeea/0xf30 [amdgpu]<br>
[ 58.822820] Modules linked in: xt_multiport iptable_filter
ip6table_filter ip6_tables fuse i2c_piix4 k10temp ip_tables
x_tables amdgpu iommu_v2 gpu_sched ast drm_vram_helper
drm_ttm_helper ttm<br>
[ 58.822902] CPU: 3 PID: 2544 Comm: kworker/3:2 Not tainted
5.11.0-kfd-yangp #1420<br>
[ 58.822912] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00,
BIOS F12 08/05/2019<br>
[ 58.822918] Workqueue: events amdgpu_irq_handle_ih1 [amdgpu]<br>
[ 58.823197] RIP: 0010:svm_range_validate_and_map+0xeea/0xf30
[amdgpu]<br>
[ 58.823504] Code: 8c b7 41 ec 41 be ea ff ff ff e9 20 fc ff ff
be 01 00 00 00 e8 57 27 3f ec e9 20 fe ff ff 48 c7 c7 40 7f 61 c0
e8 d6 54 d7 eb <0f> 0b 41 be ea ff ff ff e9 81 f3 ff ff 89
c2 48 c7 c6 c8 81 61 c0<br>
[ 58.823513] RSP: 0018:ffffb2f740677850 EFLAGS: 00010286<br>
[ 58.823524] RAX: 0000000000000000 RBX: ffff89a2902aa800 RCX:
0000000000000027<br>
[ 58.823531] RDX: 0000000000000000 RSI: ffff89a96cc980b0 RDI:
ffff89a96cc980b8<br>
[ 58.823536] RBP: ffff89a286f9f500 R08: 0000000000000001 R09:
0000000000000001<br>
[ 58.823542] R10: ffffb2f740677ab8 R11: ffffb2f740677660 R12:
0000000555558e00<br>
[ 58.823548] R13: ffff89a2902aaca0 R14: ffff89a289209000 R15:
ffff89a289209000<br>
[ 58.823554] FS: 0000000000000000(0000)
GS:ffff89a96cc80000(0000) knlGS:0000000000000000<br>
[ 58.823561] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033<br>
[ 58.823567] CR2: 00007ffff7d91000 CR3: 000000013930e000 CR4:
00000000003506e0<br>
[ 58.823573] Call Trace:<br>
[ 58.823587] ? __lock_acquire+0x351/0x1a70<br>
[ 58.823599] ? __lock_acquire+0x351/0x1a70<br>
[ 58.823614] ? __lock_acquire+0x351/0x1a70<br>
[ 58.823634] ? __lock_acquire+0x351/0x1a70<br>
[ 58.823641] ? __lock_acquire+0x351/0x1a70<br>
[ 58.823663] ? lock_acquire+0x242/0x390<br>
[ 58.823670] ? free_one_page+0x3c/0x4b0<br>
[ 58.823687] ? get_object+0x50/0x50<br>
[ 58.823708] ? mark_held_locks+0x49/0x70<br>
[ 58.823715] ? mark_held_locks+0x49/0x70<br>
[ 58.823725] ? lockdep_hardirqs_on_prepare+0xd4/0x170<br>
[ 58.823733] ? __free_pages_ok+0x360/0x480<br>
[ 58.823753] ? svm_migrate_ram_to_vram+0x30f/0xa40 [amdgpu]<br>
[ 58.824072] ? mark_held_locks+0x49/0x70<br>
[ 58.824096] svm_range_restore_pages+0x608/0x950 [amdgpu]<br>
[ 58.824410] amdgpu_vm_handle_fault+0xa9/0x3c0 [amdgpu]<br>
[ 58.824673] gmc_v9_0_process_interrupt+0xa8/0x410 [amdgpu]<br>
[ 58.824945] ? amdgpu_device_skip_hw_access+0x6b/0x70 [amdgpu]<br>
[ 58.825191] ? amdgpu_irq_dispatch+0xc2/0x250 [amdgpu]<br>
[ 58.825462] amdgpu_irq_dispatch+0xc2/0x250 [amdgpu]<br>
[ 58.825743] amdgpu_ih_process+0x7b/0xe0 [amdgpu]<br>
[ 58.826106] process_one_work+0x2a2/0x620<br>
[ 58.826146] ? process_one_work+0x620/0x620<br>
[ 58.826165] worker_thread+0x39/0x3f0<br>
[ 58.826188] ? process_one_work+0x620/0x620<br>
[ 58.826205] kthread+0x131/0x150<br>
[ 58.826223] ? kthread_park+0x90/0x90<br>
[ 58.826245] ret_from_fork+0x1f/0x30<br>
[ 58.826292] irq event stamp: 2358517<br>
[ 58.826301] hardirqs last enabled at (2358523):
[<ffffffffac100657>] console_unlock+0x487/0x580<br>
[ 58.826313] hardirqs last disabled at (2358528):
[<ffffffffac1005b3>] console_unlock+0x3e3/0x580<br>
[ 58.826326] softirqs last enabled at (2358470):
[<ffffffffad000306>] __do_softirq+0x306/0x429<br>
[ 58.826341] softirqs last disabled at (2358449):
[<fffffffface00f8f>] asm_call_irq_on_stack+0xf/0x20<br>
[ 58.826355] ---[ end trace ddec9ce1cb4ea7fc ]---<br>
[ 67.807478] amdgpu: qcm fence wait loop timeout expired<br>
[ 242.302930] INFO: task khugepaged:514 blocked for more than 120
seconds.<br>
[ 242.303237] Tainted: G W 5.11.0-kfd-yangp
#1420<br>
[ 242.303248] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.<br>
[ 242.303256] task:khugepaged state:D stack: 0 pid: 514
ppid: 2 flags:0x00004000<br>
[ 242.303270] Call Trace:<br>
[ 242.303281] __schedule+0x31a/0x9f0<br>
[ 242.303300] ? wait_for_completion+0x87/0x120<br>
[ 242.303310] schedule+0x51/0xc0<br>
[ 242.303318] schedule_timeout+0x193/0x360<br>
[ 242.303331] ? mark_held_locks+0x49/0x70<br>
[ 242.303339] ? mark_held_locks+0x49/0x70<br>
[ 242.303347] ? wait_for_completion+0x87/0x120<br>
[ 242.303354] ? lockdep_hardirqs_on_prepare+0xd4/0x170<br>
[ 242.303364] ? wait_for_completion+0x87/0x120<br>
[ 242.303372] wait_for_completion+0xba/0x120<br>
[ 242.303385] __flush_work+0x273/0x480<br>
[ 242.303398] ? flush_workqueue_prep_pwqs+0x140/0x140<br>
[ 242.303423] ? lru_add_drain+0x110/0x110<br>
[ 242.303434] lru_add_drain_all+0x172/0x1e0<br>
[ 242.303447] khugepaged+0x68/0x2d10<br>
[ 242.303481] ? wait_woken+0xa0/0xa0<br>
[ 242.303496] ? collapse_pte_mapped_thp+0x3f0/0x3f0<br>
[ 242.303503] kthread+0x131/0x150<br>
[ 242.303512] ? kthread_park+0x90/0x90<br>
[ 242.303523] ret_from_fork+0x1f/0x30<br>
[ 242.303665] <br>
Showing all locks held in the system:<br>
[ 242.303679] 1 lock held by khungtaskd/508:<br>
[ 242.303684] #0: ffffffffad94f220 (rcu_read_lock){....}-{1:2},
at: debug_show_all_locks+0xe/0x1b0<br>
[ 242.303713] 1 lock held by khugepaged/514:<br>
[ 242.303718] #0: ffffffffad977c08 (lock#5){+.+.}-{3:3}, at:
lru_add_drain_all+0x37/0x1e0<br>
[ 242.303756] 6 locks held by kworker/3:2/2544:<br>
[ 242.303764] 1 lock held by in:imklog/2733:<br>
[ 242.303769] #0: ffff89a2928e58f0
(&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x45/0x50<br>
[ 242.303838] 1 lock held by dmesg/4262:<br>
[ 242.303843] #0: ffff89a3079980d0
(&user->lock){+.+.}-{3:3}, at: devkmsg_read+0x4a/0x2d0<br>
<br>
[ 242.303875] =============================================<br>
<br>
[ 311.585542] loop0: detected capacity change from 8 to 0<br>
[ 363.135280] INFO: task khugepaged:514 blocked for more than 241
seconds.<br>
[ 363.135304] Tainted: G W 5.11.0-kfd-yangp
#1420<br>
[ 363.135313] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.<br>
[ 363.135321] task:khugepaged state:D stack: 0 pid: 514
ppid: 2 flags:0x00004000<br>
[ 363.135336] Call Trace:<br>
[ 363.135347] __schedule+0x31a/0x9f0<br>
[ 363.135365] ? wait_for_completion+0x87/0x120<br>
[ 363.135375] schedule+0x51/0xc0<br>
[ 363.135383] schedule_timeout+0x193/0x360<br>
[ 363.135395] ? mark_held_locks+0x49/0x70<br>
[ 363.135403] ? mark_held_locks+0x49/0x70<br>
[ 363.135412] ? wait_for_completion+0x87/0x120<br>
[ 363.135419] ? lockdep_hardirqs_on_prepare+0xd4/0x170<br>
[ 363.135428] ? wait_for_completion+0x87/0x120<br>
[ 363.135436] wait_for_completion+0xba/0x120<br>
[ 363.135448] __flush_work+0x273/0x480<br>
[ 363.135462] ? flush_workqueue_prep_pwqs+0x140/0x140<br>
[ 363.135486] ? lru_add_drain+0x110/0x110<br>
[ 363.135498] lru_add_drain_all+0x172/0x1e0<br>
[ 363.135511] khugepaged+0x68/0x2d10<br>
[ 363.135544] ? wait_woken+0xa0/0xa0<br>
[ 363.135558] ? collapse_pte_mapped_thp+0x3f0/0x3f0<br>
[ 363.135566] kthread+0x131/0x150<br>
[ 363.135575] ? kthread_park+0x90/0x90<br>
[ 363.135586] ret_from_fork+0x1f/0x30<br>
[ 363.135718] <br>
Showing all locks held in the system:<br>
[ 363.135731] 1 lock held by khungtaskd/508:<br>
[ 363.135737] #0: ffffffffad94f220 (rcu_read_lock){....}-{1:2},
at: debug_show_all_locks+0xe/0x1b0<br>
[ 363.135765] 1 lock held by khugepaged/514:<br>
[ 363.135771] #0: ffffffffad977c08 (lock#5){+.+.}-{3:3}, at:
lru_add_drain_all+0x37/0x1e0<br>
[ 363.135810] 5 locks held by kworker/3:2/2544:<br>
[ 363.135818] 1 lock held by in:imklog/2733:<br>
[ 363.135823] #0: ffff89a2928e58f0
(&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x45/0x50<br>
[ 363.135887] 1 lock held by dmesg/4262:<br>
[ 363.135892] #0: ffff89a3079980d0
(&user->lock){+.+.}-{3:3}, at: devkmsg_read+0x4a/0x2d0<br>
</p>
<blockquote type="cite" cite="mid:15106a30-9f85-f0ca-5e4c-b53c60c83474@amd.com">
<pre class="moz-quote-pre" wrap="">
Thanks,
Felix
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">+ }
migrate_vma_finalize(&migrate);
}
@@ -663,10 +665,12 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
pr_debug("cpages %ld\n", migrate.cpages);
if (migrate.cpages) {
- svm_migrate_copy_to_ram(adev, prange, &migrate, &mfence,
- scratch);
- migrate_vma_pages(&migrate);
- svm_migrate_copy_done(adev, mfence);
+ r = svm_migrate_copy_to_ram(adev, prange, &migrate, &mfence,
+ scratch);
+ if (!r) {
+ migrate_vma_pages(&migrate);
+ svm_migrate_copy_done(adev, mfence);
+ }
migrate_vma_finalize(&migrate);
} else {
pr_debug("failed collect migrate device pages [0x%lx 0x%lx]\n",
</pre>
</blockquote>
</blockquote>
</body>
</html>