[PATCH] drm/amdkfd: avoid recursive lock in migrations back to RAM
Felix Kuehling
felix.kuehling at amd.com
Tue Nov 2 14:18:50 UTC 2021
On 2021-11-01 at 10:40 p.m., Alex Sierra wrote:
> [Why]:
> When we call hmm_range_fault to map memory after a migration, we don't
> expect memory to be migrated again as a result of hmm_range_fault. The
> driver ensures that all memory is in GPU-accessible locations so that
> no migration should be needed. However, there is one corner case where
> hmm_range_fault can unexpectedly cause a migration from DEVICE_PRIVATE
> back to system memory due to a write-fault when a system memory page in
> the same range was mapped read-only (e.g. COW). Ranges with individual
> pages in different locations are usually the result of failed page
> migrations (e.g. page lock contention). The unexpected migration back
> to system memory causes a deadlock from recursive locking in our
> driver.
>
> [How]:
> Create a new task-reference member under the svm_range_list_init struct.
The _init is not part of the struct name. With that fixed, the patch is
Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>
> Set it to the "current" task reference right before hmm_range_fault is
> called. The svm_migrate_to_ram callback checks this member against
> "current"; if they match, the fault is recursive and the migration is
> ignored.
>
> Signed-off-by: Alex Sierra <alex.sierra at amd.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 4 ++++
> drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1 +
> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 ++
> 3 files changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> index bff40e8bca67..eb19f44ec86d 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> @@ -936,6 +936,10 @@ static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf)
> pr_debug("failed find process at fault address 0x%lx\n", addr);
> return VM_FAULT_SIGBUS;
> }
> + if (READ_ONCE(p->svms.faulting_task) == current) {
> + pr_debug("skipping ram migration\n");
> + return 0;
> + }
> addr >>= PAGE_SHIFT;
> pr_debug("CPU page fault svms 0x%p address 0x%lx\n", &p->svms, addr);
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index f88666bdf57c..7b41a58b1ade 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -858,6 +858,7 @@ struct svm_range_list {
> atomic_t evicted_ranges;
> struct delayed_work restore_work;
> DECLARE_BITMAP(bitmap_supported, MAX_GPU_INSTANCE);
> + struct task_struct *faulting_task;
> };
>
> /* Process data */
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 939c863315ba..4031c2a67af4 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -1492,9 +1492,11 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
>
> next = min(vma->vm_end, end);
> npages = (next - addr) >> PAGE_SHIFT;
> + WRITE_ONCE(p->svms.faulting_task, current);
> r = amdgpu_hmm_range_get_pages(&prange->notifier, mm, NULL,
> addr, npages, &hmm_range,
> readonly, true, owner);
> + WRITE_ONCE(p->svms.faulting_task, NULL);
> if (r) {
> pr_debug("failed %d to get svm range pages\n", r);
> goto unreserve_out;