[PATCH] drm/amdkfd: Wait vm update fence after retry fault recovered
Felix Kuehling
felix.kuehling at amd.com
Thu Sep 28 21:50:25 UTC 2023
On 2023-09-22 17:37, Philip Yang wrote:
> Otherwise the kfd TLB flush does nothing if the vm update fence callback
> doesn't update vm->tlb_seq, and the hardware will generate the retry
> fault again.
>
> This works today only because retry faults keep coming: the recovery path
> will update the page table again after the AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING
> timeout and then flush the TLB.
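For context, the gating the commit message refers to boils down to a
per-device "last flushed" sequence number compared against vm->tlb_seq,
which the vm update fence callback advances. A minimal sketch of that idea
(the helper and the last_flushed_tlb_seq field are illustrative names, not
the exact kfd/amdgpu code):

static void flush_tlb_if_needed(struct kfd_process_device *pdd,
				struct amdgpu_vm *vm)
{
	uint64_t seq = amdgpu_vm_tlb_seq(vm); /* bumped by the fence callback */

	/* If the vm update fence callback has not run yet, seq is unchanged
	 * and the flush silently becomes a no-op, so the stale translation
	 * stays in the TLB and the hardware raises the retry fault again.
	 */
	if (pdd->last_flushed_tlb_seq == seq)
		return;

	pdd->last_flushed_tlb_seq = seq;
	/* ... issue the actual TLB invalidation for this GPU ... */
}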
I think I'm OK with this change. But as I understand it, this is really
part of another patch series that depends on this fix. It's not needed
with the way we currently handle retry faults. Am I misunderstanding it?
This is not an optimal solution, but I think it's only meant to be
temporary. I think we want to get to a solution that allows us to
schedule TLB flushes asynchronously using the fences. For now, the
impact is limited to small-BAR GPUs that use SDMA for page table
updates, so I'm OK with that.
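One possible shape for that asynchronous scheduling (a sketch with
hypothetical wrapper names, not code from this series; only the dma_fence
callback API and the existing kfd_flush_tlb() helper are assumed) would be
to let the vm update fence queue a worker that performs the flush once the
page-table update has actually signalled:

#include <linux/dma-fence.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct svm_tlb_flush_work {
	struct work_struct work;
	struct kfd_process_device *pdd;	/* device whose TLB needs flushing */
	struct dma_fence_cb cb;		/* callback hooked to the vm update fence */
};

static void svm_tlb_flush_worker(struct work_struct *work)
{
	struct svm_tlb_flush_work *f = container_of(work, typeof(*f), work);

	/* The page-table update has signalled, so vm->tlb_seq is up to date
	 * and this flush will not be short-circuited.
	 */
	kfd_flush_tlb(f->pdd, TLB_FLUSH_LEGACY);
	kfree(f);
}

static void svm_tlb_flush_fence_cb(struct dma_fence *fence,
				   struct dma_fence_cb *cb)
{
	struct svm_tlb_flush_work *f = container_of(cb, typeof(*f), cb);

	/* Fence callbacks run in signalling context, so only queue work here. */
	schedule_work(&f->work);
}

static int svm_schedule_tlb_flush(struct kfd_process_device *pdd,
				  struct dma_fence *fence)
{
	struct svm_tlb_flush_work *f = kzalloc(sizeof(*f), GFP_KERNEL);

	if (!f)
		return -ENOMEM;

	f->pdd = pdd;
	INIT_WORK(&f->work, svm_tlb_flush_worker);

	/* Returns non-zero if the fence already signalled; flush right away. */
	if (dma_fence_add_callback(fence, &f->cb, svm_tlb_flush_fence_cb)) {
		kfree(f);
		kfd_flush_tlb(pdd, TLB_FLUSH_LEGACY);
	}
	return 0;
}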
Regards,
Felix
> Remove the wait parameter from svm_range_validate_and_map because it is
> always called with wait set to true.
>
> Signed-off-by: Philip Yang <Philip.Yang at amd.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 15 +++++++--------
> 1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 70aa882636ab..61f4de1633a8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -1447,7 +1447,7 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct svm_range *prange,
> static int
> svm_range_map_to_gpus(struct svm_range *prange, unsigned long offset,
> unsigned long npages, bool readonly,
> - unsigned long *bitmap, bool wait, bool flush_tlb)
> + unsigned long *bitmap, bool flush_tlb)
> {
> struct kfd_process_device *pdd;
> struct amdgpu_device *bo_adev = NULL;
> @@ -1480,8 +1480,7 @@ svm_range_map_to_gpus(struct svm_range *prange, unsigned long offset,
>
> r = svm_range_map_to_gpu(pdd, prange, offset, npages, readonly,
> prange->dma_addr[gpuidx],
> - bo_adev, wait ? &fence : NULL,
> - flush_tlb);
> + bo_adev, &fence, flush_tlb);
> if (r)
> break;
>
> @@ -1605,7 +1604,7 @@ static void *kfd_svm_page_owner(struct kfd_process *p, int32_t gpuidx)
> */
> static int svm_range_validate_and_map(struct mm_struct *mm,
> struct svm_range *prange, int32_t gpuidx,
> - bool intr, bool wait, bool flush_tlb)
> + bool intr, bool flush_tlb)
> {
> struct svm_validate_context *ctx;
> unsigned long start, end, addr;
> @@ -1729,7 +1728,7 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
>
> if (!r)
> r = svm_range_map_to_gpus(prange, offset, npages, readonly,
> - ctx->bitmap, wait, flush_tlb);
> + ctx->bitmap, flush_tlb);
>
> if (!r && next == end)
> prange->mapped_to_gpu = true;
> @@ -1823,7 +1822,7 @@ static void svm_range_restore_work(struct work_struct *work)
> mutex_lock(&prange->migrate_mutex);
>
> r = svm_range_validate_and_map(mm, prange, MAX_GPU_INSTANCE,
> - false, true, false);
> + false, false);
> if (r)
> pr_debug("failed %d to map 0x%lx to gpus\n", r,
> prange->start);
> @@ -3064,7 +3063,7 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
> }
> }
>
> - r = svm_range_validate_and_map(mm, prange, gpuidx, false, false, false);
> + r = svm_range_validate_and_map(mm, prange, gpuidx, false, false);
> if (r)
> pr_debug("failed %d to map svms 0x%p [0x%lx 0x%lx] to gpus\n",
> r, svms, prange->start, prange->last);
> @@ -3603,7 +3602,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
> flush_tlb = !migrated && update_mapping && prange->mapped_to_gpu;
>
> r = svm_range_validate_and_map(mm, prange, MAX_GPU_INSTANCE,
> - true, true, flush_tlb);
> + true, flush_tlb);
> if (r)
> pr_debug("failed %d to map svm range\n", r);
>