[PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

Tue Apr 19 16:18:33 UTC 2022

On 2022-04-19 at 12:01, Andrey Grodzovsky wrote:
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>> @@ -134,6 +134,7 @@ struct amdkfd_process_info {
>>>> /* MMU-notifier related fields */
>>>> atomic_t evicted_bos;
>>>> +atomic_t invalid;
>>>> struct delayed_work restore_userptr_work;
>>>> struct pid *pid;
>>>>  };
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> index 99d2b15bcbf3..2a588eb9f456 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> @@ -1325,6 +1325,7 @@ static int init_kfd_vm(struct amdgpu_vm *vm, void **process_info,
>>>> info->pid = get_task_pid(current->group_leader, PIDTYPE_PID);
>>>> atomic_set(&info->evicted_bos, 0);
>>>> +atomic_set(&info->invalid, 0);
>>>> INIT_DELAYED_WORK(&info->restore_userptr_work,
>>>>  amdgpu_amdkfd_restore_userptr_worker);
>>>> @@ -2693,6 +2694,9 @@ static void amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
>>>> struct mm_struct *mm;
>>>> int evicted_bos;
>>>> +if (atomic_read(&process_info->invalid))
>>>> +return;
>>>> +
>>>
>>>
>>> It is probably better to again use the drm_dev_enter/exit guard pair
>>> instead of this flag.
>>>
>>>
>>
>> I don’t know if I could use drm_dev_enter/exit efficiently, because a 
>> process can have multiple drm_devs open, and I don’t know how to look 
>> up the relevant drm_dev(s) efficiently in the worker function in order 
>> to call drm_dev_enter/exit.
>
>
> I think that within the KFD code each kfd device belongs to, or points 
> to, one specific drm_device, so I don't think this is a problem.
>
Sorry, I haven't been following this discussion in all its details. But 
I don't see why you need to check a flag in the worker. If the GPU is 
unplugged you already cancel any pending work. How is new work getting 
scheduled after the GPU is unplugged? Is it due to pending interrupts or 
something? Can you instead invalidate process_info->restore_userptr_work 
to prevent it from being scheduled again? Or add some check where it's 
scheduling the work, instead of in the worker.
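
To illustrate the last suggestion, here is a rough userspace sketch of checking at the scheduling site rather than in the worker. It is not kernel code and not taken from the patch: struct proc_info, schedule_restore() and handle_unplug() are hypothetical names, and a plain counter stands in for the delayed work. It is single-threaded, so it ignores the ordering and locking that the real driver would need between the check and the queueing.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-process state; not the real amdkfd_process_info. */
struct proc_info {
    atomic_bool unplugged; /* set once when the device goes away */
    atomic_int pending;    /* queued restore jobs (stand-in for delayed work) */
};

/* Scheduling site: refuse to queue new work once the device is gone. */
static bool schedule_restore(struct proc_info *info)
{
    if (atomic_load(&info->unplugged))
        return false;
    atomic_fetch_add(&info->pending, 1);
    return true;
}

/* Unplug path: mark the state first, then drop whatever is already queued. */
static void handle_unplug(struct proc_info *info)
{
    atomic_store(&info->unplugged, true);
    atomic_store(&info->pending, 0);
}
```

With this shape the worker itself needs no flag check: anything queued before the unplug is cancelled by the unplug path, and nothing can be queued afterwards.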

Regards,
   Felix