[RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

Lazar, Lijo lijo.lazar at amd.com
Wed Jan 5 13:26:40 UTC 2022



On 1/5/2022 6:45 PM, Christian König wrote:
> Am 05.01.22 um 14:11 schrieb Lazar, Lijo:
>> On 1/5/2022 6:01 PM, Christian König wrote:
>>> Am 05.01.22 um 10:54 schrieb Lazar, Lijo:
>>>> On 12/23/2021 3:35 AM, Andrey Grodzovsky wrote:
>>>>> Use the reset domain wq also for non-TDR GPU recovery triggers
>>>>> such as sysfs and RAS. We must serialize all possible
>>>>> GPU recoveries to guarantee no concurrency there.
>>>>> For TDR call the original recovery function directly since
>>>>> it's already executed from within the wq. For others just
>>>>> use a wrapper to queue work and wait on it to finish.
>>>>>
>>>>> v2: Rename to amdgpu_recover_work_struct
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 +++++++++++++++++++++-
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>>>>>   3 files changed, 35 insertions(+), 2 deletions(-)
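
[A note on the mechanism: the serialization below relies on the reset domain
workqueue being an ordered (single threaded) workqueue. A minimal sketch of
how such a wq would be allocated earlier in the series; the exact name and
init location are assumptions:

	/* assumed to run once at device init; an ordered wq executes at most
	 * one work item at a time, which is what serializes every recovery
	 * queued on it */
	adev->reset_domain.wq = alloc_ordered_workqueue("amdgpu-reset-dom", 0);
	if (!adev->reset_domain.wq)
		return -ENOMEM;
]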
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> index b5ff76aae7e0..8e96b9a14452 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> @@ -1296,6 +1296,8 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev);
>>>>>   bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
>>>>>   int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>                     struct amdgpu_job* job);
>>>>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>>>> +                  struct amdgpu_job *job);
>>>>>   void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
>>>>>   int amdgpu_device_pci_reset(struct amdgpu_device *adev);
>>>>>   bool amdgpu_device_need_post(struct amdgpu_device *adev);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index 7c063fd37389..258ec3c0b2af 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -4979,7 +4979,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>    * Returns 0 for success or an error on failure.
>>>>>    */
>>>>>   -int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>>>>                     struct amdgpu_job *job)
>>>>>   {
>>>>>       struct list_head device_list, *device_list_handle = NULL;
>>>>> @@ -5237,6 +5237,37 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>       return r;
>>>>>   }
>>>>>   +struct amdgpu_recover_work_struct {
>>>>> +    struct work_struct base;
>>>>> +    struct amdgpu_device *adev;
>>>>> +    struct amdgpu_job *job;
>>>>> +    int ret;
>>>>> +};
>>>>> +
>>>>> +static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
>>>>> +{
>>>>> +    struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
>>>>> +
>>>>> +    recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
>>>>> +}
>>>>> +/*
>>>>> + * Serialize gpu recover into reset domain single threaded wq
>>>>> + */
>>>>> +int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>> +                    struct amdgpu_job *job)
>>>>> +{
>>>>> +    struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};
>>>>> +
>>>>> +    INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
>>>>> +
>>>>> +    if (!queue_work(adev->reset_domain.wq, &work.base))
>>>>> +        return -EAGAIN;
>>>>> +
>>>>
>>>> The decision to schedule a reset is made at this point. Subsequent 
>>>> accesses to hardware may not be reliable. So should the flag 
>>>> in_reset be set here itself rather than waiting for the work to 
>>>> start execution?
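
[For illustration, the change Lijo suggests would look roughly like the
following in the wrapper; the flag and field names are taken from the
existing bare-metal path and are only assumptions here:

	/* hypothetical: mark the device as resetting already when the work
	 * is queued, instead of when amdgpu_device_gpu_recover_imp() runs */
	atomic_set(&adev->in_gpu_reset, 1);

	if (!queue_work(adev->reset_domain.wq, &work.base)) {
		atomic_set(&adev->in_gpu_reset, 0);
		return -EAGAIN;
	}
]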
>>>
>>> No, when we race and lose the VM is completely lost and probably 
>>> restarted by the hypervisor.
>>>
>>> And when we race and win we properly set the flag before signaling 
>>> the hypervisor that it can continue with the reset.
>>>
>>
>> I was talking about the bare-metal case. When this was synchronous, the 
>> in_reset flag was set as one of the first things and amdgpu_in_reset is 
>> checked to prevent further hardware accesses. This design only changes 
>> the recovery part and doesn't change anything from the hardware perspective. 
> 
>> Potential accesses from other processes need to be blocked as soon as 
>> we determine a reset is required.
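
[The guard referred to above is the usual amdgpu_in_reset() check that
bare-metal callers do before touching the hardware; roughly:

	/* typical guard in sysfs/pm paths on bare metal */
	if (amdgpu_in_reset(adev))
		return -EPERM;

	/* register access only happens while no reset is marked in flight */
	val = RREG32(reg);
]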
> 
> That's an incorrect assumption.
> 
> Accessing the hardware is perfectly ok as long as the reset hasn't 
> started yet. In other words even when the hardware is locked up you can 
> still happily read/write registers or access the VRAM BAR.
> 

Not sure if that is 100% correct for something like a recovery triggered by a 
RAS error (it depends on the access being done).

Thanks,
Lijo
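
[For reference, the RAS worker is one of the non-TDR triggers that the patch
routes through the new wrapper; roughly, based on the existing recovery
worker:

	/* amdgpu_ras_do_recovery() ends up here; with this patch the call
	 * is queued onto the reset domain wq and blocks until the recovery
	 * work has finished */
	if (amdgpu_device_should_recover_gpu(ras->adev))
		amdgpu_device_gpu_recover(ras->adev, NULL);
]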

> Only when the hardware is currently performing a reset, then we can't 
> touch it or there might be unfortunate consequences (usually complete 
> system lockup).
> 
> Regards,
> Christian.
> 
>> Are we expecting the work to be executed immediately and to set the flags?
>>
>> Thanks,
>> Lijo
>>
>>>> Also, what about having the reset_active or in_reset flag in the 
>>>> reset_domain itself?
>>>
>>> Off hand that sounds like a good idea.
>>>
>>> Regards,
>>> Christian.
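
[A sketch of what moving the flag into the reset domain could look like; the
struct layout beyond the wq member is an assumption:

	struct amdgpu_reset_domain {
		struct workqueue_struct *wq;
		/* hypothetical: per-domain reset flag as suggested above, so
		 * that every device sharing the domain (e.g. an XGMI hive)
		 * observes the same state */
		atomic_t in_gpu_reset;
	};
]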
>>>
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> +    flush_work(&work.base);
>>>>> +
>>>>> +    return work.ret;
>>>>> +}
>>>>> +
>>>>>   /**
>>>>>    * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
>>>>>    *
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> index bfc47bea23db..38c9fd7b7ad4 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> @@ -63,7 +63,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>>             ti.process_name, ti.tgid, ti.task_name, ti.pid);
>>>>>         if (amdgpu_device_should_recover_gpu(ring->adev)) {
>>>>> -        amdgpu_device_gpu_recover(ring->adev, job);
>>>>> +        amdgpu_device_gpu_recover_imp(ring->adev, job);
>>>>>       } else {
>>>>>           drm_sched_suspend_timeout(&ring->sched);
>>>>>           if (amdgpu_sriov_vf(adev))
>>>>>
>>>
> 

