[RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
Lazar, Lijo
lijo.lazar at amd.com
Wed Jan 5 13:26:40 UTC 2022
On 1/5/2022 6:45 PM, Christian König wrote:
> Am 05.01.22 um 14:11 schrieb Lazar, Lijo:
>> On 1/5/2022 6:01 PM, Christian König wrote:
>>> Am 05.01.22 um 10:54 schrieb Lazar, Lijo:
>>>> On 12/23/2021 3:35 AM, Andrey Grodzovsky wrote:
>>>>> Use reset domain wq also for non TDR gpu recovery triggers
>>>>> such as sysfs and RAS. We must serialize all possible
>>>>> GPU recoveries to guarantee no concurrency there.
>>>>> For TDR call the original recovery function directly since
>>>>> it's already executed from within the wq. For others just
>>>>> use a wrapper to queue work and wait on it to finish.
>>>>>
>>>>> v2: Rename to amdgpu_recover_work_struct
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>> ---
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 +++++++++++++++++++++-
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
>>>>> 3 files changed, 35 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> index b5ff76aae7e0..8e96b9a14452 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> @@ -1296,6 +1296,8 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev);
>>>>> bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
>>>>> int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>> struct amdgpu_job* job);
>>>>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>>>> + struct amdgpu_job *job);
>>>>> void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
>>>>> int amdgpu_device_pci_reset(struct amdgpu_device *adev);
>>>>> bool amdgpu_device_need_post(struct amdgpu_device *adev);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index 7c063fd37389..258ec3c0b2af 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -4979,7 +4979,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>> * Returns 0 for success or an error on failure.
>>>>> */
>>>>> -int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>>>> struct amdgpu_job *job)
>>>>> {
>>>>> struct list_head device_list, *device_list_handle = NULL;
>>>>> @@ -5237,6 +5237,37 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>> return r;
>>>>> }
>>>>> +struct amdgpu_recover_work_struct {
>>>>> + struct work_struct base;
>>>>> + struct amdgpu_device *adev;
>>>>> + struct amdgpu_job *job;
>>>>> + int ret;
>>>>> +};
>>>>> +
>>>>> +static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
>>>>> +{
>>>>> + struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
>>>>> +
>>>>> + recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
>>>>> +}
>>>>> +/*
>>>>> + * Serialize gpu recover into reset domain single threaded wq
>>>>> + */
>>>>> +int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>> + struct amdgpu_job *job)
>>>>> +{
>>>>> + struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};
>>>>> +
>>>>> + INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
>>>>> +
>>>>> + if (!queue_work(adev->reset_domain.wq, &work.base))
>>>>> + return -EAGAIN;
>>>>> +
>>>>
>>>> The decision to schedule a reset is made at this point. Subsequent
>>>> accesses to hardware may not be reliable. So should the in_reset
>>>> flag be set right here, rather than waiting for the work to start
>>>> executing?
>>>
>>> No, when we race and lose the VM is completely lost and probably
>>> restarted by the hypervisor.
>>>
>>> And when we race and win we properly set the flag before signaling
>>> the hypervisor that it can continue with the reset.
>>>
>>
>> I was talking about the bare metal case. When this was synchronous, the
>> in_reset flag was set as one of the first things, and amdgpu_in_reset is
>> checked to prevent further hardware accesses. This design only changes
>> the recovery path and doesn't change anything from the hardware's
>> perspective.
>
>> Potential accesses from other processes need to be blocked as soon as
>> we determine a reset is required.
>
> That's an incorrect assumption.
>
> Accessing the hardware is perfectly ok as long as the reset hasn't
> started yet. In other words even when the hardware is locked up you can
> still happily read/write registers or access the VRAM BAR.
>
Not sure if that is 100% correct, e.g. for a recovery triggered by a RAS
error (it depends on the access being done).
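
To make my earlier suggestion concrete, something along these lines is
roughly what I had in mind. This is a rough, untested sketch: the
reset_active member and the amdgpu_reset_domain_in_reset() helper are
made-up names, and clearing the flag at the end of
amdgpu_device_gpu_recover_imp() is not shown.

struct amdgpu_reset_domain {
	struct workqueue_struct *wq;
	atomic_t reset_active;	/* non-zero once a reset has been decided */
};

static inline bool amdgpu_reset_domain_in_reset(struct amdgpu_reset_domain *domain)
{
	return atomic_read(&domain->reset_active) != 0;
}

int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
			      struct amdgpu_job *job)
{
	struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};

	INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);

	/*
	 * Mark the domain as resetting *before* queueing the work, so
	 * that other processes stop touching the hardware while the
	 * recovery is still sitting in the single threaded wq.
	 */
	atomic_set(&adev->reset_domain.reset_active, 1);

	if (!queue_work(adev->reset_domain.wq, &work.base))
		return -EAGAIN;

	flush_work(&work.base);

	return work.ret;
}

amdgpu_in_reset() could then just look at the domain flag, so the
existing checks around hardware access would keep working unchanged.
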
Thanks,
Lijo
> Only when the hardware is currently performing a reset, then we can't
> touch it or there might be unfortunate consequences (usually complete
> system lockup).
>
> Regards,
> Christian.
>
>> Are we expecting the work to be executed immediately and set the flag?
>>
>> Thanks,
>> Lijo
>>
>>>> Also, what about having the reset_active or in_reset flag in the
>>>> reset_domain itself?
>>>
>>> Off hand that sounds like a good idea.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> + flush_work(&work.base);
>>>>> +
>>>>> + return work.ret;
>>>>> +}
>>>>> +
>>>>> /**
>>>>> * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
>>>>> *
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> index bfc47bea23db..38c9fd7b7ad4 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> @@ -63,7 +63,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>> ti.process_name, ti.tgid, ti.task_name, ti.pid);
>>>>> if (amdgpu_device_should_recover_gpu(ring->adev)) {
>>>>> - amdgpu_device_gpu_recover(ring->adev, job);
>>>>> + amdgpu_device_gpu_recover_imp(ring->adev, job);
>>>>> } else {
>>>>> drm_sched_suspend_timeout(&ring->sched);
>>>>> if (amdgpu_sriov_vf(adev))
>>>>>
>>>
>