[PATCH] drm/amdgpu: guard ib scheduling while in reset

Christian König ckoenig.leichtzumerken at gmail.com
Thu Oct 24 16:30:05 UTC 2019


Am 24.10.19 um 17:06 schrieb Grodzovsky, Andrey:
>
>
> On 10/24/19 7:01 AM, Christian König wrote:
>> Am 24.10.19 um 12:58 schrieb S, Shirish:
>>> [Why]
>>> Upon GPU reset, kernel cleans up already submitted jobs
>>> via drm_sched_cleanup_jobs.
>>> This schedules ib's via drm_sched_main()->run_job, leading to
>>> race condition of rings being ready or not, since during reset
>>> rings may be suspended.
>>
>> NAK, exactly that's what should not happen.
>>
>> The scheduler should be suspend while a GPU reset is in progress.
>>
>> So you are running into a completely different race here.
>>
>> Please sync up with Andrey how this was able to happen.
>>
>> Regards,
>> Christian.
>
>
> Shirish - Christian makes a good point - note that in 
> amdgpu_device_gpu_recover drm_sched_stop which stop all the scheduler 
> threads is called way before we suspend the HW in 
> amdgpu_device_pre_asic_reset->amdgpu_device_ip_suspend where SDMA 
> suspension is happening and where the HW ring marked as not ready - 
> please provide call stack for when you hit [drm:amdgpu_job_run] 
> *ERROR* Error scheduling IBs (-22) to identify the code path which 
> tried to submit the SDMA IB
>

Well the most likely cause of this is that the hardware failed to resume 
after the reset.

Christian.

> Andrey
>
>
>>
>>>
>>> [How]
>>> make GPU reset's amdgpu_device_ip_resume_phase2() &
>>> amdgpu_ib_schedule() in amdgpu_job_run() mutually exclusive.
>>>
>>> Signed-off-by: Shirish S <shirish.s at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 2 ++
>>>   3 files changed, 6 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index f4d9041..7b07a47b 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -973,6 +973,7 @@ struct amdgpu_device {
>>>       bool                            in_gpu_reset;
>>>       enum pp_mp1_state               mp1_state;
>>>       struct mutex  lock_reset;
>>> +    struct mutex  lock_ib_sched;
>>>       struct amdgpu_doorbell_index doorbell_index;
>>>         int asic_reset_res;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 676cad1..63cad74 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2759,6 +2759,7 @@ int amdgpu_device_init(struct amdgpu_device 
>>> *adev,
>>>       mutex_init(&adev->virt.vf_errors.lock);
>>>       hash_init(adev->mn_hash);
>>>       mutex_init(&adev->lock_reset);
>>> +    mutex_init(&adev->lock_ib_sched);
>>>       mutex_init(&adev->virt.dpm_mutex);
>>>       mutex_init(&adev->psp.mutex);
>>>   @@ -3795,7 +3796,9 @@ static int amdgpu_do_asic_reset(struct 
>>> amdgpu_hive_info *hive,
>>>                   if (r)
>>>                       return r;
>>>   + mutex_lock(&tmp_adev->lock_ib_sched);
>>>                   r = amdgpu_device_ip_resume_phase2(tmp_adev);
>>> + mutex_unlock(&tmp_adev->lock_ib_sched);
>>>                   if (r)
>>>                       goto out;
>>>   diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index e1bad99..cd6082d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -233,8 +233,10 @@ static struct dma_fence *amdgpu_job_run(struct 
>>> drm_sched_job *sched_job)
>>>       if (finished->error < 0) {
>>>           DRM_INFO("Skip scheduling IBs!\n");
>>>       } else {
>>> +        mutex_lock(&ring->adev->lock_ib_sched);
>>>           r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
>>>                          &fence);
>>> +        mutex_unlock(&ring->adev->lock_ib_sched);
>>>           if (r)
>>>               DRM_ERROR("Error scheduling IBs (%d)\n", r);
>>>       }
>>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20191024/7e16759d/attachment.html>


More information about the amd-gfx mailing list