[PATCH] drm/amdgpu: guard ib scheduling while in reset
Christian König
ckoenig.leichtzumerken at gmail.com
Thu Oct 24 16:30:05 UTC 2019
On 24.10.19 at 17:06, Grodzovsky, Andrey wrote:
>
>
> On 10/24/19 7:01 AM, Christian König wrote:
>> On 24.10.19 at 12:58, S, Shirish wrote:
>>> [Why]
>>> Upon GPU reset, the kernel cleans up already submitted jobs
>>> via drm_sched_cleanup_jobs.
>>> This schedules IBs via drm_sched_main()->run_job, leading to a
>>> race on whether the rings are ready, since during reset the
>>> rings may be suspended.
>>
>> NAK, that's exactly what should not happen.
>>
>> The scheduler should be suspended while a GPU reset is in progress.
>>
>> So you are running into a completely different race here.
>>
>> Please sync up with Andrey how this was able to happen.
>>
>> Regards,
>> Christian.
>
>
> Shirish - Christian makes a good point. Note that in
> amdgpu_device_gpu_recover, drm_sched_stop, which stops all the scheduler
> threads, is called well before we suspend the HW in
> amdgpu_device_pre_asic_reset->amdgpu_device_ip_suspend, where the SDMA
> suspension happens and where the HW ring is marked as not ready.
> Please provide the call stack for when you hit [drm:amdgpu_job_run]
> *ERROR* Error scheduling IBs (-22), to identify the code path which
> tried to submit the SDMA IB.
>
Well, the most likely cause of this is that the hardware failed to resume
after the reset.
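
To make the two scenarios concrete, here is a rough userspace sketch, not driver code. All names (model_ring, model_run_job and so on) are hypothetical. It assumes that a ring which is not ready rejects a submission with -EINVAL, which would match the -22 in the log, and that the scheduler is restarted even when the HW resume failed. Both assumptions should be checked against the actual code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model, not the real amdgpu structures. */
struct model_ring {
	bool sched_running; /* scheduler thread may pick up jobs */
	bool ready;         /* HW ring accepts submissions */
};

#define MODEL_NOT_RUN 1 /* scheduler parked, nothing submitted */

/* Stand-in for IB submission. Assumption: a ring that is not ready
 * is rejected with -EINVAL, which would match the -22 in the log. */
static int model_ib_schedule(const struct model_ring *ring)
{
	return ring->ready ? 0 : -EINVAL;
}

/* Stand-in for the scheduler thread trying to run one job. */
static int model_run_job(const struct model_ring *ring)
{
	if (!ring->sched_running)
		return MODEL_NOT_RUN;
	return model_ib_schedule(ring);
}

/* First half of recovery, in the order Andrey describes. */
static void model_reset_begin(struct model_ring *ring)
{
	ring->sched_running = false; /* scheduler stopped first */
	ring->ready = false;         /* then HW suspended, ring not ready */
}

/* Second half. Assumption: the scheduler is restarted even when
 * the HW resume failed. */
static void model_reset_end(struct model_ring *ring, bool resume_ok)
{
	ring->ready = resume_ok;
	ring->sched_running = true;
}

static void model_walkthrough(void)
{
	struct model_ring ring = { .sched_running = true, .ready = true };

	model_reset_begin(&ring);
	/* During the reset no job can reach the ring. */
	assert(model_run_job(&ring) == MODEL_NOT_RUN);

	model_reset_end(&ring, false);
	/* After a failed resume the first job fails with -EINVAL. */
	assert(model_run_job(&ring) == -EINVAL);
}
```

In this model an extra mutex around the resume step would not change the outcome: the failing submission happens after the reset has finished, on a ring that never became ready again.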
Christian.
> Andrey
>
>
>>
>>>
>>> [How]
>>> Make amdgpu_device_ip_resume_phase2() in the GPU reset path and
>>> amdgpu_ib_schedule() in amdgpu_job_run() mutually exclusive.
>>>
>>> Signed-off-by: Shirish S <shirish.s at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 ++
>>> 3 files changed, 6 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index f4d9041..7b07a47b 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -973,6 +973,7 @@ struct amdgpu_device {
>>> bool in_gpu_reset;
>>> enum pp_mp1_state mp1_state;
>>> struct mutex lock_reset;
>>> + struct mutex lock_ib_sched;
>>> struct amdgpu_doorbell_index doorbell_index;
>>> int asic_reset_res;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 676cad1..63cad74 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2759,6 +2759,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>> mutex_init(&adev->virt.vf_errors.lock);
>>> hash_init(adev->mn_hash);
>>> mutex_init(&adev->lock_reset);
>>> + mutex_init(&adev->lock_ib_sched);
>>> mutex_init(&adev->virt.dpm_mutex);
>>> mutex_init(&adev->psp.mutex);
>>> @@ -3795,7 +3796,9 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
>>> if (r)
>>> return r;
>>> + mutex_lock(&tmp_adev->lock_ib_sched);
>>> r = amdgpu_device_ip_resume_phase2(tmp_adev);
>>> + mutex_unlock(&tmp_adev->lock_ib_sched);
>>> if (r)
>>> goto out;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index e1bad99..cd6082d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -233,8 +233,10 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>>> if (finished->error < 0) {
>>> DRM_INFO("Skip scheduling IBs!\n");
>>> } else {
>>> + mutex_lock(&ring->adev->lock_ib_sched);
>>> r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
>>> &fence);
>>> + mutex_unlock(&ring->adev->lock_ib_sched);
>>> if (r)
>>> DRM_ERROR("Error scheduling IBs (%d)\n", r);
>>> }
>>
>