[PATCH 3/3] drm/amdgpu: Refresh rq selection for job after ASIC reset

Fri Oct 19 07:08:51 UTC 2018

Am 18.10.18 um 20:44 schrieb Andrey Grodzovsky:
> A ring might become unusable after reset, if that the case
> drm_sched_entity_select_rq will choose another, working rq
> to run the job if there is one.
> Also, skip recovery of ring which is not ready.

Well that is not even remotely sufficient.

If we can't bring up a ring any more after a reset we would need to move 
all jobs which where previously scheduled on it and all of its entities 
to a different instance.

What you do here breaks dependencies between jobs and can result in 
unforeseen consequences including random memory writes.

As far as I can see that can't be done correctly with the current 
scheduler design.

Additional to that when we can't restart one instance of a ring we 
usually can't restart all others either. So the whole approach is rather 
pointless.

Regards,
Christian.

>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++++-
>   1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index d11489e..3124ca1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3355,10 +3355,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   	else
>   		r = amdgpu_device_reset(adev);
>   
> +	/*
> +	 * After reboot a ring might fail in which case this will
> +	 * move the job to different rq if possible
> +	 */
> +	if (job) {
> +		drm_sched_entity_select_rq(job->base.entity);
> +		if (job->base.entity->rq) {
> +			job->base.sched = job->base.entity->rq->sched;
> +		} else {
> +			job->base.sched = NULL;
> +			r = -ENOENT;
> +		}
> +	}
> +
>   	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>   		struct amdgpu_ring *ring = adev->rings[i];
>   
> -		if (!ring || !ring->sched.thread)
> +		if (!ring || !ring->ready || !ring->sched.thread)
>   			continue;
>   
>   		/* only need recovery sched of the given job's ring