[PATCH 3/3] drm/amdgpu: Refresh rq selection for job after ASIC reset

Fri Oct 19 15:06:36 UTC 2018

On 10/19/2018 03:08 AM, Koenig, Christian wrote:
> Am 18.10.18 um 20:44 schrieb Andrey Grodzovsky:
>> A ring might become unusable after reset, if that the case
>> drm_sched_entity_select_rq will choose another, working rq
>> to run the job if there is one.
>> Also, skip recovery of ring which is not ready.
> Well that is not even remotely sufficient.
>
> If we can't bring up a ring any more after a reset we would need to move
> all jobs which where previously scheduled on it and all of its entities
> to a different instance.

That one should be easy to add inside amdgpu_device_gpu_recover in case
ring is dead, we just do the same for all the jobs in 
sched->ring_mirror_list of the dead ring
before doing recovery for them, no?
>
> What you do here breaks dependencies between jobs and can result in
> unforeseen consequences including random memory writes.

Can you explain this a bit more ? AFAIK any job dependent on this job 
will still wait for it's completion
before running regardless of did this job moved to a different ring. 
What am I missing ?
>
> As far as I can see that can't be done correctly with the current
> scheduler design.
>
> Additional to that when we can't restart one instance of a ring we
> usually can't restart all others either. So the whole approach is rather
> pointless.

 From my testing looks like we can, compute ring 0 is dead but IB tests 
pass on other compute rings.

Andrey

>
> Regards,
> Christian.
>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++++-
>>    1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index d11489e..3124ca1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3355,10 +3355,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    	else
>>    		r = amdgpu_device_reset(adev);
>>    
>> +	/*
>> +	 * After reboot a ring might fail in which case this will
>> +	 * move the job to different rq if possible
>> +	 */
>> +	if (job) {
>> +		drm_sched_entity_select_rq(job->base.entity);
>> +		if (job->base.entity->rq) {
>> +			job->base.sched = job->base.entity->rq->sched;
>> +		} else {
>> +			job->base.sched = NULL;
>> +			r = -ENOENT;
>> +		}
>> +	}
>> +
>>    	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>    		struct amdgpu_ring *ring = adev->rings[i];
>>    
>> -		if (!ring || !ring->sched.thread)
>> +		if (!ring || !ring->ready || !ring->sched.thread)
>>    			continue;
>>    
>>    		/* only need recovery sched of the given job's ring
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx