[PATCH 1/4] drm/amdgpu:don't invoke srio-gpu-reset in gpu-reset

Mon May 8 09:19:50 UTC 2017

> You know that gpu reset under SR-IOV will have very big impact on all other VFs ...
Mhm, good argument. But in this case we need to give at least some 
warning message instead of doing nothing.

Or even better disable creating the amdgpu_reste debugfs file 
altogether. This way nobody will wonder why using it doesn't trigger 
anything.

Christian.

Am 08.05.2017 um 11:10 schrieb Liu, Monk:
> For SR-IOV use case, we call gpu reset under the case we have no choice ...
>
> So many places like debug fs shouldn't a good reason to trigger gpu reset
>
> You know that gpu reset under SR-IOV will have very big impact on all other VFs ...
>
> BR Monk
>
> -----Original Message-----
> From: Christian König [mailto:deathsimple at vodafone.de]
> Sent: Monday, May 08, 2017 5:08 PM
> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH 1/4] drm/amdgpu:don't invoke srio-gpu-reset in gpu-reset
>
> Am 08.05.2017 um 08:51 schrieb Monk Liu:
>> because we don't want to do sriov-gpu-reset under certain cases, so
>> just split those two funtion and don't invoke sr-iov one from
>> bare-metal one.
>>
>> Change-Id: I641126c241e2ee2dfd54e6d16c389b159f99cfe0
>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 3 ++-
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c    | 2 +-
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    | 3 ++-
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 6 +++++-
>>    5 files changed, 10 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 45a60a6..4985a7e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2652,9 +2652,6 @@ int amdgpu_gpu_reset(struct amdgpu_device *adev)
>>    	int resched;
>>    	bool need_full_reset;
>>    
>> -	if (amdgpu_sriov_vf(adev))
>> -		return amdgpu_sriov_gpu_reset(adev, true);
>> -
>>    	if (!amdgpu_check_soft_reset(adev)) {
>>    		DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
>>    		return 0;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index 5772ef2..d7523d1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -651,7 +651,8 @@ static int amdgpu_debugfs_gpu_reset(struct seq_file *m, void *data)
>>    	struct amdgpu_device *adev = dev->dev_private;
>>    
>>    	seq_printf(m, "gpu reset\n");
>> -	amdgpu_gpu_reset(adev);
>> +	if (!amdgpu_sriov_vf(adev))
>> +		amdgpu_gpu_reset(adev);
> Well that is clearly not a good idea. Why do you want to disable the reset here?
>
>>    
>>    	return 0;
>>    }
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> index 67be795..5bcbea0 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> @@ -221,7 +221,7 @@ void amdgpu_gem_object_close(struct drm_gem_object
>> *obj,
>>    
>>    static int amdgpu_gem_handle_lockup(struct amdgpu_device *adev, int r)
>>    {
>> -	if (r == -EDEADLK) {
>> +	if (r == -EDEADLK && !amdgpu_sriov_vf(adev)) {
> Not a problem of your patch, but that stuff is outdated and should have been removed completely years ago. Going to take care of that.
>
>>    		r = amdgpu_gpu_reset(adev);
>>    		if (!r)
>>    			r = -EAGAIN;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>> index f8a6c95..49c6e6e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>> @@ -89,7 +89,8 @@ static void amdgpu_irq_reset_work_func(struct work_struct *work)
>>    	struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
>>    						  reset_work);
>>    
>> -	amdgpu_gpu_reset(adev);
>> +	if (!amdgpu_sriov_vf(adev))
>> +		amdgpu_gpu_reset(adev);
> Mhm, that disables the reset on an invalid register access or invalid command stream. Is that really what we want?
>
> Christian.
>
>>    }
>>    
>>    /* Disable *all* interrupts */
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 690ef3d..c7718af 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -36,7 +36,11 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>>    		  job->base.sched->name,
>>    		  atomic_read(&job->ring->fence_drv.last_seq),
>>    		  job->ring->fence_drv.sync_seq);
>> -	amdgpu_gpu_reset(job->adev);
>> +
>> +	if (amdgpu_sriov_vf(job->adev))
>> +		amdgpu_sriov_gpu_reset(job->adev, true);
>> +	else
>> +		amdgpu_gpu_reset(job->adev);
>>    }
>>    
>>    int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx