[PATCH v3 3/8] drm/amdgpu: Block all job scheduling activity during DPC recovery

Andrey Grodzovsky Andrey.Grodzovsky at amd.com
Mon Aug 31 22:04:32 UTC 2020


On 8/31/20 5:00 PM, Luben Tuikov wrote:
> On 2020-08-31 11:50 a.m., Andrey Grodzovsky wrote:
>> DPC recovery involves ASIC reset just as normal GPU recovery so blosk
> Again, typo: "blosk" --> "blocks".
>
>> SW GPU schedulers and wait on all concurrent GPU resets.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>> Acked-by: Alex Deucher <alexander.deucher at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 +++++++++++++++++++++++++++---
>>   1 file changed, 53 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 43ce473..c569523 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4743,6 +4743,20 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
>>   	return 0;
>>   }
>>   
>> +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> +		struct amdgpu_ring *ring = adev->rings[i];
>> +
>> +		if (!ring || !ring->sched.thread)
>> +			continue;
>> +
>> +		cancel_delayed_work_sync(&ring->sched.work_tdr);
>> +	}
>> +}
>> +
>>   /**
>>    * amdgpu_pci_error_detected - Called when a PCI error is detected.
>>    * @pdev: PCI device struct
>> @@ -4756,15 +4770,37 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
>>   {
>>   	struct drm_device *dev = pci_get_drvdata(pdev);
>>   	struct amdgpu_device *adev = drm_to_adev(dev);
>> +	int i;
>>   
>>   	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
>>   
>>   	switch (state) {
>>   	case pci_channel_io_normal:
>>   		return PCI_ERS_RESULT_CAN_RECOVER;
>> -	case pci_channel_io_frozen:
>> -		/* Fatal error, prepare for slot reset */
>> -		amdgpu_device_lock_adev(adev);
>> +	/* Fatal error, prepare for slot reset */
>> +	case pci_channel_io_frozen:		
>> +		/*		
>> +		 * Cancel and wait for all TDRs in progress if failing to
>> +		 * set  adev->in_gpu_reset in amdgpu_device_lock_adev
>> +		 *
>> +		 * Locking adev->reset_sem will prevent any external access
>> +		 * to GPU during PCI error recovery
>> +		 */
>> +		while (!amdgpu_device_lock_adev(adev, NULL))
>> +			amdgpu_cancel_all_tdr(adev);
> As in v1, I don't see how this is protected from
> polling forever. "amdgpu_cancel_all_tdr()" defined above,
> doesn't actively cancel anything--it just waits.
>
> "amdgpu_device_lock_adev()" similarly only modifies
> a few variables.
>
> Neither of those two functions seem to escalate or otherwise
> reset the hardware. If the reset/hw blocks for whatever
> reason, this loop will hang forever.
>
> Regards,
> Luben


But a reset that blocks forever is not acceptable in any case - it's a bug we
must handle, so I don't think the code here needs to be robust against that
particular failure: if we are already in that scenario we will have plenty of
other problems and the driver will stop being usable anyway.

Your comment did make me look at possible starvation when trying to
amdgpu_device_lock_adev: if the HW is dead and jobs keep coming, new job
timeouts could keep firing, sneak in and grab adev->in_gpu_reset so that PCIe
error recovery never gets a chance to run. Looking at the code, though, that
is not the case - drm_sched_select_entity->drm_sched_ready returns NULL for
any scheduler whose HW ring is already full of unfinished jobs, so no new jobs
are pushed, no new timeouts are triggered, and the PCIe recovery code
eventually gets its opportunity to lock the device and run (rough sketch of
that check below).
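A minimal sketch of the scheduler-side check I'm referring to, paraphrased
from my reading of drm/scheduler - not the exact upstream source, just the
shape of the logic (field and helper names as I recall them from
drm_gpu_scheduler):

#include <drm/gpu_scheduler.h>

/* Sketch only: bodies simplified, run-queue selection elided. */
static bool drm_sched_ready(struct drm_gpu_scheduler *sched)
{
	/* Ready only while there is still room on the HW ring. */
	return atomic_read(&sched->hw_rq_count) < sched->hw_submission_limit;
}

static struct drm_sched_entity *
drm_sched_select_entity(struct drm_gpu_scheduler *sched)
{
	/*
	 * HW ring already full of unfinished jobs: select nothing, so the
	 * scheduler thread pushes no new job and arms no new timeout while
	 * PCIe recovery is waiting to take the lock.
	 */
	if (!drm_sched_ready(sched))
		return NULL;

	/* ... normal run-queue entity selection would go here ... */
	return NULL;
}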

Andrey


>
>> +
>> +		/*
>> +		 * Block any work scheduling as we do for regular GPU reset
>> +		 * for the duration of the recovery
>> +		 */
>> +		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> +			struct amdgpu_ring *ring = adev->rings[i];
>> +
>> +			if (!ring || !ring->sched.thread)
>> +				continue;
>> +
>> +			drm_sched_stop(&ring->sched, NULL);
>> +		}
>>   		return PCI_ERS_RESULT_NEED_RESET;
>>   	case pci_channel_io_perm_failure:
>>   		/* Permanent error, prepare for device removal */
>> @@ -4897,8 +4933,21 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>>   {
>>   	struct drm_device *dev = pci_get_drvdata(pdev);
>>   	struct amdgpu_device *adev = drm_to_adev(dev);
>> +	int i;
>>   
>> -	amdgpu_device_unlock_adev(adev);
>>   
>>   	DRM_INFO("PCI error: resume callback!!\n");
>> +
>> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> +		struct amdgpu_ring *ring = adev->rings[i];
>> +
>> +		if (!ring || !ring->sched.thread)
>> +			continue;
>> +
>> +
>> +		drm_sched_resubmit_jobs(&ring->sched);
>> +		drm_sched_start(&ring->sched, true);
>> +	}
>> +
>> +	amdgpu_device_unlock_adev(adev);
>>   }
>>

