[PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

Christian König christian.koenig at amd.com
Thu Sep 15 11:29:57 UTC 2022


Hi Victor,

the advanced tdr mode is not a feature we want to enable in production, 
and not one we plan to keep around for much longer.

So would that issue happen without it as well? If not, it is rather 
questionable whether we should look into fixing this in the first place.

Regards,
Christian.

Am 15.09.22 um 12:09 schrieb Zhao, Victor:
> [AMD Official Use Only - General]
>
> Hi Christian,
>
> The test sequence is triggering a compute engine hang while running a lot of containers submitting gfx jobs. We have advanced tdr mode and mode2 reset enabled in the driver.
> When a compute hang job timeout happens, the 2 jobs on the gfx pending list may be signaled after drm_sched_stop. So they will not be removed from the pending list but will have the DMA_FENCE_FLAG_SIGNALED_BIT set.
> At the amdgpu_device_recheck_guilty_jobs step, the first job will be rerun and removed from the pending list.
> At the resubmit step, the second job (with the signaled bit) will be resubmitted. Since it still has the signaled bit, drm_sched_job_done will be called directly. This decreases hw_rq_count, which allows more jobs to be emitted, but does not clear the fence_drv rcu ptr.
> This results in an overflow in the fence_drv. Since we use num_fences_mask in amdgpu_fence_process, when the overflow happens the signal of some jobs will be skipped, which results in an infinite wait for the fence_drv rcu ptr.
>
> So closing the irq before sched_stop avoids jobs signaling after drm_sched_stop. And signaling jobs one by one in fence_process instead of using a mask handles the overflow situation.
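
To make the mask overflow concrete, here is a minimal user-space sketch (not driver code; NUM_FENCES, FENCES_MASK and the two walk functions below are hypothetical and only mirror the logic in amdgpu_fence_process). It compares the old behaviour, which masks both sequence numbers before walking the fence slots, with a walk over the full sequence range that masks only when indexing a slot:

#include <stdio.h>

#define NUM_FENCES  8u                      /* power-of-two slot count, as in the driver */
#define FENCES_MASK (NUM_FENCES - 1u)

/* Old behaviour: mask first, then walk until the masked values meet. */
static unsigned int walk_masked(unsigned int last_seq, unsigned int seq)
{
	unsigned int handled = 0;

	last_seq &= FENCES_MASK;
	seq &= FENCES_MASK;
	do {
		++last_seq;
		last_seq &= FENCES_MASK;
		++handled;                  /* "signal" slot last_seq */
	} while (last_seq != seq);

	return handled;
}

/* Proposed behaviour: walk the full range, mask only to pick the slot. */
static unsigned int walk_full(unsigned int last_seq, unsigned int seq)
{
	unsigned int handled = 0;

	do {
		++last_seq;
		++handled;                  /* "signal" slot last_seq & FENCES_MASK */
	} while (last_seq != seq);

	return handled;
}

int main(void)
{
	/* 10 fences completed at once, but only 8 slots: the overflow case. */
	unsigned int last_seq = 4, seq = 14;

	printf("masked walk handled %u of %u fences\n",
	       walk_masked(last_seq, seq), seq - last_seq);
	printf("full walk handled   %u of %u fences\n",
	       walk_full(last_seq, seq), seq - last_seq);
	return 0;
}

With these numbers the masked walk reports handling only 2 of the 10 completed fences, while the full walk handles all 10; the skipped slots are exactly the ones whose fence_drv rcu pointers stay set and stall the next fence emission.
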
>
> Another fix could be to skip submitting jobs which have already signaled during the resubmit stage, which may look cleaner.
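
A rough sketch of that alternative, assuming the resubmit path walks the scheduler's pending list (illustrative only, not a drop-in patch; locking and the actual resubmit call are elided, and exactly where such a check would live is an assumption):

	struct drm_sched_job *s_job;

	list_for_each_entry(s_job, &sched->pending_list, list) {
		struct amdgpu_job *job = to_amdgpu_job(s_job);

		/* Completed before or during the reset: nothing to replay,
		 * so hw_rq_count and the fence slot are not touched again. */
		if (dma_fence_is_signaled(&job->hw_fence))
			continue;

		/* ... resubmit the job as before ... */
	}
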
>
> Please help give some advice.
>
>
> Thanks,
> Victor
>
>
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig at amd.com>
> Sent: Thursday, September 15, 2022 2:32 PM
> To: Zhao, Victor <Victor.Zhao at amd.com>; amd-gfx at lists.freedesktop.org; Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> Cc: Deng, Emily <Emily.Deng at amd.com>
> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>
>
>
> Am 15.09.22 um 06:02 schrieb Zhao, Victor:
>> [AMD Official Use Only - General]
>>
>> Ping.
>>
>> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>>
>> We found some reset-related issues during a stress test on this sequence. Please help give some comments.
>>
>>
>> Thanks,
>> Victor
>>
>>
>>
>> -----Original Message-----
>> From: Victor Zhao <Victor.Zhao at amd.com>
>> Sent: Wednesday, September 14, 2022 6:10 PM
>> To: amd-gfx at lists.freedesktop.org
>> Cc: Deng, Emily <Emily.Deng at amd.com>; Grodzovsky, Andrey
>> <Andrey.Grodzovsky at amd.com>; Zhao, Victor <Victor.Zhao at amd.com>
>> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>
>> [background]
>> For a gpu recovery caused by a hang on one ring (e.g. compute), jobs from another ring (e.g. gfx) may continue signaling during the drm_sched_stop stage. The signaled bit will not be cleared.
>>
>> At the resubmit stage after recovery, a job with the hw fence signaled bit set will call job done directly instead of going through fence process.
>> This makes hw_rq_count decrease but leaves the rcu fence pointer uncleared.
>>
>> Then an overflow happens in the fence driver slots and some jobs may be skipped, leaving their rcu pointers uncleared, which causes an infinite wait for the slot on the next fence emitted.
>>
>> This infinite wait causes a job timeout on the emitting job, and the driver gets stuck at its sched stop step because kthread_park cannot be done.
>>
>> [how]
>> 1. move amdgpu_fence_driver_isr_toggle earlier to close the interrupt before drm sched stop
>> 2. handle all fences in fence process to avoid skipping when an overflow happens
>>
>> Signed-off-by: Victor Zhao <Victor.Zhao at amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-----
>>    2 files changed, 14 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 943c9e750575..c0cfae52f12b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>    		amdgpu_virt_fini_data_exchange(adev);
>>    	}
>>    
>> -	amdgpu_fence_driver_isr_toggle(adev, true);
>> -
>>    	/* block all schedulers and reset given job's ring */
>>    	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>    		struct amdgpu_ring *ring = adev->rings[i];
>> @@ -5214,6 +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    		      amdgpu_device_ip_need_full_reset(tmp_adev))
>>    			amdgpu_ras_suspend(tmp_adev);
>>    
>> +		amdgpu_fence_driver_isr_toggle(tmp_adev, true);
>> +
>>    		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>    			struct amdgpu_ring *ring = tmp_adev->rings[i];
>>    
>> @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    		atomic_inc(&tmp_adev->gpu_reset_counter);
>>    	}
>>    
>> -	if (need_emergency_restart)
>> +	if (need_emergency_restart) {
>> +		list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
>> +			amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>> +		}
>>    		goto skip_sched_resume;
>> +	}
>>    
>>    	/*
>>    	 * Must check guilty signal here since after this point all old
>> @@ -5240,6 +5244,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    	if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>    		job_signaled = true;
>>    		dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
>> +		list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
>> +			amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>> +		}
>>    		goto skip_hw_reset;
>>    	}
>>    
>> @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    		if (r && r == -EAGAIN) {
>>    			set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context->flags);
>>    			adev->asic_reset_res = 0;
>> +			amdgpu_fence_driver_isr_toggle(adev, true);
>>    			goto retry;
>>    		}
>>    	}
>> @@ -5711,6 +5719,8 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>    	set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>    	set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>>    
>> +	amdgpu_fence_driver_isr_toggle(adev, true);
>> +
>>    	adev->no_hw_access = true;
>>    	r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>>    	adev->no_hw_access = false;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index 8adeb7469f1e..65a877e1a7fc 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct amdgpu_ring *ring)
>>    	if (unlikely(seq == last_seq))
>>    		return false;
>>    
>> -	last_seq &= drv->num_fences_mask;
>> -	seq &= drv->num_fences_mask;
>> -
>>    	do {
>>    		struct dma_fence *fence, **ptr;
>>    
>>    		++last_seq;
>> -		last_seq &= drv->num_fences_mask;
>> -		ptr = &drv->fences[last_seq];
>> +		ptr = &drv->fences[last_seq & drv->num_fences_mask];
>>    
>>    		/* There is always exactly one thread signaling this fence slot */
>>    		fence = rcu_dereference_protected(*ptr, 1);
> Those changes here don't seem to make sense. Please explain further why they are necessary.
>
> Christian.
>
>> --
>> 2.25.1


