[PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume

Wed Sep 14 10:10:25 UTC 2022

[background]
In the current sienna cichlid mode2 reset, in the slow job hang case,
the page table context is reverted to completely stop the gpu, which
generates a page fault interrupt.

Since irqs are enabled during the recovery stage, this interrupt may
still be in processing during the ih resume step. Processing it
increases the ih ring rptr, while ih resume meanwhile sets rptr and
wptr to 0. This may leave rptr greater than wptr. This case is not
handled in ih process, so rptr keeps increasing until it reaches the
max. This in turn leads to the fence fallback situation.
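
The race above can be illustrated with a minimal, standalone sketch. It
is a toy model only: struct toy_ih_ring and the toy_* helpers are
hypothetical names, not the amdgpu structures or functions, and the
drain loop is assumed to stop only when rptr equals wptr. In this model
the pointer wraps at the end of the ring; what the real driver does
once rptr reaches the max is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy model of an interrupt ring, not the amdgpu code. */
struct toy_ih_ring {
	uint32_t rptr;     /* next entry the handler will read */
	uint32_t wptr;     /* next entry the hardware will write */
	uint32_t ptr_mask; /* ring_entries - 1 */
};

static void toy_ih_init(struct toy_ih_ring *ih, uint32_t ring_entries)
{
	/* The mask trick below needs a power-of-two ring size. */
	assert(ring_entries && (ring_entries & (ring_entries - 1)) == 0);
	ih->rptr = 0;
	ih->wptr = 0;
	ih->ptr_mask = ring_entries - 1;
}

/*
 * Assumed drain loop: it only tests rptr != wptr, so it cannot tell
 * "rptr is behind wptr" from "rptr is ahead of wptr".
 * Returns the number of entries consumed.
 */
static uint32_t toy_ih_drain(struct toy_ih_ring *ih)
{
	uint32_t consumed = 0;

	while (ih->rptr != ih->wptr) {
		ih->rptr = (ih->rptr + 1) & ih->ptr_mask;
		consumed++;
	}
	return consumed;
}

/* Replays the interleaving described in the background section. */
static uint32_t toy_race_demo(void)
{
	struct toy_ih_ring ih;
	uint32_t handler_next;

	toy_ih_init(&ih, 8);

	/* The handler is in the middle of processing the entry at 2. */
	ih.rptr = 2;
	ih.wptr = 3;
	handler_next = ih.rptr + 1;

	/* Meanwhile the resume step resets both pointers. */
	ih.rptr = 0;
	ih.wptr = 0;

	/* The handler finishes and writes back its stale rptr. */
	ih.rptr = handler_next & ih.ptr_mask;

	/* One genuinely new interrupt arrives after the resume. */
	ih.wptr = 1;

	/* rptr (3) is now ahead of wptr (1). */
	return toy_ih_drain(&ih);
}
```

In this model toy_race_demo() consumes six entries although only one
new interrupt was written, because the loop has to walk from index 3
around the ring before it meets wptr again.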

[how]
Move the enabling of irqs to after ih resume and before the ib test.
Adjust the position of the irq enable on the other reset paths
accordingly.

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 8 ++++----
 drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c | 1 +
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c0cfae52f12b..0b658225e9ef 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4625,8 +4625,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 		amdgpu_fence_driver_force_completion(ring);
 	}
 
-	amdgpu_fence_driver_isr_toggle(adev, false);
-
 	if (job && job->vm)
 		drm_sched_increase_karma(&job->base);
 
@@ -4758,6 +4756,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
 		test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
 	skip_hw_reset = test_bit(AMDGPU_SKIP_HW_RESET, &reset_context->flags);
 
+	list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
+		amdgpu_fence_driver_isr_toggle(tmp_adev, false);
+	}
+
 	/*
 	 * ASIC reset has to be done on all XGMI hive nodes ASAP
 	 * to allow proper links negotiation in FW (within 1 sec)
@@ -5031,8 +5033,6 @@ static void amdgpu_device_recheck_guilty_jobs(
 			/* Clear this failed job from fence array */
 			amdgpu_fence_driver_clear_job_fences(ring);
 
-			amdgpu_fence_driver_isr_toggle(adev, false);
-
 			/* Since the job won't signal and we go for
 			 * another resubmit drop this parent pointer
 			 */
diff --git a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
index 7aa570c1ce4a..953036482d1f 100644
--- a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
+++ b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
@@ -240,6 +240,7 @@ sienna_cichlid_mode2_restore_hwcontext(struct amdgpu_reset_control *reset_ctl,
 	* Add this ASIC as tracked as reset was already
 	* complete successfully.
 	*/
+	amdgpu_fence_driver_isr_toggle(tmp_adev, false);
 	amdgpu_register_gpu_instance(tmp_adev);
 
 	/* Resume RAS */
-- 
2.25.1