[PATCH v3] drm/amdgpu: Check fence emitted count to identify bad jobs

Fan, Shikang Shikang.Fan at amd.com
Thu Nov 21 08:16:22 UTC 2024


[AMD Official Use Only - AMD Internal Distribution Only]

Ok, thank you!

Regards,
Shikang.

________________________________
From: Koenig, Christian <Christian.Koenig at amd.com>
Sent: Thursday, November 21, 2024 4:12 PM
To: Fan, Shikang <Shikang.Fan at amd.com>; amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
Cc: Deng, Emily <Emily.Deng at amd.com>
Subject: Re: [PATCH v3] drm/amdgpu: Check fence emitted count to identify bad jobs

Yeah, just wanted to point out the unused variable as well.

With that fixed the patch is Reviewed-by: Christian König <christian.koenig at amd.com>

Regards,
Christian.

Am 21.11.24 um 07:49 schrieb Fan, Shikang:

I forgot to delete the unused counter "j" from the patch. I'll remove it when I submit the patch to the branch.

Thanks,
Shikang

________________________________
From: Fan, Shikang <Shikang.Fan at amd.com>
Sent: Thursday, November 21, 2024 2:47 PM
To: amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>; Koenig, Christian <Christian.Koenig at amd.com>
Cc: Deng, Emily <Emily.Deng at amd.com>
Subject: Re: [PATCH v3] drm/amdgpu: Check fence emitted count to identify bad jobs

+@Koenig, Christian

Hi Christian,
Could you please help review this patch? I removed the timeout wait in the function.

Thanks,
Shikang

________________________________
From: Shikang Fan <shikang.fan at amd.com>
Sent: Thursday, November 21, 2024 11:48 AM
To: amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
Cc: Fan, Shikang <Shikang.Fan at amd.com>; Deng, Emily <Emily.Deng at amd.com>
Subject: [PATCH v3] drm/amdgpu: Check fence emitted count to identify bad jobs

In SRIOV, when the host driver performs a MODE 1 reset and notifies the
guest driver of the FLR, there is a small chance that no job is running on
the hw but the driver has not updated the pending list yet, causing the
driver to not respond to the FLR request. Modify the has_job_running
function to check whether any job is actually still running.

v2: Use amdgpu_fence_count_emitted to determine job running status.
v3: Remove the timeout wait in has_job_running

Signed-off-by: Emily Deng <Emily.Deng at amd.com>
Signed-off-by: Shikang Fan <shikang.fan at amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b3ca911e55d6..f53889ce71a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5222,15 +5222,18 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 }

 /**
- * amdgpu_device_has_job_running - check if there is any job in mirror list
+ * amdgpu_device_has_job_running - check if there is any unfinished job
  *
  * @adev: amdgpu_device pointer
  *
- * check if there is any job in mirror list
+ * Check if there is any job running on the device when the guest driver
+ * receives an FLR notification from the host driver. If jobs are still
+ * running, the guest driver will not respond to the FLR reset. Instead, let
+ * the job hit the timeout and have the guest driver issue the reset request.
  */
 bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
 {
-       int i;
+       int i, j;
         struct drm_sched_job *job;

         for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
@@ -5239,11 +5242,7 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
                 if (!amdgpu_ring_sched_ready(ring))
                         continue;

-               spin_lock(&ring->sched.job_list_lock);
-               job = list_first_entry_or_null(&ring->sched.pending_list,
-                                              struct drm_sched_job, list);
-               spin_unlock(&ring->sched.job_list_lock);
-               if (job)
+               if (amdgpu_fence_count_emitted(ring))
                         return true;
         }
         return false;
--
2.34.1
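
For readers following the thread, here is a minimal standalone sketch of why the two checks can disagree. It is not driver code: struct model_ring and every model_* name are hypothetical stand-ins, and the emitted count is modeled as the difference between the last emitted and the last signaled fence sequence numbers, which is an assumption about what amdgpu_fence_count_emitted reports.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a ring; NOT the real struct amdgpu_ring. */
struct model_ring {
	bool sched_ready;           /* models amdgpu_ring_sched_ready() */
	uint32_t sync_seq;          /* seq of the last fence emitted to hw */
	uint32_t last_seq;          /* seq of the last fence signaled by hw */
	bool pending_list_nonempty; /* scheduler bookkeeping, may lag behind hw */
};

/* Assumed model: fences emitted but not yet signaled (wraparound safe). */
static uint32_t model_fence_count_emitted(const struct model_ring *ring)
{
	return ring->sync_seq - ring->last_seq;
}

/* Old behavior: trust the scheduler pending list. */
static bool model_has_job_running_old(const struct model_ring *rings, size_t n)
{
	for (size_t i = 0; i < n; ++i) {
		if (!rings[i].sched_ready)
			continue;
		if (rings[i].pending_list_nonempty)
			return true;
	}
	return false;
}

/* New behavior: ask whether any emitted fence is still unsignaled. */
static bool model_has_job_running_new(const struct model_ring *rings, size_t n)
{
	for (size_t i = 0; i < n; ++i) {
		if (!rings[i].sched_ready)
			continue;
		if (model_fence_count_emitted(&rings[i]))
			return true;
	}
	return false;
}
```

In this model, a ring whose fences have all signaled (sync_seq == last_seq) but whose pending list has not been cleaned up yet is reported as busy by the old check and as idle by the new one, which is the window described in the commit message.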

