<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">
Please abandon this patch; it needs further modification.<br>
<br>
Regards,</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">
Shikang</div>
<div id="appendonsend"></div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif; font-size:11pt; color:rgb(0,0,0)">
<br>
</div>
<hr tabindex="-1" style="display:inline-block; width:98%">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> Shikang Fan &lt;shikang.fan@amd.com&gt;<br>
<b>Sent:</b> Thursday, November 21, 2024 11:43 AM<br>
<b>To:</b> amd-gfx@lists.freedesktop.org &lt;amd-gfx@lists.freedesktop.org&gt;<br>
<b>Cc:</b> Fan, Shikang &lt;Shikang.Fan@amd.com&gt;; Deng, Emily &lt;Emily.Deng@amd.com&gt;<br>
<b>Subject:</b> [PATCH v3] drm/amdgpu: Check fence emitted count to identify bad jobs</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt">
<div class="PlainText">In SRIOV, when the host driver performs a MODE 1 reset and notifies the<br>
guest driver of the FLR, there is a small chance that no job is running on<br>
the hw but the driver has not yet updated the pending list, so the driver<br>
does not respond to the FLR request. Modify the has_job_running function<br>
to check whether any job is actually still running.<br>
<br>
v2: Use amdgpu_fence_count_emitted to determine job running status.<br>
v3: Remove the timeout wait in has_job_running<br>
<br>
Signed-off-by: Emily Deng &lt;Emily.Deng@amd.com&gt;<br>
Signed-off-by: Shikang Fan &lt;shikang.fan@amd.com&gt;<br>
---<br>
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 ++++++++--------<br>
1 file changed, 8 insertions(+), 8 deletions(-)<br>
<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
index b3ca911e55d6..ff9995c0f764 100644<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
@@ -5222,15 +5222,19 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,<br>
}<br>
<br>
/**<br>
- * amdgpu_device_has_job_running - check if there is any job in mirror list<br>
+ * amdgpu_device_has_job_running - check if there is any unfinished job<br>
*<br>
* @adev: amdgpu_device pointer<br>
*<br>
- * check if there is any job in mirror list<br>
+ * Check if there is any job running on the device when the guest driver<br>
+ * receives an FLR notification from the host driver. If jobs are still<br>
+ * running and not signaled, the hardware is most likely hung already, so the<br>
+ * guest driver will not respond to the FLR reset. Instead, let the job hit<br>
+ * the timeout, after which the guest driver issues the reset request.<br>
*/<br>
bool amdgpu_device_has_job_running(struct amdgpu_device *adev)<br>
{<br>
- int i;<br>
+ int i, j;<br>
struct drm_sched_job *job;<br>
<br>
for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {<br>
@@ -5239,11 +5243,7 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev)<br>
if (!amdgpu_ring_sched_ready(ring))<br>
continue;<br>
<br>
- spin_lock(&ring->sched.job_list_lock);<br>
- job = list_first_entry_or_null(&ring->sched.pending_list,<br>
- struct drm_sched_job, list);<br>
- spin_unlock(&ring->sched.job_list_lock);<br>
- if (job)<br>
+ if (amdgpu_fence_count_emitted(ring))<br>
return true;<br>
}<br>
return false;<br>
-- <br>
2.34.1<br>
<br>
</div>
</span></font></div>
</div>
</body>
</html>