[PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal reset

Liu, Shaoyun Shaoyun.Liu at amd.com
Wed Oct 26 16:10:50 UTC 2022


[AMD Official Use Only - General]

User space shouldn't care whether SRIOV is in use or not, so I don't think we need to keep the re-submission for SRIOV either. A reset from SRIOV could trigger the host to do a whole GPU reset, which would have the same issue as bare metal.
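
To make the condition under discussion concrete, here is a minimal standalone sketch of the resubmit decision before and after the patch quoted below. It is not driver code: `struct reset_state` and both helper functions are hypothetical names invented for illustration, and `is_sriov_vf` merely stands in for the result of `amdgpu_sriov_vf(tmp_adev)`.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the values the real check reads. */
struct reset_state {
	int  asic_reset_res; /* 0 means the ASIC reset reported success */
	bool job_signaled;   /* same meaning as job_signaled in the patch */
	bool is_sriov_vf;    /* stands in for amdgpu_sriov_vf(tmp_adev) */
};

/* Condition before the patch: SRIOV is not considered at all. */
static bool should_resubmit_before(const struct reset_state *s)
{
	return !s->asic_reset_res && !s->job_signaled;
}

/* Condition after the patch: resubmit only for an SRIOV VF. */
static bool should_resubmit_after(const struct reset_state *s)
{
	return !s->asic_reset_res && !s->job_signaled && s->is_sriov_vf;
}
```

Under these assumptions, a successful bare metal reset with an unsignaled job resubmits before the patch and no longer does afterwards, while the same case on an SRIOV VF still resubmits. Dropping the SRIOV case as well, as suggested above, would amount to never resubmitting from this path.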

Regards
Shaoyun.liu

-----Original Message-----
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Christian König
Sent: Wednesday, October 26, 2022 11:36 AM
To: Tuikov, Luben <Luben.Tuikov at amd.com>; Prosyak, Vitaly <Vitaly.Prosyak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; daniel.vetter at ffwll.ch; amd-gfx at lists.freedesktop.org; dri-devel at lists.freedesktop.org
Cc: Koenig, Christian <Christian.Koenig at amd.com>
Subject: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal reset

Re-submitting IBs by the kernel has many problems because prerequisite state is not automatically re-created as well. In other words, neither binary semaphores nor things like ring buffer pointers are in the state they should be in when the hardware starts to work on the IBs again.

In addition, even after more than 5 years of developing this feature, it is still not stable and we have massive problems getting the reference counts right.

As discussed with user space developers, this behavior is not helpful in the first place. For graphics and multimedia workloads it makes much more sense to either completely re-create the context or at least re-submit the IBs from userspace.

For compute use cases re-submitting is also not very helpful since userspace must rely on the accuracy of the result.

Because of this, we stop this practice and instead just properly note that the fence submission was canceled. The only use cases for which we keep the re-submission for now are SRIOV and function level resets.

Signed-off-by: Christian König <christian.koenig at amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d4584e577b51..39e94feba1ac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5288,7 +5288,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
                                continue;

                        /* No point to resubmit jobs if we didn't HW reset*/
-                       if (!tmp_adev->asic_reset_res && !job_signaled)
+                       if (!tmp_adev->asic_reset_res && !job_signaled &&
+                           amdgpu_sriov_vf(tmp_adev))
                                drm_sched_resubmit_jobs(&ring->sched);

                        drm_sched_start(&ring->sched, !tmp_adev->asic_reset_res);
--
2.25.1


