[PATCH v3] drm/amdgpu: Check fence emitted count to identify bad jobs

Thu Nov 21 08:12:52 UTC 2024

Yeah, just wanted to point out the unused variable as well.

With that fixed the patch is Reviewed-by: Christian König 
<christian.koenig at amd.com>

Regards,
Christian.

On 21.11.24 at 07:49, Fan, Shikang wrote:
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>
> I forgot to delete the unused counter "j" from the patch. I'll remove 
> it when I submit the patch to the branch.
>
> Thanks,
> Shikang
>
> ------------------------------------------------------------------------
> *From:* Fan, Shikang <Shikang.Fan at amd.com>
> *Sent:* Thursday, November 21, 2024 2:47 PM
> *To:* amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>; 
> Koenig, Christian <Christian.Koenig at amd.com>
> *Cc:* Deng, Emily <Emily.Deng at amd.com>
> *Subject:* Re: [PATCH v3] drm/amdgpu: Check fence emitted count to 
> identify bad jobs
> + at Koenig, Christian <mailto:Christian.Koenig at amd.com>
>
> Hi Christian,
> Could you please help review this patch? I removed the timeout wait in 
> the function.
>
> Thanks,
> Shikang
>
> ------------------------------------------------------------------------
> *From:* Shikang Fan <shikang.fan at amd.com>
> *Sent:* Thursday, November 21, 2024 11:48 AM
> *To:* amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
> *Cc:* Fan, Shikang <Shikang.Fan at amd.com>; Deng, Emily <Emily.Deng at amd.com>
> *Subject:* [PATCH v3] drm/amdgpu: Check fence emitted count to 
> identify bad jobs
> In SRIOV, when the host driver performs a MODE 1 reset and notifies the
> guest driver of the FLR, there is a small chance that no job is running
> on the hw but the driver has not updated the pending list yet, causing
> the driver not to respond to the FLR request. Modify the has_job_running
> function to check whether a job is actually still running.
>
> v2: Use amdgpu_fence_count_emitted to determine job running status.
> v3: Remove the timeout wait in has_job_running
>
> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
> Signed-off-by: Shikang Fan <shikang.fan at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b3ca911e55d6..f53889ce71a8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5222,15 +5222,18 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>  }
>
>  /**
> - * amdgpu_device_has_job_running - check if there is any job in mirror list
> + * amdgpu_device_has_job_running - check if there is any unfinished job
>   *
>   * @adev: amdgpu_device pointer
>   *
> - * check if there is any job in mirror list
> + * check if there is any job running on the device when guest driver receives
> + * FLR notification from host driver. If there are still jobs running, then
> + * the guest driver will not respond the FLR reset. Instead, let the job hit
> + * the timeout and guest driver then issue the reset request.
>   */
>  bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
>  {
> -       int i;
> +       int i, j;
>          struct drm_sched_job *job;
>
>          for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> @@ -5239,11 +5242,7 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
>                  if (!amdgpu_ring_sched_ready(ring))
>                          continue;
>
> -               spin_lock(&ring->sched.job_list_lock);
> -               job = list_first_entry_or_null(&ring->sched.pending_list,
> -                                              struct drm_sched_job, list);
> -               spin_unlock(&ring->sched.job_list_lock);
> -               if (job)
> +               if (amdgpu_fence_count_emitted(ring))
>                          return true;
>          }
>          return false;
> --
> 2.34.1
>
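
For readers following the thread, here is a rough user-space sketch of why
counting emitted fences can give a different answer than looking at the
scheduler's pending list. Everything in it is a hypothetical stand-in
(struct toy_ring, its fields and the toy_* helpers are not the real
amdgpu structures or functions); it only assumes that "emitted" means
"last fence sequence number written to the ring minus last sequence
number seen as signaled".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a ring; NOT struct amdgpu_ring. */
struct toy_ring {
	bool     ready;        /* scheduler for this ring is usable */
	uint32_t sync_seq;     /* sequence number of the last fence emitted */
	uint32_t last_seq;     /* sequence number of the last fence signaled */
	unsigned pending_jobs; /* jobs still on the software pending list */
};

/* Fences emitted but not yet signaled; unsigned math survives wraparound. */
uint32_t toy_fence_count_emitted(const struct toy_ring *ring)
{
	return ring->sync_seq - ring->last_seq;
}

/* Old-style check: trusts the software pending list. */
bool toy_has_job_running_old(const struct toy_ring *rings, size_t n)
{
	assert(rings != NULL || n == 0);
	for (size_t i = 0; i < n; ++i) {
		if (!rings[i].ready)
			continue;
		if (rings[i].pending_jobs)
			return true;
	}
	return false;
}

/* New-style check: asks whether any emitted fence is still outstanding. */
bool toy_has_job_running_new(const struct toy_ring *rings, size_t n)
{
	assert(rings != NULL || n == 0);
	for (size_t i = 0; i < n; ++i) {
		if (!rings[i].ready)
			continue;
		if (toy_fence_count_emitted(&rings[i]))
			return true;
	}
	return false;
}
```

In this toy model, a ring with sync_seq == last_seq but pending_jobs == 1
(the hardware is idle, the pending list has not been cleaned up yet) is
reported as busy by the old check and as idle by the new one, which is
the window the commit message describes.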