[PATCH] drm/amdgpu: Check pending job finished or not to identify has bad job

Wed Nov 13 09:22:43 UTC 2024

Hi guys,

can you please explain to me why it's always you guys who come up with 
such nonsense?

When you need to find the number of ongoing hardware submissions, 
please use the amdgpu_fence_count_emitted() function and don't mess with 
any scheduler internals.
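
A minimal user-space sketch of that idea, not the driver code: it assumes each ring tracks the sequence number of the last emitted fence and of the last signaled fence, and treats a ring as busy when the two differ. The names mock_ring, mock_fence_count_emitted() and mock_has_job_running() are hypothetical stand-ins for illustration only; in the kernel the per-ring count would come from amdgpu_fence_count_emitted().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fence bookkeeping of one ring. */
struct mock_ring {
	uint32_t sync_seq; /* sequence number of the last fence emitted */
	uint32_t last_seq; /* sequence number of the last fence signaled */
};

/*
 * Number of fences emitted to the hardware but not yet signaled.
 * Unsigned subtraction keeps the result correct across wraparound.
 */
static unsigned int mock_fence_count_emitted(const struct mock_ring *ring)
{
	return ring->sync_seq - ring->last_seq;
}

/* True if any ring still has work outstanding on the hardware. */
static bool mock_has_job_running(const struct mock_ring *rings, size_t n)
{
	assert(rings != NULL || n == 0);

	for (size_t i = 0; i < n; ++i) {
		if (mock_fence_count_emitted(&rings[i]) != 0)
			return true;
	}
	return false;
}
```

The check only looks at fence sequence numbers, so it needs neither the scheduler's pending list nor its job_list_lock, and it never sleeps.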

This patch here is a clear NAK from my side.

Regards,
Christian.

On 13.11.24 at 09:46, Fan, Shikang wrote:
>
> + at Koenig, Christian <mailto:Christian.Koenig at amd.com>
>
> Hi Christian,
>
> Could you please help review this patch? Thank you.
>
> Regards,
> Shikang
> ------------------------------------------------------------------------
> *From:* Shikang Fan <shikang.fan at amd.com>
> *Sent:* Wednesday, November 13, 2024 11:14 AM
> *To:* amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
> *Cc:* Fan, Shikang <Shikang.Fan at amd.com>; Liu01, Tong (Esther) 
> <Tong.Liu01 at amd.com>; Deng, Emily <Emily.Deng at amd.com>
> *Subject:* [PATCH] drm/amdgpu: Check pending job finished or not to 
> identify has bad job
> drm_sched_free_job_work is a work-queue function, so even after a job
> has finished in hardware, it still takes some time for
> drm_sched_free_job_work to remove it from the pending list.
> Iterate over the pending job list and wait for each job to finish
> within a specified timeout (1s by default), so that jobs which have not
> been cleaned up yet or are about to finish are not reported as running.
> If the wait times out, return true.
>
> Signed-off-by: Tong Liu01 <Tong.Liu01 at amd.com>
> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
> Signed-off-by: Shikang Fan <shikang.fan at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 071d3d9b345d..da2a22618f42 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -100,6 +100,7 @@ MODULE_FIRMWARE("amdgpu/navi12_gpu_info.bin");
>  #define AMDGPU_PCIE_INDEX_FALLBACK (0x38 >> 2)
>  #define AMDGPU_PCIE_INDEX_HI_FALLBACK (0x44 >> 2)
>  #define AMDGPU_PCIE_DATA_FALLBACK (0x3C >> 2)
> +#define AMDGPU_PENDING_JOB_TIMEOUT msecs_to_jiffies(1000)
>
>  static const struct drm_driver amdgpu_kms_driver;
>
> @@ -5224,7 +5225,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>  bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
>  {
>          int i;
> -       struct drm_sched_job *job;
> +       struct drm_sched_job *job, *tmp;
> +       long r;
>
>          for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>                  struct amdgpu_ring *ring = adev->rings[i];
> @@ -5233,11 +5235,20 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
>                          continue;
>
>                  spin_lock(&ring->sched.job_list_lock);
> -               job = list_first_entry_or_null(&ring->sched.pending_list,
> -                                              struct drm_sched_job, list);
> +
> +               /* iterates over the pending job list
> +                * wait for each job to finish within timeout (1s by default)
> +                * if wait timeout, return true
> +                */
> +               list_for_each_entry_safe(job, tmp, &ring->sched.pending_list, list) {
> +                       r = dma_fence_wait_timeout(&job->s_fence->finished,
> +                                                               false, AMDGPU_PENDING_JOB_TIMEOUT);
> +                       if (r <= 0) {
> +                               spin_unlock(&ring->sched.job_list_lock);
> +                               return true;
> +                       }
> +               }
>                  spin_unlock(&ring->sched.job_list_lock);
> -               if (job)
> -                       return true;
>          }
>          return false;
>  }
> -- 
> 2.34.1
>