[PATCH v2] drm/amdgpu: Check fence emitted count to identify bad jobs

Christian König christian.koenig at amd.com
Tue Nov 19 09:30:36 UTC 2024


Hi Shikang,

Please drop the AMDGPU_PENDING_JOB_TIMEOUT workaround completely.

This is unnecessary when you use amdgpu_fence_count_emitted() instead of 
looking at the jobs.

That's one of the reasons why looking at the jobs is a really bad idea 
in the first place.
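
For illustration, a minimal userspace sketch of the check without the
timeout loop might look like the following. The mock_* types, fields and
functions are hypothetical stand-ins invented for this example, not the
driver's real structures; mock_fence_count_emitted() only imitates the
idea behind amdgpu_fence_count_emitted() (fences emitted minus fences
signaled) and is an assumption about its behavior, not a copy of it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical; stands in for AMDGPU_MAX_RINGS. */
#define MOCK_MAX_RINGS 4

/* Hypothetical, heavily simplified stand-in for struct amdgpu_ring. */
struct mock_ring {
	bool sched_ready;  /* stands in for amdgpu_ring_sched_ready() */
	uint32_t sync_seq; /* sequence number of the last fence emitted */
	uint32_t last_seq; /* sequence number of the last fence signaled */
};

/* Hypothetical stand-in for struct amdgpu_device. */
struct mock_device {
	struct mock_ring rings[MOCK_MAX_RINGS];
};

/*
 * Mock of the fence counting idea: number of fences emitted but not yet
 * signaled. Unsigned subtraction stays correct across 32-bit wraparound.
 */
static uint32_t mock_fence_count_emitted(const struct mock_ring *ring)
{
	return ring->sync_seq - ring->last_seq;
}

/*
 * Report whether any ready ring still has unsignaled fences.
 * No pending-list lookup and no polling loop with a timeout.
 */
static bool mock_device_has_job_running(const struct mock_device *dev)
{
	size_t i;

	for (i = 0; i < MOCK_MAX_RINGS; ++i) {
		const struct mock_ring *ring = &dev->rings[i];

		if (!ring->sched_ready)
			continue;

		if (mock_fence_count_emitted(ring))
			return true;
	}
	return false;
}
```

The point of the sketch is that the answer comes from the fence sequence
numbers alone, so it does not depend on whether the scheduler has already
updated its pending list.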

Regards,
Christian.

On 19.11.24 at 09:47, Fan, Shikang wrote:
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>
> +Koenig, Christian <Christian.Koenig at amd.com>
>
> Hi Christian,
>
> Could you please take a look at this patch? Compared to the 
> previous patch, we now use amdgpu_fence_count_emitted() to check for 
> unfinished jobs. The modified function, amdgpu_device_has_job_running(), 
> is currently only used by mailbox_flr_work in the SRIOV case, so I 
> believe the change will not affect the rest of the driver. 
> Thanks for your advice on the v1 patch.
>
> Regards,
> Shikang
>
> ------------------------------------------------------------------------
> *From:* Shikang Fan <shikang.fan at amd.com>
> *Sent:* Monday, November 18, 2024 6:10 PM
> *To:* amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
> *Cc:* Fan, Shikang <Shikang.Fan at amd.com>; Deng, Emily <Emily.Deng at amd.com>
> *Subject:* [PATCH v2] drm/amdgpu: Check fence emitted count to 
> identify bad jobs
> In SRIOV, when the host driver performs a MODE 1 reset and notifies the
> guest driver of the FLR, there is a small chance that no job is running on
> the hw but the driver has not updated the pending list yet, causing the
> driver to not respond to the FLR request. Modify the has_job_running
> function to check whether a job is actually still running.
>
> v2: Use amdgpu_fence_count_emitted to determine job running status.
>
> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
> Signed-off-by: Shikang Fan <shikang.fan at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++--------
>  1 file changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b3ca911e55d6..ea756eacebdc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -100,6 +100,7 @@ MODULE_FIRMWARE("amdgpu/navi12_gpu_info.bin");
>  #define AMDGPU_PCIE_INDEX_FALLBACK (0x38 >> 2)
>  #define AMDGPU_PCIE_INDEX_HI_FALLBACK (0x44 >> 2)
>  #define AMDGPU_PCIE_DATA_FALLBACK (0x3C >> 2)
> +#define AMDGPU_PENDING_JOB_TIMEOUT     (1000000)
>
>  static const struct drm_driver amdgpu_kms_driver;
>
> @@ -5222,15 +5223,19 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>  }
>
>  /**
> - * amdgpu_device_has_job_running - check if there is any job in mirror list
> + * amdgpu_device_has_job_running - check if there is any unfinished job
>   *
>   * @adev: amdgpu_device pointer
>   *
> - * check if there is any job in mirror list
> + * check if there is any job running on the device when guest driver receives
> + * FLR notification from host driver. If there are still jobs running and not
> + * signaled after 1s, the hardware is most likely hung already, then the guest
> + * driver will not respond the FLR reset. Instead, let the job hit the timeout
> + * and guest driver then issue the reset request.
>   */
>  bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
>  {
> -       int i;
> +       int i, j;
>          struct drm_sched_job *job;
>
>          for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> @@ -5239,11 +5244,12 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
>                  if (!amdgpu_ring_sched_ready(ring))
>                          continue;
>
> -               spin_lock(&ring->sched.job_list_lock);
> -               job = list_first_entry_or_null(&ring->sched.pending_list,
> -                                              struct drm_sched_job, list);
> -               spin_unlock(&ring->sched.job_list_lock);
> -               if (job)
> +               for (j = 0; j < AMDGPU_PENDING_JOB_TIMEOUT; j++) {
> +                       if (!amdgpu_fence_count_emitted(ring))
> +                               break;
> +                       udelay(1);
> +               }
> +               if (j == AMDGPU_PENDING_JOB_TIMEOUT)
>                          return true;
>          }
>          return false;
> --
> 2.34.1
>