[PATCH v2] drm/amdgpu: Check fence emitted count to identify bad jobs
Christian König
christian.koenig at amd.com
Tue Nov 19 09:30:36 UTC 2024
Hi Shikang,
Please drop the AMDGPU_PENDING_JOB_TIMEOUT workaround completely.
It is unnecessary when you use amdgpu_fence_count_emitted() instead of
looking at the jobs.
That's one of the reasons why looking at the jobs is such a really,
really bad idea in the first place.
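To illustrate the point, here is a minimal user-space sketch, not the actual driver code. The mock_ring struct, its fields, and the mock_* and has_job_running_by_* helpers are all hypothetical stand-ins for the real amdgpu ring, scheduler, and fence structures. It only models the idea that a fence-based count reflects what the hardware has finished, while a pending list can lag behind:

```c
#include <stdbool.h>
#include <stdint.h>

#define MOCK_MAX_RINGS 4

/* Hypothetical stand-in for a ring; NOT the real struct amdgpu_ring. */
struct mock_ring {
	bool sched_ready;     /* scheduler for this ring is usable */
	uint32_t sync_seq;    /* sequence number of the last fence emitted */
	uint32_t last_seq;    /* sequence number of the last fence signaled */
	unsigned int pending; /* jobs still on the scheduler's pending list */
};

/* Fences emitted but not yet signaled; unsigned math handles wraparound. */
static uint32_t mock_fence_count_emitted(const struct mock_ring *ring)
{
	return ring->sync_seq - ring->last_seq;
}

/* Job-based check: trusts the pending list, which may be cleaned up late. */
static bool has_job_running_by_list(const struct mock_ring *rings, int n)
{
	for (int i = 0; i < n; ++i) {
		if (!rings[i].sched_ready)
			continue;
		if (rings[i].pending)
			return true;
	}
	return false;
}

/* Fence-based check: a single pass, no timeout or polling loop. */
static bool has_job_running_by_fence(const struct mock_ring *rings, int n)
{
	for (int i = 0; i < n; ++i) {
		if (!rings[i].sched_ready)
			continue;
		if (mock_fence_count_emitted(&rings[i]))
			return true;
	}
	return false;
}

/*
 * Returns true when the two checks disagree for a ring whose last fence
 * has signaled but whose job is still sitting on the pending list.
 */
static bool stale_pending_list_example(void)
{
	struct mock_ring rings[MOCK_MAX_RINGS] = {
		[0] = { .sched_ready = true, .sync_seq = 7,
			.last_seq = 7, .pending = 1 },
	};

	return has_job_running_by_list(rings, MOCK_MAX_RINGS) &&
	       !has_job_running_by_fence(rings, MOCK_MAX_RINGS);
}
```

In this model the job-based check reports a running job for the stale ring and the fence-based check does not, which is the window the timeout was trying to paper over.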
Regards,
Christian.
Am 19.11.24 um 09:47 schrieb Fan, Shikang:
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>
> +Koenig, Christian <Christian.Koenig at amd.com>
>
> Hi Christian,
>
> Could you please take a look at this patch? Compared to the
> previous patch, we now use amdgpu_fence_count_emitted() to check for
> unfinished jobs. The function modified here is currently only used by
> mailbox_flr_work in the SRIOV case, so I believe the modification
> will not have any impact on the rest of the driver.
> Thanks for your advice on the v1 patch.
>
> Regards,
> Shikang
>
> ------------------------------------------------------------------------
> *From:* Shikang Fan <shikang.fan at amd.com>
> *Sent:* Monday, November 18, 2024 6:10 PM
> *To:* amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
> *Cc:* Fan, Shikang <Shikang.Fan at amd.com>; Deng, Emily <Emily.Deng at amd.com>
> *Subject:* [PATCH v2] drm/amdgpu: Check fence emitted count to
> identify bad jobs
> In SRIOV, when the host driver performs a MODE 1 reset and notifies the
> guest driver of the FLR, there is a small chance that no job is running
> on the hw but the driver has not updated the pending list yet, causing
> the driver not to respond to the FLR request. Modify the has_job_running
> function to check whether a job is actually still running.
>
> v2: Use amdgpu_fence_count_emitted to determine job running status.
>
> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
> Signed-off-by: Shikang Fan <shikang.fan at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++--------
> 1 file changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b3ca911e55d6..ea756eacebdc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -100,6 +100,7 @@ MODULE_FIRMWARE("amdgpu/navi12_gpu_info.bin");
> #define AMDGPU_PCIE_INDEX_FALLBACK (0x38 >> 2)
> #define AMDGPU_PCIE_INDEX_HI_FALLBACK (0x44 >> 2)
> #define AMDGPU_PCIE_DATA_FALLBACK (0x3C >> 2)
> +#define AMDGPU_PENDING_JOB_TIMEOUT (1000000)
>
> static const struct drm_driver amdgpu_kms_driver;
>
> @@ -5222,15 +5223,19 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
> }
>
> /**
> - * amdgpu_device_has_job_running - check if there is any job in mirror list
> + * amdgpu_device_has_job_running - check if there is any unfinished job
> *
> * @adev: amdgpu_device pointer
> *
> - * check if there is any job in mirror list
> + * check if there is any job running on the device when guest driver receives
> + * FLR notification from host driver. If there are still jobs running and not
> + * signaled after 1s, the hardware is most likely hung already, then the guest
> + * driver will not respond the FLR reset. Instead, let the job hit the timeout
> + * and guest driver then issue the reset request.
> */
> bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
> {
> - int i;
> + int i, j;
> struct drm_sched_job *job;
>
> for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> @@ -5239,11 +5244,12 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
> if (!amdgpu_ring_sched_ready(ring))
> continue;
>
> - spin_lock(&ring->sched.job_list_lock);
> - job = list_first_entry_or_null(&ring->sched.pending_list,
> - struct drm_sched_job, list);
> - spin_unlock(&ring->sched.job_list_lock);
> - if (job)
> + for (j = 0; j < AMDGPU_PENDING_JOB_TIMEOUT; j++) {
> + if (!amdgpu_fence_count_emitted(ring))
> + break;
> + udelay(1);
> + }
> + if (j == AMDGPU_PENDING_JOB_TIMEOUT)
> return true;
> }
> return false;
> --
> 2.34.1
>