[PATCH] drm/xe: Kill missing outer runtime PM protection warning
Matthew Auld
matthew.auld at intel.com
Wed Sep 4 12:49:47 UTC 2024
On 03/09/2024 23:38, Rodrigo Vivi wrote:
> This message was very useful to ensure that Xe was taking all
> the needed outer runtime pm references. However, at this point
> it is only a false positive. So, remove it.
>
> False positive cases:
>
> 1:
> [184.983389] xe ...: [drm] Missing outer runtime PM protection
> [snip]
> [184.984096] drm_ioctl+0x2cf/0x580 [drm]
> [snip]
> [184.984710] xe 0000:00:02.0: Runtime PM usage count underflow!
>
> In this case the underflow is the problem since we are sure that
> the ioctl is protected. But something else is abusing the 'put'
> calls.
>
> 2:
> rpm_status: 0000:03:00.0 status=RPM_SUSPENDING
> console: xe_bo_evict_all (called from suspend)
> xe_sched_job_create: dev=0000:03:00.0, ...
> xe_sched_job_exec: dev=0000:03:00.0, ...
> xe_pm_runtime_put: dev=0000:03:00.0, ...
> xe_sched_job_run: dev=0000:03:00.0, ...
> rpm_usage: 0000:03:00.0 flags-0 cnt-2 ...
> rpm_usage: 0000:03:00.0 flags-0 cnt-2 ...
> rpm_usage: 0000:03:00.0 flags-0 cnt-2 ...
> console: xe 0000:03:00.0: [drm] Missing outer runtime
> PM protection
> console: xe_guc_ct_send+0x15/0x50 [xe]
> console: guc_exec_queue_run_job+0x1509/0x3950 [xe]
> [snip]
> console: drm_sched_run_job_work+0x649/0xc20
>
> At this point, BOs are getting evicted from VRAM with rpm
> usage-counter = 2, but rpm status = SUSPENDING.
> The xe->pm_callback_task won't be equal 'current' because this call is
> coming from a work queue.
>
> So, pm_runtime_get_if_active() will be called and return 0 because rpm
> status != ACTIVE (but equal SUSPENDING).
>
> The only way out is to just grab the reference and move on.
>
> Cc: Matthew Brost <matthew.brost at intel.com>
> Cc: Matthew Auld <matthew.auld at intel.com>
> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> ---
> drivers/gpu/drm/xe/xe_pm.c | 10 ++--------
> 1 file changed, 2 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
> index da68cd689a96..e1a5e43b0f34 100644
> --- a/drivers/gpu/drm/xe/xe_pm.c
> +++ b/drivers/gpu/drm/xe/xe_pm.c
> @@ -592,20 +592,14 @@ bool xe_pm_runtime_get_if_in_use(struct xe_device *xe)
> * xe_pm_runtime_get_noresume - Bump runtime PM usage counter without resuming
> * @xe: xe device instance
> *
> - * This function should be used in inner places where it is surely already
> + * This function should *only* be used in inner places where it is surely already
> * protected by outer-bound callers of `xe_pm_runtime_get`.
> - * It will warn if not protected.
> * The reference should be put back after this function regardless, since it
> * will always bump the usage counter, regardless.
> */
> void xe_pm_runtime_get_noresume(struct xe_device *xe)
> {
> - bool ref;
> -
> - ref = xe_pm_runtime_get_if_in_use(xe);
> -
> - if (drm_WARN(&xe->drm, !ref, "Missing outer runtime PM protection\n"))
> - pm_runtime_get_noresume(xe->drm.dev);
This has proven to find real bugs in the past, right? If so, it seems
unfortunate to drop this completely? What about making it slightly more
fuzzy with something like:
drm_WARN(!pm_read_callback_task() && !pm_runtime_active(), ...
That should avoid the false positive, at the cost of not finding some
real bugs, but at least gives us something?
> + pm_runtime_get_noresume(xe->drm.dev);
> }
>
> /**
More information about the Intel-xe
mailing list