[PATCH] drm/xe/pm: Move xe_rpm_lockmap_acquire

Kandpal, Suraj suraj.kandpal at intel.com
Wed Sep 11 10:58:52 UTC 2024



> -----Original Message-----
> From: Auld, Matthew <matthew.auld at intel.com>
> Sent: Wednesday, September 11, 2024 4:11 PM
> To: Kandpal, Suraj <suraj.kandpal at intel.com>; intel-xe at lists.freedesktop.org
> Cc: Shankar, Uma <uma.shankar at intel.com>; Vivi, Rodrigo
> <rodrigo.vivi at intel.com>
> Subject: Re: [PATCH] drm/xe/pm: Move xe_rpm_lockmap_acquire
> 
> On 11/09/2024 10:30, Suraj Kandpal wrote:
> > Move xe_rpm_lockmap_acquire after the display_pm_suspend and resume
> > functions to avoid a circular locking dependency caused by locks being
> > taken in the intel_fbdev and intel_dp_mst_mgr suspend and resume functions.
> >
> > Signed-off-by: Suraj Kandpal <suraj.kandpal at intel.com>
> 
> Can you provide the full lockdep splat or a link to it? We also need to give some
> solid analysis on why we think the splat is a false positive.

Sure Matthew,
Find the links below; there are two different splats, one with fbdev and one with
drm_dp_mst_mgr, both with the same root cause:
https://gfx-ci.igk.intel.com/cibuglog-ng/issue/13490?query_key=4eadcfe577e7b807f83bf5a9031dec11ae123b77
1) https://intel-gfx-ci.01.org/tree/intel-xe/xe-1775-f0a6824d9e4ba2e1beabab4e3eeb195aa7fea167/re-dg2-17/igt@xe_pm@d3cold-mmap-vram.html#dmesg-warnings266
2) https://intel-gfx-ci.01.org/tree/intel-xe/xe-1924-f3eded4f8a05d73a0b94f27e05737ea3427450b3/re-dg2-15/igt@xe_pm@d3cold-mmap-system.html#dmesg-warnings172
The history of the issue can be found in VLK-61411.
From what I saw, it happens because the lockdep map ends up ordered against all the locks
that are later needed even to resume, e.g. the MST config, which takes mgr->lock in the
drm_dp_mst_topology_mgr suspend and resume functions, and aux->hw_mutex, which is taken
when accessing the DPCD in drm_dp_dpcd_access().
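
To spell out the cycle as I read those splats (a simplified sketch, so treat the exact
call chains and frame names as illustrative rather than a verbatim trace):

  Runtime suspend, with the current ordering:
    xe_pm_runtime_suspend()
      xe_rpm_lockmap_acquire(xe)              /* fake runtime-PM lockdep map */
      xe_display_pm_runtime_suspend(xe)
        drm_dp_mst_topology_mgr_suspend()
          mutex_lock(&mgr->lock)              /* rpm map -> mgr->lock */

  Normal MST/AUX activity:
    MST work / drm_dp_dpcd_access()
      mutex_lock(&mgr->lock) / mutex_lock(&aux->hw_mutex)
      ... wakes the device via runtime PM ... /* mgr->lock, hw_mutex -> rpm map */

Lockdep sees both orderings and reports the circle, which is why the patch moves the
lockmap annotation after the display suspend/resume calls.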



> 
> > ---
> >   drivers/gpu/drm/xe/xe_pm.c | 28 ++++++++++++++--------------
> >   1 file changed, 14 insertions(+), 14 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
> > index a3d1509066f7..7f33e553728a 100644
> > --- a/drivers/gpu/drm/xe/xe_pm.c
> > +++ b/drivers/gpu/drm/xe/xe_pm.c
> > @@ -363,6 +363,18 @@ int xe_pm_runtime_suspend(struct xe_device *xe)
> >   	/* Disable access_ongoing asserts and prevent recursive pm calls */
> >   	xe_pm_write_callback_task(xe, current);
> >
> > +	/*
> > +	 * Applying lock for entire list op as xe_ttm_bo_destroy and xe_bo_move_notify
> > +	 * also checks and delets bo entry from user fault list.
> > +	 */
> > +	mutex_lock(&xe->mem_access.vram_userfault.lock);
> > +	list_for_each_entry_safe(bo, on,
> > +				 &xe->mem_access.vram_userfault.list, vram_userfault_link)
> > +		xe_bo_runtime_pm_release_mmap_offset(bo);
> > +	mutex_unlock(&xe->mem_access.vram_userfault.lock);
> > +
> > +	xe_display_pm_runtime_suspend(xe);
> > +
> >   	/*
> >   	 * The actual xe_pm_runtime_put() is always async underneath, so
> >   	 * exactly where that is called should makes no difference to us. However
> > @@ -386,18 +398,6 @@ int xe_pm_runtime_suspend(struct xe_device *xe)
> >   	 */
> >   	xe_rpm_lockmap_acquire(xe);
> >
> > -	/*
> > -	 * Applying lock for entire list op as xe_ttm_bo_destroy and xe_bo_move_notify
> > -	 * also checks and delets bo entry from user fault list.
> > -	 */
> > -	mutex_lock(&xe->mem_access.vram_userfault.lock);
> > -	list_for_each_entry_safe(bo, on,
> > -				 &xe->mem_access.vram_userfault.list, vram_userfault_link)
> > -		xe_bo_runtime_pm_release_mmap_offset(bo);
> > -	mutex_unlock(&xe->mem_access.vram_userfault.lock);
> > -
> > -	xe_display_pm_runtime_suspend(xe);
> > -
> >   	if (xe->d3cold.allowed) {
> >   		err = xe_bo_evict_all(xe);
> >   		if (err)
> > @@ -438,8 +438,6 @@ int xe_pm_runtime_resume(struct xe_device *xe)
> >   	/* Disable access_ongoing asserts and prevent recursive pm calls */
> >   	xe_pm_write_callback_task(xe, current);
> >
> > -	xe_rpm_lockmap_acquire(xe);
> > -
> >   	if (xe->d3cold.allowed) {
> >   		err = xe_pcode_ready(xe, true);
> >   		if (err)
> > @@ -463,6 +461,8 @@ int xe_pm_runtime_resume(struct xe_device *xe)
> >
> >   	xe_display_pm_runtime_resume(xe);
> >
> > +	xe_rpm_lockmap_acquire(xe);
> > +
> >   	if (xe->d3cold.allowed) {
> >   		err = xe_bo_restore_user(xe);
> >   		if (err)

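For anyone following the thread who has not looked at xe_pm.c: xe_rpm_lockmap_acquire() is
only a lockdep annotation on a fake map, roughly like the below (a sketch from memory of
xe_pm.c, so take the exact map names and the d3cold check as assumptions, not verbatim
code):

static void xe_rpm_lockmap_acquire(const struct xe_device *xe)
{
	/* No real lock is taken; this only records the fake runtime-PM map. */
	lock_map_acquire(xe->d3cold.capable ?
			 &xe_pm_runtime_d3cold_map :
			 &xe_pm_runtime_nod3cold_map);
}

So whatever gets locked after this call (fbdev locks, mgr->lock, aux->hw_mutex) becomes
ordered against the runtime-PM map, and moving the call below the display suspend/resume
paths is what drops those dependencies.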
