[PATCH 1/3] drm/xe: use devm instead of drmm for managed bo

Mon Aug 12 16:38:58 UTC 2024

On 8/12/2024 3:41 AM, Matthew Auld wrote:
> On 10/08/2024 00:12, Daniele Ceraolo Spurio wrote:
>> The BO cleanup touches the GGTT and therefore requires the HW to be
>> available, so we need to use devm instead of drmm.
>
> In the BO ggtt cleanup we have drm_dev_enter() to mark the critical
> sections that need HW interaction vs the bits that just touch SW
> stuff, but it looks like this only works once we have marked the device
> as unplugged. If something blows up during the probe, then the mmio 
> stuff is still unmapped and set to NULL (mmio_fini or something IIRC), 
> but the dev_enter() still sees the device as attached as part of the 
> later drmm and we blow up.
>
> It might make sense to tweak the driver to call the dev unplug() in 
> the error unwind during the probe sequence, that way the 
> drm_dev_enter() will catch this (I think). If we error out during 
> probe, then device can be considered unplugged at the end. Or perhaps 
> we should anyway make this change regardless of this patch?
>
> My thinking with not converting xe_managed_* over to drmm was that we 
> anyway have to deal with userspace objects existing after the HW is 
> removed, and there we might also have to consider ggtt, like with 
> display surfaces. Also the BO is largely just software state and can 
> be tied to life cycle of the driver state, but I guess here this is 
> internal and closely tied to the operation of the HW.
>
>>
>> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1160
>> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>> Cc: Matthew Auld <matthew.auld at intel.com>
>
> If calling unplug doesn't make sense, or is considered orthogonal and 
> only makes sense for other drmm users:

I'm not familiar enough with this code to know which is the better
choice here. I didn't even know drm_dev_enter() existed before you
mentioned it, but that explains why we only see this problem on probe
abort and not on driver remove: we only call drm_dev_unplug() in the
latter case. Weirdly, drm_dev_unplug() is called as part of
xe_device_remove_display(), which makes it look like part of the display
cleanup instead of the more general one.

IMO, using drmm for HW-accessing functions and relying on us correctly
marking every HW-touching block with drm_dev_enter/exit seems more
error-prone than just using devm, so switching seems safer. Is there any
advantage to sticking with drmm instead of switching to devm?
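
To make the ordering argument concrete, here is a toy userspace model
of a probe abort. It is not kernel code and every name in it is made
up for illustration. It assumes that devm actions run during the probe
unwind while MMIO is still mapped (because the BO action was registered
after the MMIO setup), and that drmm actions only run later, at the
final release of the drm_device. Treat it as a sketch of the reasoning,
not as a description of what the driver actually does.

```c
#include <stdbool.h>

/* Toy model, not kernel code: every name below is hypothetical. */
struct toy_device {
	bool mmio_mapped;   /* stands in for the MMIO mapping being valid */
	bool unplugged;     /* stands in for the state drm_dev_enter() checks */
	int bad_hw_access;  /* cleanups that reached HW after MMIO was gone */
};

/* Models a BO cleanup whose GGTT update sits inside an enter/exit guard. */
static void toy_bo_cleanup(struct toy_device *dev)
{
	if (dev->unplugged)
		return;                 /* guard says HW is gone: skip the HW part */
	if (!dev->mmio_mapped)
		dev->bad_hw_access++;   /* guard passed, but MMIO is already unmapped */
}

/*
 * Models a probe abort and returns the number of bad HW accesses.
 * use_devm:        run the BO cleanup during unwind (assumed devm timing)
 *                  instead of at final release (assumed drmm timing).
 * unplug_on_error: mark the device unplugged as part of the error unwind.
 */
static int toy_probe_abort(bool use_devm, bool unplug_on_error)
{
	struct toy_device dev = { .mmio_mapped = true };

	if (unplug_on_error)
		dev.unplugged = true;

	if (use_devm)
		toy_bo_cleanup(&dev);   /* runs while MMIO is still mapped */

	dev.mmio_mapped = false;        /* MMIO teardown during the unwind */

	if (!use_devm)
		toy_bo_cleanup(&dev);   /* runs after MMIO is gone */

	return dev.bad_hw_access;
}
```

In this model only toy_probe_abort(false, false) reports a bad access,
which matches what we see today: drmm timing without the device being
marked unplugged. Either switching to devm timing or marking the device
unplugged on the error path avoids it.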

If we decide to stick with drmm, we'll need to review all callbacks to
make sure they have the enter/exit calls where needed. E.g., the
permanent exec_queue cleanup (called from both the migration and the
GSC drmm callbacks) does an unconditional xe_pm_runtime_get/put, which
seems wrong if this can be called after the HW has been detached (and
implies that the function can end up accessing HW).
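
As a rough illustration of the kind of review I mean, the toy userspace
sketch below contrasts an unconditional wake-up with one that is only
done inside a drm_dev_enter()-style guard. All names are hypothetical
and the counters only stand in for xe_pm_runtime_get/put and for the SW
state release. It is not the actual exec_queue code.

```c
#include <stdbool.h>

/* Toy model, not kernel code: all names are hypothetical. */
struct toy_dev {
	bool unplugged;    /* what a drm_dev_enter()-style guard would check */
	int hw_wakeups;    /* times a runtime-PM-style "get" woke the HW */
	int sw_released;   /* times the software state was released */
};

/* Stand-in for drm_dev_enter(): true only while the HW is still attached. */
static bool toy_dev_enter(const struct toy_dev *dev)
{
	return !dev->unplugged;
}

/* Unguarded variant: wakes the HW even if it has been detached. */
static void toy_queue_cleanup_unguarded(struct toy_dev *dev)
{
	dev->hw_wakeups++;         /* models the unconditional PM get/put */
	dev->sw_released++;        /* SW state is always released */
}

/* Guarded variant: only touches the HW while it is still attached. */
static void toy_queue_cleanup_guarded(struct toy_dev *dev)
{
	if (toy_dev_enter(dev)) {
		dev->hw_wakeups++; /* HW teardown would go here */
	}
	dev->sw_released++;        /* SW state is always released */
}
```

With unplugged set, the guarded variant releases the SW state without
any wake-up, while the unguarded one still tries to wake the HW.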

Thoughts?

Daniele

> Reviewed-by: Matthew Auld <matthew.auld at intel.com>
>
>> ---
>>   drivers/gpu/drm/xe/xe_bo.c | 6 +++---
>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
>> index 3295bc92d7aa..45652d7e6fa6 100644
>> --- a/drivers/gpu/drm/xe/xe_bo.c
>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>> @@ -1576,7 +1576,7 @@ struct xe_bo *xe_bo_create_from_data(struct xe_device *xe, struct xe_tile *tile,
>>       return bo;
>>   }
>>
>> -static void __xe_bo_unpin_map_no_vm(struct drm_device *drm, void *arg)
>> +static void __xe_bo_unpin_map_no_vm(void *arg)
>>   {
>>       xe_bo_unpin_map_no_vm(arg);
>>   }
>> @@ -1591,7 +1591,7 @@ struct xe_bo *xe_managed_bo_create_pin_map(struct xe_device *xe, struct xe_tile
>>       if (IS_ERR(bo))
>>           return bo;
>>
>> -    ret = drmm_add_action_or_reset(&xe->drm, __xe_bo_unpin_map_no_vm, bo);
>> +    ret = devm_add_action_or_reset(xe->drm.dev, __xe_bo_unpin_map_no_vm, bo);
>>       if (ret)
>>           return ERR_PTR(ret);
>>
>> @@ -1639,7 +1639,7 @@ int xe_managed_bo_reinit_in_vram(struct xe_device *xe, struct xe_tile *tile, str
>>       if (IS_ERR(bo))
>>           return PTR_ERR(bo);
>>
>> -    drmm_release_action(&xe->drm, __xe_bo_unpin_map_no_vm, *src);
>> +    devm_release_action(xe->drm.dev, __xe_bo_unpin_map_no_vm, *src);
>>       *src = bo;
>>
>>       return 0;