[Intel-xe] [PATCH] drm/xe/engine: add missing rpm for bind engines

Matthew Auld matthew.auld at intel.com
Wed Jul 26 09:03:39 UTC 2023


On 25/07/2023 17:51, Rodrigo Vivi wrote:
> On Tue, Jul 25, 2023 at 04:11:41PM +0100, Matthew Auld wrote:
>> On 25/07/2023 15:07, Rodrigo Vivi wrote:
>>> On Tue, Jul 25, 2023 at 12:01:17PM +0100, Matthew Auld wrote:
>>>> Bind engines need to use the migration vm; however, we don't hold any rpm
>>>> for such a vm, since otherwise the kernel would prevent rpm suspend-resume.
>>>> There are two issues here. The first is the actual engine create, which
>>>> needs to touch the lrc; since that is in VRAM we trigger loads of missing
>>>> mem_access asserts.
>>>
>>> with this in mind, should we really create the new ENGINE_FLAG_HOLD_RPM,
>>> or should we simply use the existing ENGINE_FLAG_VM?
>>
>> Sure, I can do that instead.
> 
> What I'm not so sure about is the extra case that this would bring:
> 
> at xe_vm_create()
> if (!(flags & XE_VM_FLAG_MIGRATION))

I think in that case the vm is holding the rpm, but looking at the 
destroy path, the bind engine destroy vs the final vm destroy could 
happen in any order, since both appear to be async. The engine destroy 
gets pushed through the scheduler, which has its own kthread, and the vm 
destroy is pushed onto a workqueue. So I think we do want an explicit 
rpm ref there also.
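
To make that ordering concern concrete, here is a minimal user-space
sketch (plain C, not driver code) of the two async teardown paths. The
names vm_destroy_work(), engine_destroy_work() and the wakeref counter
are illustrative stand-ins for the workqueue/kthread paths and the rpm
reference count, not the actual xe functions:

#include <stdio.h>
#include <stdbool.h>

static int wakeref; /* stand-in for the device's rpm reference count */

static bool device_awake(void)
{
        return wakeref > 0;
}

/* stand-in for the vm destroy pushed onto a workqueue */
static void vm_destroy_work(void)
{
        wakeref--; /* the vm drops the rpm ref it was holding */
}

/* stand-in for the engine destroy pushed through the scheduler kthread */
static void engine_destroy_work(bool engine_holds_rpm)
{
        /* GuC CT deregistration needs the device to still be awake here */
        printf("deregister context: device %s\n",
               device_awake() ? "awake (ok)" : "suspended (missing mem_access)");
        if (engine_holds_rpm)
                wakeref--; /* engine drops its own ref only after CT work */
}

int main(void)
{
        /* Only the vm holds a ref: the two async work items can run in
         * either order, so the vm teardown may drop the last ref first. */
        wakeref = 1;
        vm_destroy_work();
        engine_destroy_work(false);

        /* With a per-engine ref (ENGINE_FLAG_HOLD_RPM) the ordering no
         * longer matters: the engine pins the device until it is done. */
        wakeref = 2;
        vm_destroy_work();
        engine_destroy_work(true);
        return 0;
}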

> 
>>
>>>
>>>> The second issue is when destroying the actual
>>>> engine, which requires GuC CT to deregister the context.
>>>>
>>>> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/499
>>>
>>> I wonder if we should get the mem_access from inside the ct path
>>> to be sure?
>>
>> Do you mean calling xe_device_mem_access_get() in ct_send()? I think that
>> will give various locking issues, since any lock the caller is holding can
>> then never be grabbed in the suspend/resume callbacks without deadlocking
>> (sketched after the quoted patch below). And the callers of ct_send() are
>> all over the place, and some are quite deep down. Also, some callers expect
>> a future CT response, so holding the ref in ct_send() covers only one side;
>> it would need to be kept held over to the wait side.
> 
> hmm indeed! And I think we had already discussed that in the past...
> sorry for the noise
> 
>>
>>>
>>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>>> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
>>>> ---
>>>>    drivers/gpu/drm/xe/xe_engine.c       | 20 ++++++++++++++++++++
>>>>    drivers/gpu/drm/xe/xe_engine_types.h |  1 +
>>>>    2 files changed, 21 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_engine.c b/drivers/gpu/drm/xe/xe_engine.c
>>>> index 59e0a9e085ba..dba71f53e53e 100644
>>>> --- a/drivers/gpu/drm/xe/xe_engine.c
>>>> +++ b/drivers/gpu/drm/xe/xe_engine.c
>>>> @@ -76,6 +76,17 @@ static struct xe_engine *__xe_engine_create(struct xe_device *xe,
>>>>    	if (err)
>>>>    		goto err_lrc;
>>>> +	/*
>>>> +	 * Normally the user vm holds an rpm ref to keep the device awake, and
>>>> +	 * the context holds a ref for the vm; however, for some engines we use
>>>> +	 * the kernel's migrate vm underneath, which offers no such rpm ref.
>>>> +	 * Make sure we keep a ref here, so we can perform GuC CT actions when
>>>> +	 * needed. The caller is expected to have already grabbed the rpm ref
>>>> +	 * outside any sensitive locks.
>>>> +	 */
>>>> +	if (e->flags & ENGINE_FLAG_HOLD_RPM)
>>>> +		drm_WARN_ON(&xe->drm, !xe_device_mem_access_get_if_ongoing(xe));
>>>> +
>>>>    	return e;
>>>>    err_lrc:
>>>> @@ -152,6 +163,8 @@ void xe_engine_fini(struct xe_engine *e)
>>>>    		xe_lrc_finish(e->lrc + i);
>>>>    	if (e->vm)
>>>>    		xe_vm_put(e->vm);
>>>> +	if (e->flags & ENGINE_FLAG_HOLD_RPM)
>>>> +		xe_device_mem_access_put(gt_to_xe(e->gt));
>>>>    	kfree(e);
>>>>    }
>>>> @@ -560,14 +573,21 @@ int xe_engine_create_ioctl(struct drm_device *dev, void *data,
>>>>    			if (XE_IOCTL_DBG(xe, !hwe))
>>>>    				return -EINVAL;
>>>> +			/* The migration vm doesn't hold an rpm ref */
>>>> +			xe_device_mem_access_get(xe);
>>>> +
>>>>    			migrate_vm = xe_migrate_get_vm(gt_to_tile(gt)->migrate);
>>>>    			new = xe_engine_create(xe, migrate_vm, logical_mask,
>>>>    					       args->width, hwe,
>>>> +					       ENGINE_FLAG_HOLD_RPM |
>>>>    					       ENGINE_FLAG_PERSISTENT |
>>>>    					       ENGINE_FLAG_VM |
>>>>    					       (id ?
>>>>    					       ENGINE_FLAG_BIND_ENGINE_CHILD :
>>>>    					       0));
>>>> +
>>>> +			xe_device_mem_access_put(xe); /* now held by engine */
>>>> +
>>>>    			xe_vm_put(migrate_vm);
>>>>    			if (IS_ERR(new)) {
>>>>    				err = PTR_ERR(new);
>>>> diff --git a/drivers/gpu/drm/xe/xe_engine_types.h b/drivers/gpu/drm/xe/xe_engine_types.h
>>>> index 36bfaeec23f4..a3867e4db0bb 100644
>>>> --- a/drivers/gpu/drm/xe/xe_engine_types.h
>>>> +++ b/drivers/gpu/drm/xe/xe_engine_types.h
>>>> @@ -59,6 +59,7 @@ struct xe_engine {
>>>>    #define ENGINE_FLAG_VM			BIT(4)
>>>>    #define ENGINE_FLAG_BIND_ENGINE_CHILD	BIT(5)
>>>>    #define ENGINE_FLAG_WA			BIT(6)
>>>> +#define ENGINE_FLAG_HOLD_RPM		BIT(7)
>>>>    	/**
>>>>    	 * @flags: flags for this engine, should statically setup aside from ban
>>>> -- 
>>>> 2.41.0
>>>>
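
For reference, the locking concern discussed above (taking mem_access
inside ct_send()) can be sketched in isolation. This is a minimal,
hypothetical user-space model in plain C with pthreads, not driver code:
caller_lock stands in for whichever lock a deep ct_send() caller might
hold, and rpm_resume()/mem_access_get()/ct_send() are illustrative
stand-ins for the real paths.

#include <pthread.h>
#include <stdio.h>

/* Illustrative stand-in for any lock a ct_send() caller might hold that
 * the rpm resume path would also need; not a real driver lock. */
static pthread_mutex_t caller_lock = PTHREAD_MUTEX_INITIALIZER;

/* stand-in for the runtime resume callback */
static void rpm_resume(void)
{
        if (pthread_mutex_trylock(&caller_lock)) {
                /* The caller already holds it and is blocked waiting on us;
                 * taking it for real here would deadlock. */
                printf("resume needs caller_lock -> would deadlock\n");
                return;
        }
        pthread_mutex_unlock(&caller_lock);
}

/* stand-in for xe_device_mem_access_get(), which may resume synchronously */
static void mem_access_get(void)
{
        rpm_resume();
}

/* the idea floated above: take the reference inside ct_send() itself */
static void ct_send(void)
{
        mem_access_get();
        /* ... build and send the CT message ... */
}

int main(void)
{
        pthread_mutex_lock(&caller_lock); /* a deep caller holds some lock */
        ct_send();
        pthread_mutex_unlock(&caller_lock);
        return 0;
}

This also only models the send side; as noted in the thread, callers
waiting on a CT response would need the reference held over to the wait
side as well, which is why the patch takes the ref at engine create time
instead.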

