[PATCH v5 01/23] drm/xe: Error handling in xe_force_wake_get()
Ghimiray, Himal Prasad
himal.prasad.ghimiray at intel.com
Thu Sep 26 11:43:36 UTC 2024
On 26-09-2024 16:33, Michal Wajdeczko wrote:
>
>
> On 25.09.2024 20:14, Ghimiray, Himal Prasad wrote:
>>
>>
>> On 25-09-2024 22:50, Michal Wajdeczko wrote:
>>>
>>>
>>> On 25.09.2024 18:36, Ghimiray, Himal Prasad wrote:
>>>>
>>>>
>>>> On 25-09-2024 17:31, Michal Wajdeczko wrote:
>>>>>
>>>>>
>>>>> On 24.09.2024 14:16, Himal Prasad Ghimiray wrote:
>>>>>> If an acknowledgment timeout occurs for a domain awake request, do not
>>>>>> increment the reference count for the domain. This ensures that
>>>>>> subsequent _get calls do not incorrectly assume the domain is
>>>>>> awake. The
>>>>>> return value is a mask of domains whose reference counts were
>>>>>> incremented, and these domains need to be released using
>>>>>> xe_force_wake_put.
>>>>>>
>>>>>> The caller needs to compare the return value with the input domains to
>>>>>> determine the success or failure of the operation and decide
>>>>>> whether to
>>>>>> continue or return accordingly.
>>>>>>
>>>>>> While at it, add simple kernel-doc for xe_force_wake_get()
>>>>>>
>>>>>> v3
>>>>>> - Use explicit type for mask (Michal/Badal)
>>>>>> - Improve kernel-doc (Michal)
>>>>>> - Use unsigned int instead of abusing enum (Michal)
>>>>>>
>>>>>> v5
>>>>>> - Use unsigned int for return (MattB/Badal/Rodrigo)
>>>>>> - use xe_gt_WARN for domain awake ack failure (Badal/Rodrigo)
>>>>>>
>>>>>> Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
>>>>>> Cc: Badal Nilawar <badal.nilawar at intel.com>
>>>>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>>>>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>>>>>> Cc: Nirmoy Das <nirmoy.das at intel.com>
>>>>>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/xe/xe_force_wake.c | 37 ++++++++++++++++++++++
>>>>>> +-------
>>>>>> drivers/gpu/drm/xe/xe_force_wake.h | 4 ++--
>>>>>> 2 files changed, 31 insertions(+), 10 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_force_wake.c b/drivers/gpu/drm/xe/
>>>>>> xe_force_wake.c
>>>>>> index a64c14757c84..d190aa93be90 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_force_wake.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_force_wake.c
>>>>>> @@ -150,28 +150,49 @@ static int domain_sleep_wait(struct xe_gt *gt,
>>>>>> (ffs(tmp__) - 1))) && \
>>>>>> domain__->reg_ctl.addr)
>>>>>> -int xe_force_wake_get(struct xe_force_wake *fw,
>>>>>> - enum xe_force_wake_domains domains)
>>>>>> +/**
>>>>>> + * xe_force_wake_get() : Increase the domain refcount
>>>>>> + * @fw: struct xe_force_wake
>>>>>> + * @domains: forcewake domains to get refcount on
>>>>>> + *
>>>>>> + * This function takes references for the input @domains and wakes
>>>>>> them if
>>>>>> + * they are asleep.
>>>>>
>>>>> nit: likely we should also add the note that one shall call the
>>>>> xe_force_wake_put() function to decrease refcounts
>>>>
>>>> Sure.
>>>>
>>>>>
>>>>>> + *
>>>>>> + * Return: mask of refcount increased domains.
>>>>>
>>>>> do we really need to expose implementation detail to the caller?
>>>>>
>>>>> can't we just treat the returned value as opaque data that just
>>>>> needs to
>>>>> be passed to the matching xe_force_wake_put(fw, ref) call ?
>>>>>
>>>>>> If the return value is
>>>>>> + * equal to the input parameter @domains, the operation is considered
>>>>>> + * successful.
>>>>>
>>>>> sorry, but I'm still little uncomfortable with such approach, due to a
>>>>> mismatch between input parameter type that is enum
>>>>> xe_force_wake_domains
>>>>> and return type which is unsigned int, as IMO forcing the user to
>>>>> compare two different types seems wrong
>>>>
>>>> Sure, I understand the thought process behind it. Will a helper to do it
>>>> is OK. Justifying below for the helper.
>>>>
>>>>>
>>>>> can't we just say that if returned value is zero then no domains were
>>>>> waken (which would provide definite answer if single domain was
>>>>> requested) ?
>>>>
>>>> I am concerned about the scenario, where user ends up considering non
>>>> zero return of FORCEWAKE_ALL as success.
>>>
>>> while user can always do something wrong, note that in above statement
>>> we didn't say that _all_ requested domains were awake if ref != 0
>>>
>>>>
>>>> Having multiple success or failure condition for same API, based on
>>>> input parameter looks bad design to me.
>>>
>>> it's not multiple, in fact it is unified:
>>>
>>> 1) ALL_DOMAINS
>>>
>>> ref = xe_force_wake_get(fw, ALL_DOMAINS)
>>> if (!ref)
>>> return -EIO
>>>
>>> // explicit check
>>> if (xe_force_wake_is_awake(fw, SINGLE_DOMAIN))
>>> ...
>>>
>>> xe_force_wake_put(fw, ref)
>>>
>>>
>>> 2a) SINGLE_DOMAIN
>>>
>>> ref = xe_force_wake_get(fw, SINGLE_DOMAIN)
>>> if (!ref)
>>> return -EIO
>>>
>>> // explicit check
>>> if (xe_force_wake_is_awake(fw, SINGLE_DOMAIN))
>>> ...
>>>
>>> xe_force_wake_put(fw, ref)
>>>
>>>
>>> and what we did is just optimization for SINGLE_DOMAIN
>>>
>>> 2b) SINGLE_DOMAIN
>>>
>>> ref = xe_force_wake_get(fw, SINGLE_DOMAIN)
>>> if (!ref)
>>> return -EIO
>>>
>>> BUG_ON(!xe_force_wake_is_awake(fw, SINGLE_DOMAIN))
>>>
>>> xe_force_wake_put(fw, ref)
>>>
>>>>
>>>> ref = xe_force_wake_get(fw, GT)
>>>> if (ref) -> means success condn
>>>>
>>>> whereas
>>>>
>>>> ref = xe_force_wake_put(fw, ALL)
>>>> if (ref) -> doesn't necessarily means successfully awakened all domain.
>>>>
>>>> User can very easily interpret it wrongly:
>>>> ref = xe_force_wake_get(fw, ALL)
>>>> if(!ref)
>>>> error_handling;
>>>>
>>>> /* Irrespective of failure or success of xe_force_wake_get code reaches
>>>> here */
>>>> do_multiple_operations(); /* Needed all fw domains supported on GT to be
>>>> awake */
>>>>
>>>> xe_force_wake_put(fw, ref);
>>>>
>>>> We have use-cases throughout the driver where user relies on success/
>>>> failure for FORCEWAKE_ALL domains.
>>>
>>> maybe for more consistent API we should have two functions:
>>>
>>> a) pass only if _all_ domains are awake
>>>
>>> ref = xe_force_wake_get(fw, ALL_DOMAINS)
>>> if (!ref)
>>> return -EIO
>>>
>>> // no need for explicit checks
>>> BUG_ON(!xe_force_wake_is_awake(fw, FOO_DOMAIN))
>>> BUG_ON(!xe_force_wake_is_awake(fw, BAR_DOMAIN))
>>> ...
>>>
>>> xe_force_wake_put(fw, ref)
>>>
>>> b) fails only if _all_ domains are not awake
>>>
>>> ref = xe_force_wake_get(fw, ALL_DOMAINS)
>>> if (!ref)
>>> return -EIO
>>>
>>> // must do explicit checks
>>> if (xe_force_wake_is_awake(fw, SINGLE_DOMAIN))
>>> ...
>>>
>>> xe_force_wake_put(fw, ref)
>>>>
>>>>>
>>>>> and for the xe_force_wake_get(fw, ALL_DOMAINS) case, we can provide
>>>>> helper function that will check if specified domain is really awake:
>>>>>
>>>>> ref = xe_force_wake_put(fw, ALL_DOMAIN)
>>>>>
>>>>> if (ref) {
>>>>> xe_force_wake_is_awake(fw, SINGLE_DOMAIN)
>>>>
>>>>
>>>> I assume this API is to check whether ref has domain in it or not. Will
>>>> be good helper to have for such scenarios, Which are currently not there
>>>> in driver.
>>>
>>> there could be two variants, one that checks fw
>>>
>>> xe_force_wake_is_awake(fw, SINGLE_DOMAIN)
>>>
>>> other that checks ref
>>>
>>> xe_force_wake_ref_has_domain(ref, SINGLE_DOMAIN)
>>>
>>> but the only benefit from the latter is that it could be lightweight as
>>> otherwise
>>>
>>> BUG_ON(xe_force_wake_is_awake(fw, SINGLE_DOMAIN) ^
>>> xe_force_wake_ref_has_domain(ref, SINGLE_DOMAIN))
>>>
>>>>
>>>>
>>>>>
>>>>> xe_force_wake_put(fw, ref)
>>>>> }
>>>>>
>>>>
>>>> I believe a helper to determine success or failure of force_wake_get,
>>>> instead of caller doing it will streamline all the domains and usecases.
>>>>
>>>> int xe_force_wake_get_status(enum domain, unsigned int fw_ref)
>>>> {
>>>> return (fw_ref == domain) ? 0 : -ETIMEDOUT;
>>>> }
>>>
>>> I'm not sure we should provide error code, IMO simple bool function will
>>> better and caller can decide what to do next
>>
>>
>> -ETIMEDOUT is the actual error code for domain awake acknowledgment
>> failure that is why I added it. I am fine with bool and letting caller
>> decide.
>>
>> Thinking more about it, better not to have xe_force_wake_is_awake and
>> rely on xe_force_wake_ref_has_domain.
>
> but nested functions like do_operations() may not have access to the
> fw_ref, so we should still allow to check domain status without the ref
>
That is true, In that case isn't it always safe to call another
xe_force_wake_get/put instead of creating helper like
xe_force_wake_is_awake(which is guaranteed to be unsafe in nested
functions) ?
>> fw->awake_domain is global state and could be awaken from anywhere
>> doesn't guarantee it is awaken by current force_wake_get(). Can be a
>> misused.
>
> true, so lets just define two helpers to cover all use cases:
>
> bool xe_force_wake_ref_has_domain(unsigned int fw_ref, enum domain)
> bool xe_force_wake_is_awake(struct xe_fw *fw, enum domain)
I find this helper risky, but if you insist will go ahead with it.
>
>>
>> Since you recommended using bool. How about ?
>>
>> bool xe_force_wake_ref_has_domain(enum domain, unsigned int fw_ref)
>> { return (fw_ref & domain == domain) ? true : false ;
>> }
>
> you don't need ternary operator and fw_ref should be first:
Sure. Makes sense.
>
> bool xe_force_wake_ref_has_domain(unsigned int fw_ref, enum domain)
>
>>
>> works for all scenario:
>>
>> ref = xe_force_wake_get(fw, SINGLE)
>> if(!xe_force_wake_ref_has_domain(SINGLE, ref)) -> failure
>>
>>
>> ref = xe_force_wake_get(fw, ALL)
>> !xe_force_wake_ref_has_domain(ALL, ref) -> failure
>>
>> ref = xe_force_wake_get(fw, ALL)
>> if(!xe_force_wake_ref_has_domain(SINGLE, ref)) -> failure
>>
>>
>> If you find this to be ok. Will define this helper and use it to check
>> status of force_wake_get.
>
> with fixed/added helpers it's fine with me
Thanks for confirming this.
>
>>
>> BR
>> Himal
>>
>>>
>>>>
>>>> Usecases:
>>>>
>>>> A) No error check, just continue in case of failure too.
>>>>
>>>> ref = xe_force_wake_get(fw, domain /* can be ALL_DOMAIN */);
>>>> do_operations();
>>>> xe_force_wake_put(fw, ref);
>>>
>>> yep
>>>
>>> and inside do_operations() we can always use xe_force_wake_is_awake to
>>> check every domain
>>>
>>>>
>>>> B) error check, abort in case of failure.
>>>>
>>>> ref = xe_force_wake_get(fw, domain /* can be ALL_DOMAIN */);
>>>> int err = xe_force_wake_get_status(domain, ref);
>>>> if(err) {
>>>> xe_force_wake_put(fw, ref);
>>>> return err;
>>>> }
>>>> do_operations();
>>>> xe_force_wake_put(fw, ref);
>>>
>>> nay, too complicated, better:
>>>
>>> ref = xe_force_wake_get(fw, ALL);
>>>
>>> err = xe_force_wake_is_awake(fw, ALL) ? do_operations() : -EFATAL;
>>>
>>> xe_force_wake_put(fw, ref);
>>>
>>>>
>>>> c) get all domain, but check specific domain:
>>>>
>>>> ref = xe_force_wake_get(fw, ALL_domain);
>>>> if (xe_force_wake_get_status(domain, ref))
>>>> dmesg_warn( "unable to awake all requested domain \n");
>>>
>>> IIRC in xe_force_wake_get() there is one WARN already
>>>
>>>>
>>>> if (xe_fwref_has_domain(fw, SINGLE_DOMAIN))
>>>> do_operations()
>>>>
>>>> xe_force_wake_put(fw, ref);
>>>>
>>>
>>> so maybe simpler:
>>>
>>> ref = xe_force_wake_get(fw, ALL);
>>>
>>> err = xe_force_wake_is_awake(fw, FOO) ? do_operations() : -EMINOR;
>>>
>>> xe_force_wake_put(fw, ref);
>>>
>>>
>>>>>> Otherwise, the operation is considered a failure, and
>>>>>> + * the caller should handle the failure case, potentially returning
>>>>>> + * -ETIMEDOUT.
>>>>>> + */
>>>>>> +unsigned int xe_force_wake_get(struct xe_force_wake *fw,
>>>>>> + enum xe_force_wake_domains domains)
>>>>>> {
>>>>>> struct xe_gt *gt = fw->gt;
>>>>>> struct xe_force_wake_domain *domain;
>>>>>> - enum xe_force_wake_domains tmp, woken = 0;
>>>>>> + unsigned int tmp, ret, awake_rqst = 0, awake_failed = 0;
>>>>>> unsigned long flags;
>>>>>> - int ret = 0;
>>>>>> spin_lock_irqsave(&fw->lock, flags);
>>>>>> for_each_fw_domain_masked(domain, domains, fw, tmp) {
>>>>>> if (!domain->ref++) {
>>>>>> - woken |= BIT(domain->id);
>>>>>> + awake_rqst |= BIT(domain->id);
>>>>>> domain_wake(gt, domain);
>>>>>> }
>>>>>> }
>>>>>> - for_each_fw_domain_masked(domain, woken, fw, tmp) {
>>>>>> - ret |= domain_wake_wait(gt, domain);
>>>>>> + for_each_fw_domain_masked(domain, awake_rqst, fw, tmp) {
>>>>>> + if (domain_wake_wait(gt, domain) == 0) {
>>>>>> + fw->awake_domains |= BIT(domain->id);
>>>>>> + } else {
>>>>>> + awake_failed |= BIT(domain->id);
>>>>>> + --domain->ref;
>>>>>> + }
>>>>>> }
>>>>>> - fw->awake_domains |= woken;
>>>>>> + ret = (domains & ~awake_failed);
>>>>>> spin_unlock_irqrestore(&fw->lock, flags);
>>>>>> + xe_gt_WARN(gt, awake_failed, "domain%s %#x failed to
>>>>>> acknowledgment awake\n",
>>>>>> + str_plural(hweight_long(awake_failed)), awake_failed);
>>>>>> +
>>>>>> return ret;
>>>>>> }
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_force_wake.h b/drivers/gpu/drm/
>>>>>> xe/xe_force_wake.h
>>>>>> index a2577672f4e3..6c1ade39139b 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_force_wake.h
>>>>>> +++ b/drivers/gpu/drm/xe/xe_force_wake.h
>>>>>> @@ -15,8 +15,8 @@ void xe_force_wake_init_gt(struct xe_gt *gt,
>>>>>> struct xe_force_wake *fw);
>>>>>> void xe_force_wake_init_engines(struct xe_gt *gt,
>>>>>> struct xe_force_wake *fw);
>>>>>> -int xe_force_wake_get(struct xe_force_wake *fw,
>>>>>> - enum xe_force_wake_domains domains);
>>>>>> +unsigned int xe_force_wake_get(struct xe_force_wake *fw,
>>>>>> + enum xe_force_wake_domains domains);
>>>>>> int xe_force_wake_put(struct xe_force_wake *fw,
>>>>>> enum xe_force_wake_domains domains);
>>>>>>
>>>>>
>>>>
>>>
>>
>
More information about the Intel-xe
mailing list