[PATCH v5 01/23] drm/xe: Error handling in xe_force_wake_get()

Ghimiray, Himal Prasad himal.prasad.ghimiray at intel.com
Wed Sep 25 18:14:02 UTC 2024



On 25-09-2024 22:50, Michal Wajdeczko wrote:
> 
> 
> On 25.09.2024 18:36, Ghimiray, Himal Prasad wrote:
>>
>>
>> On 25-09-2024 17:31, Michal Wajdeczko wrote:
>>>
>>>
>>> On 24.09.2024 14:16, Himal Prasad Ghimiray wrote:
>>>> If an acknowledgment timeout occurs for a domain awake request, do not
>>>> increment the reference count for the domain. This ensures that
>>>> subsequent _get calls do not incorrectly assume the domain is awake. The
>>>> return value is a mask of domains whose reference counts were
>>>> incremented, and these domains need to be released using
>>>> xe_force_wake_put.
>>>>
>>>> The caller needs to compare the return value with the input domains to
>>>> determine the success or failure of the operation and decide whether to
>>>> continue or return accordingly.
>>>>
>>>> While at it, add simple kernel-doc for xe_force_wake_get()
>>>>
>>>> v3
>>>> - Use explicit type for mask (Michal/Badal)
>>>> - Improve kernel-doc (Michal)
>>>> - Use unsigned int instead of abusing enum (Michal)
>>>>
>>>> v5
>>>> - Use unsigned int for return (MattB/Badal/Rodrigo)
>>>> - use xe_gt_WARN for domain awake ack failure (Badal/Rodrigo)
>>>>
>>>> Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
>>>> Cc: Badal Nilawar <badal.nilawar at intel.com>
>>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>>>> Cc: Nirmoy Das <nirmoy.das at intel.com>
>>>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
>>>> ---
>>>>    drivers/gpu/drm/xe/xe_force_wake.c | 37 +++++++++++++++++++++++-------
>>>>    drivers/gpu/drm/xe/xe_force_wake.h |  4 ++--
>>>>    2 files changed, 31 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_force_wake.c b/drivers/gpu/drm/xe/
>>>> xe_force_wake.c
>>>> index a64c14757c84..d190aa93be90 100644
>>>> --- a/drivers/gpu/drm/xe/xe_force_wake.c
>>>> +++ b/drivers/gpu/drm/xe/xe_force_wake.c
>>>> @@ -150,28 +150,49 @@ static int domain_sleep_wait(struct xe_gt *gt,
>>>>                         (ffs(tmp__) - 1))) && \
>>>>                         domain__->reg_ctl.addr)
>>>>    -int xe_force_wake_get(struct xe_force_wake *fw,
>>>> -              enum xe_force_wake_domains domains)
>>>> +/**
>>>> + * xe_force_wake_get() : Increase the domain refcount
>>>> + * @fw: struct xe_force_wake
>>>> + * @domains: forcewake domains to get refcount on
>>>> + *
>>>> + * This function takes references for the input @domains and wakes
>>>> them if
>>>> + * they are asleep.
>>>
>>> nit: likely we should also add the note that one shall call the
>>> xe_force_wake_put() function to decrease refcounts
>>
>> Sure.
>>
>>>
>>>> + *
>>>> + * Return: mask of refcount increased domains.
>>>
>>> do we really need to expose implementation detail to the caller?
>>>
>>> can't we just treat the returned value as opaque data that just needs to
>>> be passed to the matching xe_force_wake_put(fw, ref) call ?
>>>
>>>> If the return value is
>>>> + * equal to the input parameter @domains, the operation is considered
>>>> + * successful.
>>>
>>> sorry, but I'm still little uncomfortable with such approach, due to a
>>> mismatch between input parameter type that is enum xe_force_wake_domains
>>> and return type which is unsigned int, as IMO forcing the user to
>>> compare two different types seems wrong
>>
>> Sure, I understand the thought process behind it. Will a helper to do it
>> is OK. Justifying below for the helper.
>>
>>>
>>> can't we just say that if returned value is zero then no domains were
>>> waken (which would provide definite answer if single domain was
>>> requested) ?
>>
>> I am concerned  about the scenario, where user ends up considering non
>> zero return of FORCEWAKE_ALL as success.
> 
> while user can always do something wrong, note that in above statement
> we didn't say that _all_ requested domains were awake if ref != 0
> 
>>
>> Having multiple success or failure condition for same API, based on
>> input parameter looks bad design to me.
> 
> it's not multiple, in fact it is unified:
> 
> 1) ALL_DOMAINS
> 
> 	ref = xe_force_wake_get(fw, ALL_DOMAINS)
> 	if (!ref)
> 		return -EIO
> 
> 	// explicit check
> 	if (xe_force_wake_is_awake(fw, SINGLE_DOMAIN))
> 		...
> 
> 	xe_force_wake_put(fw, ref)
> 
> 
> 2a) SINGLE_DOMAIN
> 
> 	ref = xe_force_wake_get(fw, SINGLE_DOMAIN)
> 	if (!ref)
> 		return -EIO
> 
> 	// explicit check
> 	if (xe_force_wake_is_awake(fw, SINGLE_DOMAIN))
> 		...
> 
> 	xe_force_wake_put(fw, ref)
> 
> 
> and what we did is just optimization for SINGLE_DOMAIN
> 
> 2b) SINGLE_DOMAIN
> 
> 	ref = xe_force_wake_get(fw, SINGLE_DOMAIN)
> 	if (!ref)
> 		return -EIO
> 
> 	BUG_ON(!xe_force_wake_is_awake(fw, SINGLE_DOMAIN))
> 
> 	xe_force_wake_put(fw, ref)
> 
>>
>> ref = xe_force_wake_get(fw, GT)
>> if (ref) -> means success condn
>>
>> whereas
>>
>> ref = xe_force_wake_put(fw, ALL)
>> if (ref) -> doesn't necessarily means successfully awakened all domain.
>>
>> User can very easily interpret it wrongly:
>> ref = xe_force_wake_get(fw, ALL)
>> if(!ref)
>>      error_handling;
>>
>> /* Irrespective of failure or success of xe_force_wake_get code reaches
>> here */
>> do_multiple_operations(); /* Needed all fw domains supported on GT to be
>> awake */
>>
>> xe_force_wake_put(fw, ref);
>>
>> We have use-cases throughout the driver where user relies on success/
>> failure for FORCEWAKE_ALL domains.
> 
> maybe for more consistent API we should have two functions:
> 
> a) pass only if _all_ domains are awake
> 
> 	ref = xe_force_wake_get(fw, ALL_DOMAINS)
> 	if (!ref)
> 		return -EIO
> 
> 	// no need for explicit checks
> 	BUG_ON(!xe_force_wake_is_awake(fw, FOO_DOMAIN))
> 	BUG_ON(!xe_force_wake_is_awake(fw, BAR_DOMAIN))
> 	...
> 
> 	xe_force_wake_put(fw, ref)
> 
> b) fails only if _all_ domains are not awake
> 
> 	ref = xe_force_wake_get(fw, ALL_DOMAINS)
> 	if (!ref)
> 		return -EIO
> 
> 	// must do explicit checks
> 	if (xe_force_wake_is_awake(fw, SINGLE_DOMAIN))
> 		...
> 
> 	xe_force_wake_put(fw, ref)
>>
>>>
>>> and for the xe_force_wake_get(fw, ALL_DOMAINS) case, we can provide
>>> helper function that will check if specified domain is really awake:
>>>
>>>      ref = xe_force_wake_put(fw, ALL_DOMAIN)
>>>
>>>      if (ref) {
>>>          xe_force_wake_is_awake(fw, SINGLE_DOMAIN)
>>
>>
>> I assume this API is to check whether ref has domain in it or not. Will
>> be good helper to have for such scenarios, Which are currently not there
>> in driver.
> 
> there could be two variants, one that checks fw
> 
> 	xe_force_wake_is_awake(fw, SINGLE_DOMAIN)
> 
> other that checks ref
> 
> 	xe_force_wake_ref_has_domain(ref, SINGLE_DOMAIN)
> 
> but the only benefit from the latter is that it could be lightweight as
> otherwise
> 
> 	BUG_ON(xe_force_wake_is_awake(fw, SINGLE_DOMAIN) ^
> 	       xe_force_wake_ref_has_domain(ref, SINGLE_DOMAIN))
> 
>>
>>
>>>
>>>          xe_force_wake_put(fw, ref)
>>>      }
>>>
>>
>> I believe a helper to determine success or failure of force_wake_get,
>> instead of caller doing it will streamline all the domains and usecases.
>>
>> int xe_force_wake_get_status(enum domain, unsigned int fw_ref)
>> {
>>    return  (fw_ref == domain) ? 0 : -ETIMEDOUT;
>> }
> 
> I'm not sure we should provide error code, IMO simple bool function will
> better and caller can decide what to do next


-ETIMEDOUT is the actual error code for domain awake acknowledgment 
failure that is why I added it. I am fine with bool and letting caller 
decide.

Thinking more about it, better not to have xe_force_wake_is_awake and 
rely on xe_force_wake_ref_has_domain.
fw->awake_domain is global state and could be awaken from anywhere 
doesn't guarantee it is awaken by current force_wake_get(). Can be a 
misused.

Since you recommended using bool. How about ?

bool xe_force_wake_ref_has_domain(enum domain, unsigned int fw_ref)
{ return (fw_ref & domain == domain) ?  true : false ;
}

works for all scenario:

ref = xe_force_wake_get(fw, SINGLE)
if(!xe_force_wake_ref_has_domain(SINGLE, ref)) -> failure


ref = xe_force_wake_get(fw, ALL)
!xe_force_wake_ref_has_domain(ALL, ref) -> failure

ref = xe_force_wake_get(fw, ALL)
if(!xe_force_wake_ref_has_domain(SINGLE, ref)) -> failure


If you find this to be ok. Will define this helper and use it to check 
status of force_wake_get.

BR
Himal

> 
>>
>> Usecases:
>>
>> A) No error check, just continue in case of failure too.
>>
>> ref =  xe_force_wake_get(fw, domain /* can be ALL_DOMAIN */);
>> do_operations();
>> xe_force_wake_put(fw, ref);
> 
> yep
> 
> and inside do_operations() we can always use xe_force_wake_is_awake to
> check every domain
> 
>>
>> B) error check, abort in case of failure.
>>
>> ref =  xe_force_wake_get(fw, domain /* can be ALL_DOMAIN */);
>> int err = xe_force_wake_get_status(domain, ref);
>> if(err) {
>>       xe_force_wake_put(fw, ref);
>>           return err;
>>          }
>> do_operations();
>> xe_force_wake_put(fw, ref);
> 
> nay, too complicated, better:
> 
> ref = xe_force_wake_get(fw, ALL);
> 
> err = xe_force_wake_is_awake(fw, ALL) ? do_operations() : -EFATAL;
> 
> xe_force_wake_put(fw, ref);
> 
>>
>> c) get all domain, but check specific domain:
>>
>> ref =  xe_force_wake_get(fw, ALL_domain);
>>     if (xe_force_wake_get_status(domain, ref))
>>          dmesg_warn( "unable to awake all requested domain \n");
> 
> IIRC in xe_force_wake_get() there is one WARN already
> 
>>
>>      if (xe_fwref_has_domain(fw, SINGLE_DOMAIN))
>>          do_operations()
>>
>>     xe_force_wake_put(fw, ref);
>>
> 
> so maybe simpler:
> 
> ref = xe_force_wake_get(fw, ALL);
> 
> err = xe_force_wake_is_awake(fw, FOO) ? do_operations() : -EMINOR;
> 
> xe_force_wake_put(fw, ref);
> 
> 
>>>> Otherwise, the operation is considered a failure, and
>>>> + * the caller should handle the failure case, potentially returning
>>>> + * -ETIMEDOUT.
>>>> + */
>>>> +unsigned int xe_force_wake_get(struct xe_force_wake *fw,
>>>> +                   enum xe_force_wake_domains domains)
>>>>    {
>>>>        struct xe_gt *gt = fw->gt;
>>>>        struct xe_force_wake_domain *domain;
>>>> -    enum xe_force_wake_domains tmp, woken = 0;
>>>> +    unsigned int tmp, ret, awake_rqst = 0, awake_failed = 0;
>>>>        unsigned long flags;
>>>> -    int ret = 0;
>>>>          spin_lock_irqsave(&fw->lock, flags);
>>>>        for_each_fw_domain_masked(domain, domains, fw, tmp) {
>>>>            if (!domain->ref++) {
>>>> -            woken |= BIT(domain->id);
>>>> +            awake_rqst |= BIT(domain->id);
>>>>                domain_wake(gt, domain);
>>>>            }
>>>>        }
>>>> -    for_each_fw_domain_masked(domain, woken, fw, tmp) {
>>>> -        ret |= domain_wake_wait(gt, domain);
>>>> +    for_each_fw_domain_masked(domain, awake_rqst, fw, tmp) {
>>>> +        if (domain_wake_wait(gt, domain) == 0) {
>>>> +            fw->awake_domains |= BIT(domain->id);
>>>> +        } else {
>>>> +            awake_failed |= BIT(domain->id);
>>>> +            --domain->ref;
>>>> +        }
>>>>        }
>>>> -    fw->awake_domains |= woken;
>>>> +    ret = (domains & ~awake_failed);
>>>>        spin_unlock_irqrestore(&fw->lock, flags);
>>>>    +    xe_gt_WARN(gt, awake_failed, "domain%s %#x failed to
>>>> acknowledgment awake\n",
>>>> +           str_plural(hweight_long(awake_failed)), awake_failed);
>>>> +
>>>>        return ret;
>>>>    }
>>>>    diff --git a/drivers/gpu/drm/xe/xe_force_wake.h b/drivers/gpu/drm/
>>>> xe/xe_force_wake.h
>>>> index a2577672f4e3..6c1ade39139b 100644
>>>> --- a/drivers/gpu/drm/xe/xe_force_wake.h
>>>> +++ b/drivers/gpu/drm/xe/xe_force_wake.h
>>>> @@ -15,8 +15,8 @@ void xe_force_wake_init_gt(struct xe_gt *gt,
>>>>                   struct xe_force_wake *fw);
>>>>    void xe_force_wake_init_engines(struct xe_gt *gt,
>>>>                    struct xe_force_wake *fw);
>>>> -int xe_force_wake_get(struct xe_force_wake *fw,
>>>> -              enum xe_force_wake_domains domains);
>>>> +unsigned int xe_force_wake_get(struct xe_force_wake *fw,
>>>> +                   enum xe_force_wake_domains domains);
>>>>    int xe_force_wake_put(struct xe_force_wake *fw,
>>>>                  enum xe_force_wake_domains domains);
>>>>    
>>>
>>
> 



More information about the Intel-xe mailing list