[PATCH v5 01/23] drm/xe: Error handling in xe_force_wake_get()
Michal Wajdeczko
michal.wajdeczko at intel.com
Thu Sep 26 11:03:04 UTC 2024
On 25.09.2024 20:14, Ghimiray, Himal Prasad wrote:
>
>
> On 25-09-2024 22:50, Michal Wajdeczko wrote:
>>
>>
>> On 25.09.2024 18:36, Ghimiray, Himal Prasad wrote:
>>>
>>>
>>> On 25-09-2024 17:31, Michal Wajdeczko wrote:
>>>>
>>>>
>>>> On 24.09.2024 14:16, Himal Prasad Ghimiray wrote:
>>>>> If an acknowledgment timeout occurs for a domain awake request, do not
>>>>> increment the reference count for the domain. This ensures that
>>>>> subsequent _get calls do not incorrectly assume the domain is
>>>>> awake. The
>>>>> return value is a mask of domains whose reference counts were
>>>>> incremented, and these domains need to be released using
>>>>> xe_force_wake_put.
>>>>>
>>>>> The caller needs to compare the return value with the input domains to
>>>>> determine the success or failure of the operation and decide
>>>>> whether to
>>>>> continue or return accordingly.
>>>>>
>>>>> While at it, add simple kernel-doc for xe_force_wake_get()
>>>>>
>>>>> v3
>>>>> - Use explicit type for mask (Michal/Badal)
>>>>> - Improve kernel-doc (Michal)
>>>>> - Use unsigned int instead of abusing enum (Michal)
>>>>>
>>>>> v5
>>>>> - Use unsigned int for return (MattB/Badal/Rodrigo)
>>>>> - use xe_gt_WARN for domain awake ack failure (Badal/Rodrigo)
>>>>>
>>>>> Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
>>>>> Cc: Badal Nilawar <badal.nilawar at intel.com>
>>>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>>>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>>>>> Cc: Nirmoy Das <nirmoy.das at intel.com>
>>>>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
>>>>> ---
>>>>> drivers/gpu/drm/xe/xe_force_wake.c | 37 ++++++++++++++++++++++
>>>>> +-------
>>>>> drivers/gpu/drm/xe/xe_force_wake.h | 4 ++--
>>>>> 2 files changed, 31 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_force_wake.c b/drivers/gpu/drm/xe/
>>>>> xe_force_wake.c
>>>>> index a64c14757c84..d190aa93be90 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_force_wake.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_force_wake.c
>>>>> @@ -150,28 +150,49 @@ static int domain_sleep_wait(struct xe_gt *gt,
>>>>> (ffs(tmp__) - 1))) && \
>>>>> domain__->reg_ctl.addr)
>>>>> -int xe_force_wake_get(struct xe_force_wake *fw,
>>>>> - enum xe_force_wake_domains domains)
>>>>> +/**
>>>>> + * xe_force_wake_get() : Increase the domain refcount
>>>>> + * @fw: struct xe_force_wake
>>>>> + * @domains: forcewake domains to get refcount on
>>>>> + *
>>>>> + * This function takes references for the input @domains and wakes
>>>>> them if
>>>>> + * they are asleep.
>>>>
>>>> nit: likely we should also add the note that one shall call the
>>>> xe_force_wake_put() function to decrease refcounts
>>>
>>> Sure.
>>>
>>>>
>>>>> + *
>>>>> + * Return: mask of refcount increased domains.
>>>>
>>>> do we really need to expose implementation detail to the caller?
>>>>
>>>> can't we just treat the returned value as opaque data that just
>>>> needs to
>>>> be passed to the matching xe_force_wake_put(fw, ref) call ?
>>>>
>>>>> If the return value is
>>>>> + * equal to the input parameter @domains, the operation is considered
>>>>> + * successful.
>>>>
>>>> sorry, but I'm still little uncomfortable with such approach, due to a
>>>> mismatch between input parameter type that is enum
>>>> xe_force_wake_domains
>>>> and return type which is unsigned int, as IMO forcing the user to
>>>> compare two different types seems wrong
>>>
>>> Sure, I understand the thought process behind it. Will a helper to do it
>>> is OK. Justifying below for the helper.
>>>
>>>>
>>>> can't we just say that if returned value is zero then no domains were
>>>> waken (which would provide definite answer if single domain was
>>>> requested) ?
>>>
>>> I am concerned about the scenario, where user ends up considering non
>>> zero return of FORCEWAKE_ALL as success.
>>
>> while user can always do something wrong, note that in above statement
>> we didn't say that _all_ requested domains were awake if ref != 0
>>
>>>
>>> Having multiple success or failure condition for same API, based on
>>> input parameter looks bad design to me.
>>
>> it's not multiple, in fact it is unified:
>>
>> 1) ALL_DOMAINS
>>
>> ref = xe_force_wake_get(fw, ALL_DOMAINS)
>> if (!ref)
>> return -EIO
>>
>> // explicit check
>> if (xe_force_wake_is_awake(fw, SINGLE_DOMAIN))
>> ...
>>
>> xe_force_wake_put(fw, ref)
>>
>>
>> 2a) SINGLE_DOMAIN
>>
>> ref = xe_force_wake_get(fw, SINGLE_DOMAIN)
>> if (!ref)
>> return -EIO
>>
>> // explicit check
>> if (xe_force_wake_is_awake(fw, SINGLE_DOMAIN))
>> ...
>>
>> xe_force_wake_put(fw, ref)
>>
>>
>> and what we did is just optimization for SINGLE_DOMAIN
>>
>> 2b) SINGLE_DOMAIN
>>
>> ref = xe_force_wake_get(fw, SINGLE_DOMAIN)
>> if (!ref)
>> return -EIO
>>
>> BUG_ON(!xe_force_wake_is_awake(fw, SINGLE_DOMAIN))
>>
>> xe_force_wake_put(fw, ref)
>>
>>>
>>> ref = xe_force_wake_get(fw, GT)
>>> if (ref) -> means success condn
>>>
>>> whereas
>>>
>>> ref = xe_force_wake_put(fw, ALL)
>>> if (ref) -> doesn't necessarily means successfully awakened all domain.
>>>
>>> User can very easily interpret it wrongly:
>>> ref = xe_force_wake_get(fw, ALL)
>>> if(!ref)
>>> error_handling;
>>>
>>> /* Irrespective of failure or success of xe_force_wake_get code reaches
>>> here */
>>> do_multiple_operations(); /* Needed all fw domains supported on GT to be
>>> awake */
>>>
>>> xe_force_wake_put(fw, ref);
>>>
>>> We have use-cases throughout the driver where user relies on success/
>>> failure for FORCEWAKE_ALL domains.
>>
>> maybe for more consistent API we should have two functions:
>>
>> a) pass only if _all_ domains are awake
>>
>> ref = xe_force_wake_get(fw, ALL_DOMAINS)
>> if (!ref)
>> return -EIO
>>
>> // no need for explicit checks
>> BUG_ON(!xe_force_wake_is_awake(fw, FOO_DOMAIN))
>> BUG_ON(!xe_force_wake_is_awake(fw, BAR_DOMAIN))
>> ...
>>
>> xe_force_wake_put(fw, ref)
>>
>> b) fails only if _all_ domains are not awake
>>
>> ref = xe_force_wake_get(fw, ALL_DOMAINS)
>> if (!ref)
>> return -EIO
>>
>> // must do explicit checks
>> if (xe_force_wake_is_awake(fw, SINGLE_DOMAIN))
>> ...
>>
>> xe_force_wake_put(fw, ref)
>>>
>>>>
>>>> and for the xe_force_wake_get(fw, ALL_DOMAINS) case, we can provide
>>>> helper function that will check if specified domain is really awake:
>>>>
>>>> ref = xe_force_wake_put(fw, ALL_DOMAIN)
>>>>
>>>> if (ref) {
>>>> xe_force_wake_is_awake(fw, SINGLE_DOMAIN)
>>>
>>>
>>> I assume this API is to check whether ref has domain in it or not. Will
>>> be good helper to have for such scenarios, Which are currently not there
>>> in driver.
>>
>> there could be two variants, one that checks fw
>>
>> xe_force_wake_is_awake(fw, SINGLE_DOMAIN)
>>
>> other that checks ref
>>
>> xe_force_wake_ref_has_domain(ref, SINGLE_DOMAIN)
>>
>> but the only benefit from the latter is that it could be lightweight as
>> otherwise
>>
>> BUG_ON(xe_force_wake_is_awake(fw, SINGLE_DOMAIN) ^
>> xe_force_wake_ref_has_domain(ref, SINGLE_DOMAIN))
>>
>>>
>>>
>>>>
>>>> xe_force_wake_put(fw, ref)
>>>> }
>>>>
>>>
>>> I believe a helper to determine success or failure of force_wake_get,
>>> instead of caller doing it will streamline all the domains and usecases.
>>>
>>> int xe_force_wake_get_status(enum domain, unsigned int fw_ref)
>>> {
>>> return (fw_ref == domain) ? 0 : -ETIMEDOUT;
>>> }
>>
>> I'm not sure we should provide error code, IMO simple bool function will
>> better and caller can decide what to do next
>
>
> -ETIMEDOUT is the actual error code for domain awake acknowledgment
> failure that is why I added it. I am fine with bool and letting caller
> decide.
>
> Thinking more about it, better not to have xe_force_wake_is_awake and
> rely on xe_force_wake_ref_has_domain.
but nested functions like do_operations() may not have access to the
fw_ref, so we should still allow to check domain status without the ref
> fw->awake_domain is global state and could be awaken from anywhere
> doesn't guarantee it is awaken by current force_wake_get(). Can be a
> misused.
true, so lets just define two helpers to cover all use cases:
bool xe_force_wake_ref_has_domain(unsigned int fw_ref, enum domain)
bool xe_force_wake_is_awake(struct xe_fw *fw, enum domain)
>
> Since you recommended using bool. How about ?
>
> bool xe_force_wake_ref_has_domain(enum domain, unsigned int fw_ref)
> { return (fw_ref & domain == domain) ? true : false ;
> }
you don't need ternary operator and fw_ref should be first:
bool xe_force_wake_ref_has_domain(unsigned int fw_ref, enum domain)
>
> works for all scenario:
>
> ref = xe_force_wake_get(fw, SINGLE)
> if(!xe_force_wake_ref_has_domain(SINGLE, ref)) -> failure
>
>
> ref = xe_force_wake_get(fw, ALL)
> !xe_force_wake_ref_has_domain(ALL, ref) -> failure
>
> ref = xe_force_wake_get(fw, ALL)
> if(!xe_force_wake_ref_has_domain(SINGLE, ref)) -> failure
>
>
> If you find this to be ok. Will define this helper and use it to check
> status of force_wake_get.
with fixed/added helpers it's fine with me
>
> BR
> Himal
>
>>
>>>
>>> Usecases:
>>>
>>> A) No error check, just continue in case of failure too.
>>>
>>> ref = xe_force_wake_get(fw, domain /* can be ALL_DOMAIN */);
>>> do_operations();
>>> xe_force_wake_put(fw, ref);
>>
>> yep
>>
>> and inside do_operations() we can always use xe_force_wake_is_awake to
>> check every domain
>>
>>>
>>> B) error check, abort in case of failure.
>>>
>>> ref = xe_force_wake_get(fw, domain /* can be ALL_DOMAIN */);
>>> int err = xe_force_wake_get_status(domain, ref);
>>> if(err) {
>>> xe_force_wake_put(fw, ref);
>>> return err;
>>> }
>>> do_operations();
>>> xe_force_wake_put(fw, ref);
>>
>> nay, too complicated, better:
>>
>> ref = xe_force_wake_get(fw, ALL);
>>
>> err = xe_force_wake_is_awake(fw, ALL) ? do_operations() : -EFATAL;
>>
>> xe_force_wake_put(fw, ref);
>>
>>>
>>> c) get all domain, but check specific domain:
>>>
>>> ref = xe_force_wake_get(fw, ALL_domain);
>>> if (xe_force_wake_get_status(domain, ref))
>>> dmesg_warn( "unable to awake all requested domain \n");
>>
>> IIRC in xe_force_wake_get() there is one WARN already
>>
>>>
>>> if (xe_fwref_has_domain(fw, SINGLE_DOMAIN))
>>> do_operations()
>>>
>>> xe_force_wake_put(fw, ref);
>>>
>>
>> so maybe simpler:
>>
>> ref = xe_force_wake_get(fw, ALL);
>>
>> err = xe_force_wake_is_awake(fw, FOO) ? do_operations() : -EMINOR;
>>
>> xe_force_wake_put(fw, ref);
>>
>>
>>>>> Otherwise, the operation is considered a failure, and
>>>>> + * the caller should handle the failure case, potentially returning
>>>>> + * -ETIMEDOUT.
>>>>> + */
>>>>> +unsigned int xe_force_wake_get(struct xe_force_wake *fw,
>>>>> + enum xe_force_wake_domains domains)
>>>>> {
>>>>> struct xe_gt *gt = fw->gt;
>>>>> struct xe_force_wake_domain *domain;
>>>>> - enum xe_force_wake_domains tmp, woken = 0;
>>>>> + unsigned int tmp, ret, awake_rqst = 0, awake_failed = 0;
>>>>> unsigned long flags;
>>>>> - int ret = 0;
>>>>> spin_lock_irqsave(&fw->lock, flags);
>>>>> for_each_fw_domain_masked(domain, domains, fw, tmp) {
>>>>> if (!domain->ref++) {
>>>>> - woken |= BIT(domain->id);
>>>>> + awake_rqst |= BIT(domain->id);
>>>>> domain_wake(gt, domain);
>>>>> }
>>>>> }
>>>>> - for_each_fw_domain_masked(domain, woken, fw, tmp) {
>>>>> - ret |= domain_wake_wait(gt, domain);
>>>>> + for_each_fw_domain_masked(domain, awake_rqst, fw, tmp) {
>>>>> + if (domain_wake_wait(gt, domain) == 0) {
>>>>> + fw->awake_domains |= BIT(domain->id);
>>>>> + } else {
>>>>> + awake_failed |= BIT(domain->id);
>>>>> + --domain->ref;
>>>>> + }
>>>>> }
>>>>> - fw->awake_domains |= woken;
>>>>> + ret = (domains & ~awake_failed);
>>>>> spin_unlock_irqrestore(&fw->lock, flags);
>>>>> + xe_gt_WARN(gt, awake_failed, "domain%s %#x failed to
>>>>> acknowledgment awake\n",
>>>>> + str_plural(hweight_long(awake_failed)), awake_failed);
>>>>> +
>>>>> return ret;
>>>>> }
>>>>> diff --git a/drivers/gpu/drm/xe/xe_force_wake.h b/drivers/gpu/drm/
>>>>> xe/xe_force_wake.h
>>>>> index a2577672f4e3..6c1ade39139b 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_force_wake.h
>>>>> +++ b/drivers/gpu/drm/xe/xe_force_wake.h
>>>>> @@ -15,8 +15,8 @@ void xe_force_wake_init_gt(struct xe_gt *gt,
>>>>> struct xe_force_wake *fw);
>>>>> void xe_force_wake_init_engines(struct xe_gt *gt,
>>>>> struct xe_force_wake *fw);
>>>>> -int xe_force_wake_get(struct xe_force_wake *fw,
>>>>> - enum xe_force_wake_domains domains);
>>>>> +unsigned int xe_force_wake_get(struct xe_force_wake *fw,
>>>>> + enum xe_force_wake_domains domains);
>>>>> int xe_force_wake_put(struct xe_force_wake *fw,
>>>>> enum xe_force_wake_domains domains);
>>>>>
>>>>
>>>
>>
>
More information about the Intel-xe
mailing list